Cloud-native repositories for big scientific data

Ryan Abernathey; Tom Augspurger; Anderson Banihirwe; Charles C. Blackmon-Luca; Timothy Crone; Chelle Gentemann; Joseph Hamman; Naomi Henderson; Chiara Lepore; Theo McCaie; Niall Robinson; Richard P. Signell

doi:10.1109/MCSE.2021.3059437

Computing in Science and Engineering

https://doi.org/10.1109/MCSE.2021.3059437

Abstract

Scientific data have traditionally been distributed via downloads from a data server to a local computer. This way of working suffers from limitations as scientific datasets grow toward the petabyte scale. A “cloud-native data repository,” as defined in this article, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready, cloud-optimized (ARCO) data formats and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles by using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing’s full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.

Additional publication details
Publication type	Article
Publication Subtype	Journal Article
Title	Cloud-native repositories for big scientific data
Series title	Computing in Science and Engineering
DOI	10.1109/MCSE.2021.3059437
Volume	23
Issue	2
Year Published	2021
Language	English
Publisher	IEEE
Contributing office(s)	Woods Hole Coastal and Marine Science Center
Description	10 p.
First page	26
Last page	35