Amazon Web Services (AWS) recently published a case study about how the Allen Institute for Cell Science — founded by Microsoft co-founder Paul Allen to study how human cells work in health and disease — is taking new steps to make its data and metadata easy to access, search, and redistribute on the web for internal and external users.
The Institute has a lot of data: more than 7 terabytes and over 288,000 objects on Amazon S3, with more added every day. In 2018, it doubled the number of mouse cells in the Allen Cell Types Database, from just over 1,000 to slightly more than 1,900, and added new neuronal connectivity information to the Allen Mouse Brain Connectivity Atlas.
Its data resources — including large-scale 3D imaging data, predictive models of cell organization, gene-edited human stem cell lines, and data analysis tools — are publicly available and are accessed by thousands of researchers and other interested parties every year.
Here are some examples of how organizations and individuals make use of the resources:
- Researchers at Boston University’s Tissue Microfabrication Laboratory used the human stem cell data to study how sarcomeres — the structures that give heart muscle cells the ability to contract and pump blood — are generated in cardiomyocytes (heart cells).
- The National Eye Institute is using the organization’s cell collection to help develop therapies for vision loss caused by age-related macular degeneration.
- One high school’s cell biology curriculum incorporates the Institute’s 3D Cell Viewer, which students use to investigate any of 40,000 3D images of human stem cells.
The Institute’s data sets are available via the AWS Open Data Registry, which covers the cost of storage for publicly available, high-value, cloud-optimized datasets. But importing data and files to different systems was a struggle for users.
Get the Version Right
According to the Institute, a big part of the issue is data versioning, which is important for seamless reproducibility as users integrate with Data Science tools such as Python and Jupyter. Versioning keeps multiple variants of an object in the same Amazon S3 bucket and can be used to preserve, retrieve, and restore every version of every object stored there.
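The semantics can be sketched with a toy in-memory model — the `VersionedBucket` class below is a hypothetical illustration, not AWS or Quilt code: each PUT to the same key appends a new variant with its own version ID, and a GET returns the latest variant unless an older version ID is requested.

```python
import uuid


class VersionedBucket:
    """Toy in-memory stand-in for an S3 bucket with versioning enabled.

    Illustrative only: in real S3, versioning is switched on per bucket
    (via the PutBucketVersioning API) and version IDs are assigned by S3.
    """

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body)

    def put_object(self, key, body):
        version_id = uuid.uuid4().hex  # S3 generates opaque version IDs
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def get_object(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # plain GET returns the latest variant
        for vid, body in versions:
            if vid == version_id:
                return body  # older variants stay retrievable by ID
        raise KeyError(version_id)


bucket = VersionedBucket()
v1 = bucket.put_object("cells/img_001.tiff", b"raw scan")
v2 = bucket.put_object("cells/img_001.tiff", b"denoised scan")
```

Because the raw scan is never overwritten, a researcher who recorded `v1` can always reproduce an analysis against exactly the bytes they originally used.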
A recent survey on the future of data collaboration, conducted by Pulse Q&A, found that one-third of respondents said decision quality was frequently harmed because collaborators lacked complete access to the data. Sixty-five percent agreed that data version control will play an essential role in their decision quality, reproducibility, and auditing processes within the next five years by enabling everyone to look at the same data.
One-third of the executives questioned are also poised to use Amazon S3 to store critical data for machine learning, analytics, and decision support, according to the survey, which was sponsored by AWS Marketplace partner Quilt Data. As of now, Quilt’s open data store of the world’s public data in S3 encompasses 10.2 billion objects, 3.7 petabytes of data, and 25 S3 buckets, including Amazon’s Registry of Open Data — more objects than there are websites.
The company turns S3 buckets into versioned data sets for sharing, discovering, and modeling data at scale; data scientists can store and version their Jupyter notebooks and all their data dependencies and share their notebooks and machine learning projects in a reusable format. To summarize data visually, Quilt offers 20 types of interactive visualizations to embed inside large data sets.
It also gives users a view into large data sets without requiring a download, so people can work with them even on a mobile phone:
“There are disparate patches of data out there but coming together they tell a compelling story, and you get a more accurate picture of the world,” said Quilt co-founder Aneesh Karve in a recent DATAVERSITY® interview.
Quilt leaves data in the S3 bucket under the owner’s control, which is important for privacy and legal reasons, and automates AWS Identity and Access Management (IAM), Amazon’s tool for securely managing access to AWS services and resources.
Companies have other versioning options, too. They can do version control manually, or use CVS, Git, Mercurial, or Subversion (SVN) to track file changes and who made them, and to synchronize those changes so that all contributors can manage the same set of files. Dropbox includes version control capabilities as well. Also available is DVC, an open source version control system from DVC.org designed to handle large files, data sets, machine learning models, and metrics alongside code, making machine learning models shareable and reproducible. With some of these options, the amount of data people can share publicly for free is limited.
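The idea behind tools like DVC — keeping bulky data files out of Git by addressing them with a content hash, so only a short pointer needs committing — can be sketched in a few lines. The `snapshot` and `restore` helpers below are hypothetical illustrations of that approach, not DVC’s actual API:

```python
import hashlib
import pathlib
import tempfile


def snapshot(data_file: pathlib.Path, cache_dir: pathlib.Path) -> str:
    """Copy data_file into cache_dir under its SHA-256 hash.

    The bulky file lives in the cache (or a remote like S3); only the
    short hash string needs to be committed to Git.
    """
    contents = data_file.read_bytes()
    digest = hashlib.sha256(contents).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(contents)
    return digest


def restore(digest: str, cache_dir: pathlib.Path) -> bytes:
    """Fetch a file's contents back by its content hash."""
    return (cache_dir / digest).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    data = tmp / "cells.csv"
    data.write_text("cell_id,volume\n1,130\n")
    h = snapshot(data, tmp / "cache")
    data.write_text("cell_id,volume\n1,130\n2,145\n")  # dataset changes
    h2 = snapshot(data, tmp / "cache")
    assert restore(h, tmp / "cache") == b"cell_id,volume\n1,130\n"
```

Because each version of the dataset gets its own content address, checking out an old Git commit recovers the exact data it was built on — the same reproducibility property the Institute is after.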
The Allen Institute uses Quilt to version its scientific data, giving users a common frame of reference, and to automate labeling for its imaging collection with Quilt’s machine learning tooling. The Institute is working with Quilt and AWS to make more of its datasets available via the AWS Open Data Registry.
Organizations can use open.quiltdata.com for free, with unlimited publishing and discovery of public data, or they can select a monthly subscription option on Amazon or an enterprise option for a private ecosystem. Right now Amazon hosts the bulk of open datasets, which is why Quilt chose to partner with it first, but Quilt is multi-cloud friendly and, as the company grows, expects to become Azure- and Google Cloud-compatible.