By Ravi Shankar.
Back in the 1980s, when databases first enabled us to store megabytes of data, we were enthralled because we could store what seemed to be large volumes of data in a single place. Since then, we have continued trying to collect our data in a single place for easier access. Are we employing the right strategy, or are we chasing a specter?
Originally, we thought that we had data repositories large enough to meet our needs, but we soon found ourselves with multiple databases and ERP systems, and once again we yearned for a single place to store all of our data. Finally, in the 1990s, we found it with the data warehouse and the operational data store.
So Far, So Good
However, in the 2000s, cloud systems and social media platforms presented new kinds of sources that couldn’t be accommodated by these traditional data warehouses or operational data stores. Cloud systems presented volumes that were simply not economical to store, and social media platforms presented unstructured data that could not be readily consumed.
Once again, there was a need for a single place that could store all of this data. This time was different, though, as we required something bigger than a data warehouse could handle, and in the 2010s it arrived: Big Data. Now, data lakes can store all the data from cloud sources, social media platforms, and even the data warehouse and operational data store.
But There’s a Problem
The data now has the potential to be stored in one place, but different lines of business tend to create data lake repositories that are suited for their own purposes. At the corporate level, data is still distributed across multiple data lakes, on-premises systems, and other cloud-based sources, and we are back to where we started.
Once again, do we need a bigger, all-encompassing repository? Do we need to establish universal processes for transforming all of the data so that it can be accessed by more diverse consumers? Possibly, but would any such process be sustainable as data volumes multiply exponentially, and as we need to consume this data with greater and greater speed?
We Need to Stop
We need to stop trying to collect all of our data and store it in a single monolithic place, because data volumes will only get bigger, and the speed of data will only accelerate. In short, we need to stop collecting data altogether, and start connecting to the data wherever it resides.
The Power of a Single Virtual Repository
The way to seamlessly connect to multiple distributed data sources is to establish a virtual data repository. A virtual repository doesn’t contain any actual data, but contains “pointers” to the data wherever it may be stored. Virtual repositories are enabled by data virtualization, which employs metadata to provide access to disparate data sources, abstracting consumers from the complexities of access. In fact, to consumers, all of the data appears to sit in a single repository. And because a virtual repository holds no actual data, it can “grow” as large as it needs to be.
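To make the idea concrete, here is a minimal, hypothetical sketch in Python. Two in-memory SQLite databases stand in for two physically separate sources (say, a sales system and a CRM), and the `VirtualRepository` class, its `register` and `query` methods, and the table names are all illustrative inventions, not any vendor's API. The repository stores only metadata pointers, never the rows themselves:

```python
import sqlite3

# Two independent "sources" stand in for physically separate systems.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
sales_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm_db.execute("INSERT INTO customers VALUES (1, 'Acme')")

class VirtualRepository:
    """Holds no data -- only metadata "pointers" mapping a logical
    table name to the connection that actually stores it."""
    def __init__(self):
        self._catalog = {}

    def register(self, table, connection):
        self._catalog[table] = connection

    def query(self, table, sql):
        # The consumer never sees which physical source answers.
        return self._catalog[table].execute(sql).fetchall()

repo = VirtualRepository()
repo.register("orders", sales_db)
repo.register("customers", crm_db)

total = repo.query("orders", "SELECT SUM(amount) FROM orders")
names = repo.query("customers", "SELECT name FROM customers")
```

From the consumer's point of view, `orders` and `customers` look like tables in one repository; adding a new source is just another `register` call, which is why the virtual layer scales without ever copying data.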
Architecturally, a single virtual data repository has benefits beyond scalability. Because all of the data is accessed through a single layer, a virtual repository provides an opportunity to apply security controls, or any governance controls that a company wishes to implement, across the entire infrastructure from a single point. And while keeping track of data source details on behalf of the user, abstracting users from this information, virtual repositories also track end-to-end data lineage.
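The single-point-of-control idea can be sketched as follows. This is a hypothetical illustration, not a real product's interface: the `GovernedLayer` class, the role table, and the audit log are assumptions made for the example. Because every request passes through one layer, an access rule and a lineage record can be enforced in exactly one place:

```python
# Hypothetical governance rules: which roles may read which logical tables.
ALLOWED = {"analyst": {"orders"}}

# A simple lineage/audit trail: who read which logical table.
audit_log = []

class GovernedLayer:
    """A single access layer; every query is checked and logged here."""
    def __init__(self, catalog):
        # catalog maps a logical table name to a callable returning rows.
        self._catalog = catalog

    def query(self, role, table):
        if table not in ALLOWED.get(role, set()):
            raise PermissionError(f"{role} may not read {table}")
        audit_log.append((role, table))  # end-to-end lineage record
        return self._catalog[table]()

layer = GovernedLayer({"orders": lambda: [(1, 9.5)],
                       "customers": lambda: [(1, "Acme")]})

rows = layer.query("analyst", "orders")   # allowed and logged
```

A request the role is not entitled to, such as `layer.query("analyst", "customers")`, is rejected before it ever reaches a source, which is the practical payoff of funneling all access through one virtual layer.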
For years, we’ve been chasing the specter of a single repository, and now it’s time to stop chasing the ghost. Instead, we need to realize that it’s more feasible to simply connect to the data rather than to collect it.