Click to learn more about author Gary Orenstein.
Introduction
Every day we see freshly created data streams. From new mobile apps, to new connected appliances and automobiles, to new business applications capturing information from all corners of the world, we are awash in data.
This dramatic trend is perhaps best illustrated by the skyrocketing valuations of the world’s most valued companies and the data they shepherd to grow their businesses. Witness Apple and the App Store, Google and its history of searches, Facebook and its social graph, and Amazon with its rich purchase history for every customer. Each of these companies has found a way to both generate and monetize its Data Corpus.
For the broader industry, how to harness data, determine what’s useful or simply digital dust, and make it actionable remains a defining technology and business challenge of our time.
Many companies across industries already have data collection and analysis models in place, but now is the time to evaluate how effectively these approaches impact the business.
Developing a Data Corpus
An individual company’s Data Corpus can come from a variety of sources:
- Existing Internal Data Assets: This may be data already stored within the enterprise, perhaps accessible to the whole company, perhaps only to a few individuals or teams. Today many companies seek to put historical data into a Data Lake, essentially low-cost storage for historical data, and make that accessible more broadly within the company.
- Existing External Data Assets: Not every Data Corpus needs to be built with only internal data. Consider Google, which in its early days built its own Data Corpus by scanning the web. Another option, common in the financial sector, is to purchase relevant historical datasets.
- New Data Streams: Unrealized and untapped data streams may be easily accessible, whether by collecting new data or by speeding up existing data collection. For example, a large retailer took its existing web statistics and, instead of batch processing them overnight, moved them to a real-time data pipeline. This gave general merchandise managers immediate visibility, so they could make intraday decisions instead of waiting overnight.
Assessing Data Corpus Value
Retaining data is not free; it carries both fixed and marginal costs. However, the costs of retaining data are generally far less than the value businesses can derive from it. For example, the first 50TB in the Amazon Web Services S3 service costs $1150 per month for regular access, and just $200 per month for low-frequency access in ‘Glacier’ storage. Yes, there are upload and download costs, which can inflate storage expenses, but the general idea is that storage is relatively cheap.
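To put those storage figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the two monthly prices quoted above; the decimal terabyte (1TB = 1000GB) and the helper names are assumptions made for illustration, and transfer costs are ignored.

```python
# Monthly prices quoted in the article for the first 50TB (USD).
STANDARD_MONTHLY_USD = 1150
GLACIER_MONTHLY_USD = 200
CAPACITY_TB = 50
GB_PER_TB = 1000  # assumption: decimal terabytes


def per_gb_month(monthly_usd, capacity_tb=CAPACITY_TB):
    """Convert a monthly bill for a given capacity into a per-GB monthly rate."""
    return monthly_usd / (capacity_tb * GB_PER_TB)


def annual_cost(monthly_usd):
    """Project a flat monthly bill over twelve months."""
    return monthly_usd * 12


standard_rate = per_gb_month(STANDARD_MONTHLY_USD)
glacier_rate = per_gb_month(GLACIER_MONTHLY_USD)

print(f"Regular access:       ${standard_rate:.3f} per GB-month")
print(f"Low-frequency access: ${glacier_rate:.3f} per GB-month")
print(f"Annual cost, regular access:       ${annual_cost(STANDARD_MONTHLY_USD):,}")
print(f"Annual cost, low-frequency access: ${annual_cost(GLACIER_MONTHLY_USD):,}")
```

Under these assumptions, keeping 50TB readily accessible for a full year comes to $13,800, which is the number to weigh against the value the data can deliver.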
The larger question is making use of that data. A quick survey of data users and constituents within a company can help answer the question. Most large companies have dozens or sometimes hundreds of business intelligence analysts. Would they benefit from access to more data, new data streams, and fresher data? The answer is almost always yes.
Delivering Value from the Data Corpus
Value from a Data Corpus comes from enabling new insights and applications. Take the simple example of Google and its suggestions based on searches. Once Google attracted the vast majority of web searches, it could build a database of the most frequent requests, so when you begin a search request on Google, it autocompletes.
Figure 1: The ‘autocomplete’ feature in Google
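The idea behind this kind of suggestion feature can be sketched in a few lines of Python: record every query, count how often each one occurs, and return the most frequent queries that start with what the user has typed so far. This is a toy illustration only, not a description of how Google implements autocomplete; the `QuerySuggester` class and its method names are hypothetical.

```python
from collections import Counter


class QuerySuggester:
    """Toy suggestion engine backed by a frequency count of past queries."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query):
        """Add one observed search query to the corpus."""
        self.counts[query.strip().lower()] += 1

    def suggest(self, prefix, limit=3):
        """Return up to `limit` past queries starting with `prefix`, most frequent first."""
        prefix = prefix.lower()
        matches = [(q, n) for q, n in self.counts.items() if q.startswith(prefix)]
        # Sort by descending frequency, then alphabetically to break ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]


suggester = QuerySuggester()
for past_query in [
    "data lake", "data lake", "data lake",
    "data warehouse", "data warehouse",
    "data corpus",
    "machine learning",
]:
    suggester.record(past_query)

print(suggester.suggest("data"))
print(suggester.suggest("ma"))
```

The more searches the corpus captures, the better the ranking becomes, which is why the volume of data collected translates directly into a better product.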
Financial Sector Example
Every large financial institution needs to track assets to ensure it is operating within set corporate and government compliance regulations. Essentially if the bank can see that it is well within its compliance window, it can make more aggressive investments, with the potential for a higher rate of return. If it is near or at the compliance threshold, the bank must maintain more conservative investments and in turn, receive a lower rate of return.
Without a single Data Corpus that holds all the data needed to generate these reports, the bank is flying blind. By incorporating disparate systems into a large real-time Data Warehouse, banks can get a collective lens on all operations and, with real-time feeds, ensure that the information is up to date within the day instead of overnight.
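As a simplified sketch of that collective lens, the Python below combines positions from several source systems into one total and compares it with a compliance limit. Everything here is hypothetical: the system names, the figures, the single numeric limit, and the 10% buffer are assumptions chosen for illustration, and real compliance rules are far more involved.

```python
def total_exposure(positions_by_system):
    """Sum positions across every source system feeding the warehouse."""
    return sum(sum(positions) for positions in positions_by_system.values())


def compliance_headroom(positions_by_system, limit):
    """Return how far total exposure sits below the compliance limit."""
    return limit - total_exposure(positions_by_system)


def investment_posture(positions_by_system, limit, buffer_fraction=0.10):
    """Suggest a posture based on whether headroom exceeds a safety buffer."""
    headroom = compliance_headroom(positions_by_system, limit)
    return "aggressive" if headroom > buffer_fraction * limit else "conservative"


# Hypothetical feeds from two systems, in arbitrary units.
feeds = {"trading": [40, 25], "lending": [15]}
LIMIT = 100

print(compliance_headroom(feeds, LIMIT), investment_posture(feeds, LIMIT))

# A new position arrives intraday; the combined view changes immediately.
feeds["lending"].append(15)
print(compliance_headroom(feeds, LIMIT), investment_posture(feeds, LIMIT))
```

The point of the example is the timing: with a real-time feed, the shift from one posture to the other is visible as soon as the new position lands, not in the next morning's report.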
Achieving Digital Transformation with Real-Time Feedback Loops
The most successful companies take the data corpus and apply a combination of Data Science and Machine Learning to drive insight back into the Data Corpus, as shown in Figure 2.
Figure 2: Bringing applications and Data Science together for real-time Machine Learning. Source: MemSQL
The original beneficiaries of the Data Corpus were applications, essentially the ‘actors’ of the technology industry. These applications generated a moderate amount of data and supported a fixed set of interactions between the data and the application.
With applications and devices driving larger volumes of data, we brought in operators to apply Data Science to enhance experiences across everything from enterprise software to mobile apps. When actors and operators are brought together, real-time Machine Learning can be applied to drive new knowledge back into the business. But the real magic takes place when a feedback loop is developed to enrich the experience, as shown in Figure 3.
Figure 3: Building a feedback loop for the data corpus to drive Digital Transformation. Source: MemSQL
Going Big with the Data Corpus
There is no question data is the new fuel for business. Companies now need to tackle data challenges with solutions that can provide the ability to monetize the data. With that in mind, enterprise architects should investigate:
- Database and Data Warehouse solutions that can store large volumes of both historical and real-time data
- Data stores that can ingest data in real-time while providing the ability to query that data
- Solutions that can incorporate real-time Machine Learning scoring, and the ability to embed Machine Learning functions in the datastore
- Datastores that have both a transactional capability to capture events and an analytical capability to provide real-time insights. This provides a full solution with fewer overall systems
We are likely to see a focus on the Data Corpus for some time to come. As one prominent Data Scientist and AI startup CEO noted: