One of the biggest pitfalls companies run into when establishing or expanding a data science and analytics program is the tendency to purchase the coolest, fastest tools for managing data analytics processes and workflows without fully considering how the organization will actually use them. Companies that simply chase speed can spend far more money than they need to and end up with a brittle data infrastructure that’s challenging to maintain. So the question is: how fast is fast enough? We’re always told that time is one of our most valuable and finite resources, but sometimes time is exactly what you have to spare.
A common misconception about data for machine learning is that all of it needs to be streaming and instantaneous. The data that triggers a model to act may need to arrive in real time, but most machine learning data doesn’t need an instant response. There’s a natural human tendency to choose the fastest, most powerful solution available, but you don’t need a Formula 1 race car to go to the grocery store, and the fastest solutions tend to be the most expensive, delicate, and hardware-intensive options. Companies should look at how often they make business decisions based on model outputs and let that cycle time dictate how quickly the data actually needs to move.
The phrase “real-time” is similar to “ASAP,” in that it can mean very different things depending on the situation. Some use cases require updates within a second, others within minutes, hours, or even days. The deciding factor is whether humans or computers are consuming the data. Consider a retail site showing similar items to a shopper: the site needs to analyze what the user clicked on and surface related products in the time it takes to load a web page. That data really does need to be evaluated in real time, like the data feeding a credit card fraud algorithm or an automated stock trading model: all computer-based decision models with little human input while the model is running.
For situations where humans are acting on the data, companies can save significant costs and resources by batch processing this data every hour or so. Sales teams reviewing their weekly status don’t need to know the exact second when someone asks for more information – they can get those updates after a few minutes of batching and processing (or even a few hours).
Real-time and batch processing aren’t mutually exclusive: sometimes a company needs instant, unvalidated data for a quick snapshot while a separate stream captures, cleans, validates, and structures the same data. A utility company’s data could feed several different needs. For customers monitoring their energy usage moment by moment, an unprocessed stream tracking real-time electricity usage is essential. The utility’s accounting system only needs to look at data every hour to correlate usage with current energy prices. And the data behind end-of-month billing needs to be thoroughly vetted and validated so that outlying data points or inaccurate readings don’t show up on customer bills. The broader the analysis and the bigger the picture, the more important clean, validated, and structured data becomes to the data science team.
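To make that split concrete, here is a minimal sketch in Python of how one stream of meter readings might feed all three consumers: a raw real-time view, an hourly accounting aggregate, and a validated monthly billing batch. The names (`MeterReading`, `monthly_billing_batch`) and the validation rule are illustrative assumptions, not any particular utility’s or vendor’s system.

```python
# Minimal sketch: one stream of meter readings, three consumers with
# different latency and validation needs. All names are illustrative.
from dataclasses import dataclass
from datetime import datetime
from collections import defaultdict
from statistics import median

@dataclass
class MeterReading:
    meter_id: str
    timestamp: datetime
    kwh: float

def realtime_view(reading: MeterReading) -> dict:
    """Unprocessed, per-event view for the customer dashboard:
    no validation, just pass the latest reading through quickly."""
    return {"meter": reading.meter_id, "kwh": reading.kwh, "as_of": reading.timestamp}

def hourly_aggregate(readings: list[MeterReading]) -> dict:
    """Hourly batch for the accounting system: sum usage per meter
    so it can be correlated with that hour's energy prices."""
    totals = defaultdict(float)
    for r in readings:
        totals[r.meter_id] += r.kwh
    return dict(totals)

def monthly_billing_batch(readings: list[MeterReading]) -> dict:
    """Monthly batch for billing: drop obvious outliers before totaling,
    so a bad sensor reading never lands on a customer's bill."""
    by_meter = defaultdict(list)
    for r in readings:
        by_meter[r.meter_id].append(r.kwh)
    cleaned = {}
    for meter, values in by_meter.items():
        m = median(values)
        # Crude validation rule, purely for illustration: discard readings
        # more than 10x the meter's median usage for the month.
        cleaned[meter] = sum(v for v in values if v <= 10 * m)
    return cleaned
```

The point is that each consumer gets the latency and level of validation it actually needs from the same underlying events, rather than forcing everything through the fastest (and most fragile) path.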
When companies are looking at how they use data to make decisions and evaluating if “real-time” is really necessary, there are a few steps to guide this analysis.
- Utilize outcomes-based thinking: Look at the process of data ingestion and analysis, how often a decision is made, and whether a computer, a person, or even a group of people is making that decision. This will tell you how quickly the data needs to be processed. If humans are part of the downstream actions, the whole process is going to take hours or even weeks, and making the data move a few minutes faster won’t have a noticeable impact on the quality of decisions.
- Define “real-time”: Which tools work well for this function, and what are your requirements in terms of familiarity, features, cost, and reliability? This review should point to two or three systems that cover your needs for both real-time and batched data. Then look at how these tasks map to the needs of different teams and the capabilities of different tools.
- Bucket your needs: Think about who the decision-maker is in each process, how frequently decisions are made, and the maximum latency the data can tolerate. Identify which processes need quick, unprocessed data and which need a more thorough analysis. Watch for the natural bias toward “racetrack” solutions, and frame the tradeoffs in cost and maintenance. Separating these needs may sound like more work up front, but in practice it saves money and makes each system more effective.
- Outline your requirements: Look at each stage of the process and figure out what you’ll need to extract from the data, how you’ll transform it, and where you’ll land it. Also look for ways to land raw data before you start transformations. A one-size-fits-all approach can actually add complexity and limitations in the long run. The Lambda architecture is a good example of the typical consumption journey: build a modern batch warehouse first, then add a real-time streaming layer later.
- Evaluate the complete latency/cycle time for processing data: Latency in data movement is only one contributor to the total time it takes to get results back; there is also processing time along the journey. Track how long it takes to log an event, process and potentially transform that data, run the analytics model, and present the results back. Then use this cycle time to evaluate how quickly you can (or need to) make decisions, as in the rough sketch after this list.
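As a rough illustration of that last step, the sketch below (in Python, with invented stage names and durations) adds up the latency of each pipeline stage and compares the total against how often a decision is actually made. If the decision cadence dwarfs the cycle time, batch processing is almost certainly enough.

```python
# Illustrative cycle-time math: sum the latency of every pipeline stage,
# then compare the total with how often a decision is actually made.
# Stage names and numbers are made up for the example.
from datetime import timedelta

stages = {
    "event logged -> ingested":       timedelta(seconds=5),
    "ingestion -> transformation":    timedelta(minutes=10),
    "transformation -> model run":    timedelta(minutes=15),
    "model run -> results presented": timedelta(minutes=5),
}

total_cycle_time = sum(stages.values(), timedelta())
decision_cadence = timedelta(days=7)  # e.g., a weekly sales review

print(f"End-to-end cycle time: {total_cycle_time}")
print(f"Decision cadence:      {decision_cadence}")

# If decisions happen far less often than the data can move, shaving
# seconds off data movement buys nothing.
if total_cycle_time < decision_cadence / 10:
    print("Batch processing is fast enough for this decision.")
else:
    print("This decision may justify lower-latency streaming infrastructure.")
```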
Managing all the requirements of a data science and analytics program takes work, especially as more departments within a company depend on the outputs of machine learning and AI. If companies can take a more analytical approach to defining their “real-time,” they can meet business goals and minimize costs – while hopefully providing more reliability and trust in the data.
Think of the difference between real-time and batched data as similar to how an Ops team works. Sometimes the team needs real-time monitoring to know as quickly as possible when an instance fails, but most of the time it is digging into analytics, examining processes, and taking a deeper look at how the company’s IT infrastructure is working: not when a single instance failed, but how often instances fail. That kind of analysis requires more context in the data.
Ultimately, one size does not fit all for data science. Engineering skills, qualified analysts, compute, and storage are all rare and valuable, and they should be used judiciously and effectively. For once, “time” can be the resource you have more of than you need.
The downside of relying on real-time everywhere is often failure: there are too many complexities, too much change, and too many transformations to manage across a complete pipeline. Research firm Gartner has said that between 60% and 85% of IT data projects fail. If a company wants to structure its entire data infrastructure around real-time, it needs a “Formula 1 pit crew” to manage those systems, and stakeholders may be disappointed by the high expense of a real-time program that exists to produce routine updates.
If a company looks at what’s most valuable in its data, which data needs immediate action and which is more valuable in aggregate, and how often it acts on that data, it can make the most of scarce people and systems, and avoid wasting resources by moving faster than the business can act.