When it comes to data sources, developers of analytics applications face new and increasingly complex challenges, including rising volumes of event data and streaming sources. Here in the early stages of this “stream revolution,” developers are building modern analytics applications that use continuously delivered real-time data. Yet while streams are clearly the “new normal,” not all data is in streams yet – which means the “next normal” will be stream analytics. It’s therefore incumbent on every data professional to understand the ins and outs of stream analytics.
In the new normal of streams, you’ll need to incorporate stream data into any analytics apps you’re developing – but is your database actually ready for it? While databases may claim they can handle streams, it’s important to understand their true capabilities. Simply being able to connect to a streaming source such as Apache Kafka isn’t enough; what matters is what happens after that connection is made. If a database processes data in batches, persisting it to files before it can be queried, that’s insufficient for real-time insights, which must arrive faster than batch loading allows.
A database built for streaming data requires true stream ingestion. You need a database that can handle high and variable volumes of streaming data. Ideally, your database should be able to manage stream data with native connectivity, without requiring a connector. Besides handling stream data natively, look for these three other essentials for stream analytics that can ready your analytics app for real time:
1. Event-by-Event vs. Batch Ingestion
Cloud data warehouses including Snowflake and Redshift – as well as certain databases such as ClickHouse that are considered high performance – ingest events in batches before persisting them to files where they can then be acted on.
This creates latency – the multiple steps from stream to file to ingestion take time. A better approach is to ingest stream data event by event, with each event placed into memory where it can be queried immediately. This form of “query on arrival” makes a big difference in use cases like fraud detection and real-time bidding that require analysis of current data.
The database doesn’t hold events in memory indefinitely; instead, they are columnized, indexed, compacted, and organized into segments. Each segment is then persisted both to high-speed data nodes and to a layer of deep storage, which serves as a continuous backup.
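To make “query on arrival” concrete, here’s a minimal Python sketch – a toy illustration under assumed names, not any particular database’s implementation. Each event becomes queryable the instant it’s appended to an in-memory buffer, and the buffer is later rolled into an immutable segment in the background; the event shape, the `SEGMENT_SIZE` threshold, and the `StreamTable` class are all illustrative assumptions.

```python
import time

SEGMENT_SIZE = 10_000  # illustrative threshold for rolling events into a segment

class StreamTable:
    """Toy event-by-event ingestion: events are queryable the instant they arrive."""

    def __init__(self):
        self.in_memory = []   # freshly ingested events, queryable immediately
        self.segments = []    # immutable batches, a stand-in for data nodes / deep storage

    def ingest(self, event: dict) -> None:
        # No batch step and no file write: the event is visible to queries on arrival.
        self.in_memory.append(event)
        if len(self.in_memory) >= SEGMENT_SIZE:
            # In a real system this background step would columnize, index, and compact.
            self.segments.append(tuple(self.in_memory))
            self.in_memory = []

    def count_where(self, key: str, value) -> int:
        # Queries see both older segments and events that arrived milliseconds ago.
        hits = sum(1 for seg in self.segments for e in seg if e.get(key) == value)
        return hits + sum(1 for e in self.in_memory if e.get(key) == value)

# A fraud-style check can run immediately after each event is ingested.
table = StreamTable()
table.ingest({"user": "u1", "action": "bid", "ts": time.time()})
print(table.count_where("user", "u1"))  # 1 -- no waiting for a batch load
```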
2. Consistent Data
Data inconsistencies are one of the worst problems in a rapidly moving streaming environment. While an occasional duplicate record won’t make or break a system, duplication becomes much more troublesome when it’s multiplied across the huge numbers of events arriving each day.
For this reason, “exactly once” semantics are the gold standard in consistency: the system ingests an event exactly one time, with no data loss and no duplicates. This may sound simple but isn’t easy to attain with systems that use batch mode. Achieving exactly-once semantics generally adds cost and complexity, as developers must either write complex code to track events or install a separate product to manage this.
Instead of placing this huge burden on developers, data teams need an ingestion engine that’s truly event-by-event – one that guarantees exactly-once consistency automatically. Since streaming platforms such as Kafka assign a partition and offset to each event, it’s key to use native connectivity that leverages those identifiers, confirming every message enters the database once and only once.
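As a rough illustration of how those partition and offset identifiers support exactly-once ingestion, here’s a hedged Python sketch using the confluent_kafka client. The broker address, topic name, consumer group, and `write_to_database` function are assumptions, and a production system would track ingested offsets durably rather than in an in-memory set.

```python
from confluent_kafka import Consumer

def write_to_database(value: bytes) -> None:
    """Placeholder for a durable write into the analytics database."""
    print("ingested:", value)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "stream-ingest",            # assumed consumer group
    "enable.auto.commit": False,            # commit only after a durable write
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])              # assumed topic name

ingested = set()  # (partition, offset) pairs already written; a real system persists this

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        key = (msg.partition(), msg.offset())
        if key not in ingested:             # drop duplicates caused by retries or rebalances
            write_to_database(msg.value())
            ingested.add(key)
        # Commit the offset only after the event is safely in the database, so a
        # crash replays events rather than losing them, and the check above keeps
        # replayed events from being ingested twice.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```

The point mirrors the paragraph above: the partition and offset that the streaming platform already assigns are enough to recognize and discard duplicates, so exactly-once behavior doesn’t require a separate deduplication product.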
3. Getting to the Right Scale
Data events are being generated by almost every activity. Each click by a human is an event, while many other events are machine-generated with no human initiation. The result is a huge increase in the number of events, with streams commonly containing millions of events per second. No database can keep up with this volume unless it can easily scale. A database must be able to not only query billions of events, but also ingest them at a feverish pace. The traditional database approach of scaling up is no longer a viable option – we now need to scale out.
The only proven way to handle massive and variable volumes of ingest and query is an architecture of components that are independently scalable. To meet technical demands, different parts of the database (ingest capacity, query capacity, management capacity) must be able to scale up and down as needed. These changes must be dynamic, adding or removing capacity with no downtime, while rebalancing and other administrative functions happen automatically.
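As a simplified sketch of why scale-out works – not a description of any specific database’s architecture – the Python below hash-partitions incoming events across a pool of independent ingest workers, so doubling the worker count roughly doubles ingest capacity; query capacity could be scaled separately on the same principle. The worker count, event shape, and `IngestWorker` class are illustrative assumptions.

```python
import hashlib

class IngestWorker:
    """One independently scalable unit of ingest capacity."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self.events = []

    def ingest(self, event: dict) -> None:
        self.events.append(event)

def route(event: dict, workers: list) -> IngestWorker:
    # Hash the event key so load spreads evenly; adding workers adds capacity.
    digest = hashlib.sha256(str(event["key"]).encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

# Scale out by adding workers instead of buying a bigger single machine.
workers = [IngestWorker(i) for i in range(4)]   # assumed starting capacity
for i in range(1_000):
    event = {"key": f"user-{i}", "value": i}
    route(event, workers).ingest(event)

print([len(w.events) for w in workers])  # roughly even load across the pool
```

In a real stream-ingestion database, adding or removing a worker would also trigger automatic rebalancing of existing data with no downtime; the sketch only shows why spreading load across independent units lets capacity grow linearly.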
The architecture should also recover automatically from faults and permit rolling upgrades. With this type of database that’s built for stream ingestion, data teams can confidently manage any and all stream data in the environment – even when that means handling billions of rows each day.
Streams are here. Be ready to work with them!