Need for Speed: Evaluating Real-time Analytics Systems
Updated: May 12
Throughout history, the winning bet has usually been on the faster beast or organization. Those who can't keep up? Relegated to the dustbin of history.
One of my favorite examples of how "speed wins" comes from Nathan Rothschild, the renowned Anglo-German financier. In 1815, Napoleon was famously defeated at the Battle of Waterloo by the Duke of Wellington, marking the end of the Napoleonic Wars. Ever the savvy financier, Rothschild saw an opportunity and used speed to exploit it. The Rothschilds had built an efficient information network and knew that Napoleon had lost at Waterloo a full day before the government did. Rothschild leveraged this time and information advantage to buy up the government bond market, selling two years later for an enormous profit. So in a very real sense, the Rothschild family's dynastic wealth was created in no small part by a low-latency data system, one that let Nathan make bold decisions faster than other financiers. A win for the Rothschild family and for lovers of low-latency analytics alike!
But what does it mean to be "fast" or "real-time" today? We all treat latency as a key benchmark, but how should latency be defined? We believe the diagram below illustrates it well:
In the age of big data, the value of BI and analytics systems is to turn data into insight and wisdom. We therefore define latency as the delay between when data is generated and collected and when it is processed and ready for decision making. A real-time analytics system's value is to make that process happen with minimal delay. This definition contrasts with the traditional standard for measuring latency, query response time, which only measures how long it takes to process data that has already been stored. As such, query response time doesn't account for the time between initial data generation and the moment the query is actually run. Depending on the query schedule, that could be minutes, hours, or even days later, and this additional lag is expensive for enterprises. As Nathan Rothschild's story shows us, speed wins. But what's the best way to achieve speed in your own analytics system?
Evaluating Real-time Analytics Systems
To understand which real-time analytics system will perform best for you, it's important to dig deeper into how a vendor chooses to define latency, as there is no universal definition. In his book Streaming Data, Andrew G. Psaltis categorizes real-time systems as hard, soft, and near real-time, with latencies ranging from microseconds to minutes:
Query Response Time
In a traditional OLAP or data warehouse system, latency is usually thought of as query response time: the delay between when a query starts and when its result is returned. When you evaluate how fast an OLAP system is, response time gives you a good answer. It measures how fast the system can process a specific analytic request.
The response time here is the difference between the time the analysis starts and the time its result is ready. A short response time is helpful for dashboards and ad-hoc queries. However, this kind of 'latency' says nothing about the 'pre-query' process between the moment an event happens (t0) and the earliest moment its data is available for query. Some systems even make this pre-query process heavy and slow, with ETL, pre-computation, multiple indexes, and so on, which results in longer end-to-end latency. The query ends up operating on 'old' data from the past, and this lag risks dramatically reducing the value of the data for real-time use cases.
In a real-time streaming analytics system, latency is measured differently. Since the query runs in streaming mode over unbounded data, with no defined start or end, response time is not a meaningful measure. Instead, latency is defined as the delay between the generation of an event and the moment that event has been processed and emitted as an analytic result.
The event latency here is the difference between when the event happens and when the analytic result is available (t1 - t0). This latency tells you how long it takes before a decision can be made once an event occurs.
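The contrast between the two measures can be sketched in a few lines of Python. This is a minimal illustration, not a measurement of any real system; the sleep durations are invented stand-ins for actual pipeline stages:

```python
import time

t0 = time.monotonic()            # event happens at the source
time.sleep(0.05)                 # ingestion / ETL / indexing ("pre-query" work)
query_start = time.monotonic()   # user (or a scheduler) issues the query
time.sleep(0.02)                 # query execution
t1 = time.monotonic()            # analytic result is available

response_time = t1 - query_start  # what OLAP benchmarks usually report
event_latency = t1 - t0           # end-to-end latency that real-time decisions depend on
print(f"response time: {response_time:.3f}s, event latency: {event_latency:.3f}s")
```

Note that the response time looks excellent even though the event latency is several times larger, which is exactly the gap the pre-query process hides.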
For a real-time system, it is important to enable immediate action when a specific event happens, so keeping event latency low is essential. In a traditional OLAP system, a user who wants analytic results to be ready as soon as events happen has to run the query continuously, usually as a scheduled query, say, once every 5 minutes. Assuming anomaly events are uniformly distributed in time, the expected latency is about 2.5 minutes, half the scheduling interval. To lower latency, the user can shorten the interval, but OLAP queries are usually heavy, and running one frequently is complicated and expensive. In a streaming-based real-time system, events are processed incrementally and the query keeps running, so costs stay low while still achieving ultra-low-latency analytics.
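The "half the interval" expectation is easy to check with a quick simulation. The function below is a hypothetical sketch: events arrive uniformly at random within a scheduling interval, and each one waits until the next scheduled query run picks it up:

```python
import random

def mean_scheduled_latency(interval_s: float, n_events: int = 100_000) -> float:
    """Simulate events arriving uniformly at random within a scheduling
    interval; each event waits until the next scheduled query run."""
    total_wait = 0.0
    for _ in range(n_events):
        arrival = random.uniform(0.0, interval_s)  # offset within one interval
        total_wait += interval_s - arrival         # wait until the next run
    return total_wait / n_events

random.seed(0)
# A 5-minute (300 s) schedule yields an expected latency of ~150 s (2.5 min).
print(mean_scheduled_latency(300.0))
```

Halving the interval halves the expected latency, but it also doubles how often the heavy OLAP query has to run, which is the cost trade-off described above.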
There are two main camps that use different strategies to solve this problem: OLAP and streaming processing.
When talking about analytics systems, people often think first of OLAP or data warehouses. But these traditional batch-based stacks are not designed for real-time processing. To lower query response times, pre-aggregation is used, but pre-aggregation usually means higher ingestion latency: the system has to spend more time and space ingesting and storing the data, which makes the data available for analysis more slowly. Recently, however, so-called 'real-time' OLAP systems such as Apache Pinot, Apache Druid, and ClickHouse have emerged, which try to improve query response time by optimizing the system design and speeding up data ingestion.
Still, these "real-time" OLAP systems have limitations. Data might be ingested faster, but overall processing is still batch-based and index-centric, and queries are triggered by the user. So when new data is generated, how soon it informs a business decision depends entirely on when the pre-query processing finishes and when the user gets around to asking the question. In practice, this takes longer than users expect.
Offerings such as Apache Flink, Apache Spark Streaming, and ksqlDB use stream processing techniques, continuously processing data as it arrives to achieve real-time analytics.
With a stream processing system, the query runs continuously. As soon as data is generated and enters the system, it is processed and the analysis results are pushed to the user immediately.
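This push-based, incremental model can be illustrated with a toy running-average operator. This is a sketch of the execution model only, not any particular engine's API:

```python
from typing import Iterable, Iterator, Tuple

def running_average(events: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Maintain a count and sum incrementally, emitting an updated
    result as soon as each event arrives -- no batch scan, no re-query."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield count, total / count  # result is pushed immediately

for count, avg in running_average([10.0, 20.0, 30.0]):
    print(f"after {count} event(s), average = {avg}")
```

Each incoming event does a constant amount of work and produces a fresh result right away, which is why event latency stays low no matter how much data has already flowed through.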
The limitation of stream processing is that these systems usually do not persist data for analytics purposes. To correlate with historical data, they either need large internal or external storage for state stores, or they simply cannot support it: Spark and Flink do not store data themselves, Materialize stores data in memory, and ksqlDB stores data in Kafka and state in RocksDB. Furthermore, analyzing massive time-series event streams across both the latest and historical windows, for example to compare or correlate fresh data with past data, or to enrich a fresh event with historical analysis results, is a poor fit for an event-only design.
So what is the difference between these two camps? I believe the key difference is how quickly an analysis result is available for business decisions once new data is generated. OLAP supports a human-driven decision-making process: data is stored in a database, and a human decides when to ask the question. With a stream processing system, decisions can be made continuously, as soon as new data is generated.
So which system to choose?
OLAP: If your decision making is driven by humans, the system is not super latency-sensitive, and you have a huge amount of historical data, an OLAP system is the better choice.
Streaming Processing: If you want data to drive your decision making and are focused on fresh data, with less need for historical analysis, stream processing is the better choice.
...But What If I Need Both Historical + Streaming Functionality?
For a number of years, however, my developer friends have told me that what they really want is one converged real-time analytics platform: one that supports ultra-low-latency, data-driven decision making, but also additional analytic capabilities such as connecting to historical data, enriching data, and analyzing multiple tiers of data by time.
I fully agree! That's exactly why I co-founded Timeplus.
Timeplus is a purpose-built streaming analytics platform that solves enterprises' need for easy-to-implement real-time analytics. Timeplus adopts a streaming-first architecture to redesign real-time analytics from data to action, helping enterprises analyze massive sets of streaming data faster. Timeplus has an embedded native log storage engine that persists both fresh and historical data under the hood, making all kinds of analytics available to users. Timeplus has helped our customers build many different real-time analytics applications, including real-time pricing for financial use cases, real-time security monitoring for enterprise communications, and real-time anomaly detection. Timeplus provides ultra-low end-to-end latency with scalable, high throughput.
The diagram below visualizes an end-to-end event latency comparison, run on my MacBook, showing how fast a real-time data analytic job can be completed by various data stacks, including Splunk, Materialize, and Timeplus:
The T-shirt sizes S/M/L mark different volumes of real-time events per second sent to each stack; the horizontal bars show how fast each stack can process that data and make the analytic result available. When the workload is small, all stacks perform similarly. As the workload increases, Timeplus is 3-10X faster at M size and 10X+ faster at L size than the other stacks, without losing any data. (We will publish a follow-up blog with more details about this performance benchmark and the test results.)
About the Author
Gang Tao is a co-founder and CTO of Timeplus. He lives in Vancouver and is an avid fan of AC Milan. Feel free to send any questions or comments to: firstname.lastname@example.org