Need for Speed, Part II: Real-time Systems Deep Dive
Updated: Feb 24
How fast are modern real-time analytics systems?
In our last blog, we summarized how to properly evaluate the speed of analytics systems. We introduced the concepts of response time, key latency metrics and what these metrics mean for your own data stack. We also shared performance test data demonstrating how Timeplus outperforms other real-time data analytics stacks across these key metrics. In today’s post, we’ll dive deeper to explain how we arrived at these performance test data results.
A performance view of stream event life cycle
The above diagram shows an abstracted view of what happens when a stream event is processed by a real-time data analytics system:
Raw events are generated/created/emerged at a data source, we label this time as T0. This time is what we call event time and this is when the event starts its life.
Raw events are collected and sent to the data analytics system or pulled by the data analytics system, the event has not changed. We label this time as T1. We can call it enqueue time (considering data ingestion using a queue).
Raw events are processed for internal storage, usually an indexer, file, database or memory. This process is usually called the data ingestion process. Raw events get transformed into whatever the system believes is the best format to be used later. We label this time as T2. This time is usually called index time.
The query processing loads indexed events and starts executing the query computing. We label this time as T3. This time is called processing time. One specific raw event might be involved in different queries, so the processing time is an attribute of a query.
The query processing is completed and the query result is ready and starts sending the result to the end user or downstream systems. We label this time as T4. This time is called emit time.
Finally the end user or downstream system received the query result. For a specific query, this is the end of the whole life cycle.
To make a real-time data analytics system fast or high performance, the most important attribute is latency. From the user’s perspective, the latency is defined as the difference between T5 and T0. From the analytic system’s point of view, what it can really contribute is the time difference between T4 (emit time) and T1 (enqueue time). The latency should be as low as possible to make the newly generated data involved in an analytic job without delay. To decrease this latency, a system should be designed as:
Fast data ingestion Preprocessing data is sometimes time consuming. When there is a long data ingestion time, even if raw data is ready, the data cannot be involved in any query. So a real-time system must be able to ingest data with super fast speed.
Fast data loading Some data analytic systems will build different indexes to help load data quickly by skipping unnecessary data reading. The compromise is to increase the indexing time (T2-T1). Other systems like Flink or Materialize, which do not index data, raw events are directly stored in memory and are consumed by query processing. The compromise is loss of persistence for event storage.
Fast query computing How to optimize query processing is handled by every database system, making the query processing fast will contribute to reduced latency.
To better understand how fast a real-time analytic system can be, I made some performance tests to evaluate the latency and throughput of different real-time analytic systems with different workloads. In the test, the latency is defined as T5-T0, which is an end-to-end latency that our test tool can easily observe.
Throughput is another important metric to evaluate: it shows how much data a system can process in a specific range of time. In this test, we use real-time queries provided by the analytic system to observe how many events are available during the observation to calculate the throughout. So it is also an end-to-end throughout, the data must be available to the observer to be counted as part of the throughout.
Usually, throughput and latency are correlated, higher throughput usually means high latency. If a system is optimized for low latency, it often comes with the compromise of throughput decreasing. So one of the important questions to ask for a real-time system is whether the latency will be increased significantly when the throughput increases.
In the test, we tried to solve the complexity and uncertainty issues to provide an objective benchmark result.
Modern data processing systems are complex, and there are many different configurations that will impact the system performance. It takes time to understand how the system works to make a fairness performance test. And it is usually impossible to test different configuration combinations, so the performance test is usually designed to just test the default configuration. To solve the complexity issue, we define workloads as different T-shirt sizes - S / M / L - and it can be easily extended to different types of workload types by configuration.
You cannot observe a system without impacting it. Similar to the uncertainty principle in quantum mechanics, when you observe a target system, it is inevitable that the target system will be impacted by the observing itself. To make a performance benchmark, whatever the technology you are using, like logging, metrics, inspection code, all of these will impact the system. Lucky data analytics systems - our target system - can be observed by the query, so we don't have to introduce extra impact.
This diagram shows how to design the performance test.
There are three core components involved in the test, event generator, target system and observer.
The event generator contains multiple concurrent writer threads that writing streaming data to the target system through the interface provided by the target system.
The source event is a json data of a simulated order event, here is a event sample:
The events are randomly generated, the only true data is time, the time field is the local time when the event is generated.
To control the volume of the workload, these parameters can be configured:
Batch size, how many events contained in each write
Interval, the wait time between two generated events in each writer thread
Concurrency, how many writer threads (in our case, it is goroutine) are created
Source replica, each generator instance has its own limitation, to get a high volume of data loads, multiple instances of generators are required, this parameter defines how many generator instances are created.
I chose three data analytic systems that support real-time data processing to make the evaluation: Splunk, Materialize, and Timeplus.
Splunk (NASDAQ: SPLK) is the data platform leader for security and observability. The core product Splunk Enterprise provide a real-time search which is defined as following:
A search that displays a live and continuous view of events as they stream into the Splunk platform. With real-time searches and reports, you can search events before they are indexed and preview reports as the events stream in. Unlike searches against historical data, time bounds for real-time searches continuously update. You can specify a time range that represents a sliding window of data, such as "data that has been received over the past 30 seconds." Splunk Enterprise uses this window to accumulate data, so you will see the data after 30 seconds pass.
Materialize is an open source SQL database with millisecond-level latency for real-time data. It is built on top of Timely Dataflow and Differential Dataflow,
Timeplus is a purpose-built streaming analytics platform that solves enterprises' need for easy-to-implement real-time analytics. Timeplus adapts a streaming-first architecture to redesign real-time analytics from ingestion to action, helping enterprises analyze massive sets of streaming data faster.
The observer records the performance metric by making an observation through the query provided by the data analytic systems.
For latency, a query is executed which simulates an anomaly detection use case. A different hit ratio is defined the probability of the anomaly event, to make it simple, we use the field `value` to control the anomaly, in my test, I defines value = `9` as an anomaly, so just configure the max value of that field will control the related hit ratio. A lower hit ratio means the query result will contain less events.
Once the observer gets the event from the query result, it will get the latency by calculating the time difference between the event time (T0 ) and current time (T5).
For the throughput, it is observed by running a query with a tumble or hopping window, for example how many events are there in the past 10 seconds, this value divided by 10 will be the observed throughput.
Following table shows the different queries we used to observe the latency and throughput
search index=main source="my_source" value=9 "earliest_time"="rt" "latest_time"="rt"
search index=main source="my_source" | stats count "earliest_time"="rt-10s" "latest_time"="rt"
select value, time from test where value=9
select '1' as a, count(*)/10 from test where mz_logical_timestamp() < timestamp + 10000 group by a
select value, time from test where value=9
select window_start, count(*)/10 as count from tumble(test,_tp_time, 10s) group by window_start
Timeplus Chameleon is a golang-based open source tool I used for this test. Please refer to the code for detailed information.
Test Data Configurations
Three different configurations (S: small, M: medium, L: large) with different data scales are tested to help understand how these systems work under different workloads.
Search Hit Ratio
The test is running on an AWS c5a.12xlarge EC2 instance.
Related software versions are:
All the target systems are running in docker without any resource limitation.
Refer to the code here about how the system is deployed.
Here is the test result about the throughput and latency.
Performance Data Analysis
Here are some key observations and analysis based on the test result:
Small (30K Write EPS):
All the stacks can reach 30K eps, but the latency data shows big difference
Timeplus can achieve less than 10ms latency which is 100x faster than Materialize and 1000x faster than Splunk.
One thing to mention here is that the max throughput observed from materialize shows a number bigger than 30K, as the data generation is 30k eps, so the query processing might be not accurate or the processing speed is not stable.
Medium (120K Write EPS):
Both Splunk and Timeplus can reach 110K+ eps throughout, while Materialize can only process 35K (29% of Write EPS) . Timeplus latency increased to 20ms which is still way faster than Splunk 1.4s (69x) and Materialize 1.5s (76x). It is interesting that Splunk latency is lower compared to small size configuration. This is due to the fact that as the hit ratio decreased 10 times, but the write EPS only increased 4 times, which means the search has a lower load which made some significant impact to spunk search latency.
Large (1.44M Write EPS):
Timeplus achieved an amazing performance of 1.15 million eps throughout, while Splunk can only have 134k (9% of Write EPS) and the Materialized throughput is even smaller than medium configuration with 13k only (1% of Write EPS)
With large amount of data lagged behind, the latency of Splunk and Materialize latency are still way higher than Timeplus (17x and 36x respectively)
Only Timeplus can handle 1 Million+ data processing in real-time with sub-second end-2-end latency (260 ms).
The above diagram visualizes how fast these different latencies and throughput are. We have a fixed volume of data with 1,000 batches and use a query to test when all these events are available to the observer. With small configuration, there is no difference among those stacks, with medium configuration, Materialize seems slower than the other stacks, with large configuration, only Timeplus can still keep working.
The test code is now open sourced; you can try out different configurations and different scenarios to test our product for your own considerations. In our test, Timeplus has shown stable performance across small workloads to high-volume workloads, with sub second latency and 1 million plus EPS.
Sound compelling? You can learn more about our latest product features by joining our Slack channel, or by directly registering for our private beta program. We look forward to solving your need for speed!