With the recent rapid evolution of real-time data sources, such as IoT, sensor technologies, wireless communications with 5G networks, powerful mobile devices, and electric vehicles, a more efficient way to process high-speed and real-time data stream is required. Users need a system that can process, detect, and predict real-time information with the lowest latency and highest throughput.
In this post, we'll guide you through some of the core principles that we believe defines successful real-time analytics architecture, and how we've implemented those principles at Timeplus. But first, some context.
The team at Penny Stocks Lab has designed an interactive infographic that visualizes what's going on in the virtual world, every passing second - from YouTube videos to Google searches, from Instagram likes to every email sent. You can see that every second, 20+ TB of data will be generated. Data generation keeps accelerating, yet most data analytics systems haven't kept up.
The value of data is highly correlated to the age of the data, or the "freshness" of data. The value of data starts to decay quickly. According to Pinterest, based on the data observed by their data analytic system, more than 98% of queries are on data age within 35 days. To fully leverage the value of the data, the data processing needs to be in real-time.
The problem is: most companies haven't invested in real-time data analytics the way Pinterest has. Most companies have traditional data analytic systems that were built to provide insights into what has happened in the past. They weren't built to answer the question of what is happening at the moment, and give real-time analysis results as data arrives at the system.
So how to build powerful real-time data capabilities that can keep up with your customers' needs? Here are some core principles to consider:
Data is processed as a stream and never stops
New data is always being generated, and in most cases, the data source will never stop generating new data while you are running your analytic tools. A streaming processing system treats all data as an unbounded, immutable, fast-changing event stream.
As stream data is unbounded, the analytic is running when the data enters the system, with low latency incremental processing, the analytic result is sent to the user in real-time.
Here is an example of a streaming process for a simple analytic case where the user wants to know the total value of a number stream. the event in the stream contains a single value which could be sensor data, the event data is generated in real-time and it won't stop. The streaming process engine continuously runs the SUM operator and keeps posting the processing result to the user.
Time is essential
Event time is the most important attribute of streaming data
Time is the most important characteristic of a stream. To be specific, the Time here is the Event Time of each event.
Event time is the time when the event is generated. As we mentioned before, the event data has its life cycle, newly generated data are usually more valuable than aged data. Here you can take event time as the birthday of the data.
Real-world streaming data is usually non-perfect which means the event may enter the system with a long or short delay, events are coming out of order, the event time marked by the original system may have a different clock time than the processing system, all of these will make the time related processing more challenging. A streaming analytic system has to handle these cases.
From the above samples, you can see due to the network transmission, the event enters the system in a different order than the original event generating order. So the analytic system has to wait for the late event, and usually, the system does not know when these late events will arrive.
With unbounded streaming analytics, users may still want to have an aggregated analysis result from time to time, using window-based aggregation is a common tool, such as tumbling or hopping windows. Event time is what the analytic system used to decide a specific event belongs to which window.
As different aged data has different values, the analytic system wants users to fully leverage the value of newly generated fresh data, so high-performance streaming storage is required to store fresh data. While the aged data might also be helpful in some cases, for example, what if the user wants to compare the current trend with what happened at the same time last year. Both aged data and fresh data are required, so the data might be stored on different layers according to their age. So Event time will decide how to store the stream data.
Streaming data is dynamic, where the schema is flexible and prune to change
A lot of streaming data sources do not follow a fixed schema. For example, JSON data is widely used as streaming data where fields might be added/removed from time to time. Avro or Protobuf can be used to support schema version evolution where data might switch to a new version of the schema. Needless to say, the log case where the plain text-based log has no schema at all.
Traditional data warehouses heavily rely on ETL tools to make data well prepared before analysis, this will make the data ingestion process heavy and time-consuming. In an analytic system, such a step means the data will not be available for analysis with low latency requirements since ETL will take time and event will not be available to use right after entering the system.
The above diagrams shows the concepts of flexible schema, there are three sensor data sources generating data with fields of id, location, time, temperature, humidity, speed. From time to time, the sensor might be upgraded to a new version, where new fields might be included in the events, the event enters the system might contain different fields at different time.
A well-designed streaming system should handle the dynamic schema smartly, which means the user does not need to worry too much about the schema change.
Timeplus: a purpose-built real-time streaming analytics platform
At Timeplus, we spend a lot of time on "first principle" thinking. It took a lot of work, but we're proud to say we've built a time-series data analytics platform that really adheres to these core principles, while delivering a user experience that is intuitive to a broad range of data analyst users. Some key attributes of our platform:
Timeplus is an end-to-end “streaming” data analytics platform, where data is taken as an unbounded stream and analyzed using streaming SQL. The streaming SQL will be continuously running, providing analysis results as soon as the event stream enters the system.
Timeplus takes “time” as a first-class citizen. Window-based aggregation like tumble and hop are supported by streaming SQL. Leveraging watermarks, late events are well processed. Data is stored on different layers according to the data age.
Timeplus supports flexible “schema” with different strategies, users can choose to do information extraction at run time using streaming SQL or enable dynamic schema evolution.
Here's a diagram of our overall Timeplus architecture:
In a recent performance test, with a single commodity machine, Timeplus has shown strong performance with:
4 milli-seconds end-to-end latency
10 million events per second throughput
Here are some of the key design features that explain why Timeplus can achieve such a good performance:
Timeplus stream ingestion engine turns stream data into a private data format called TFF (Timeplus file format) which is a column-based data storage format, designed for high performance stream data serialization and deserialization.
TFF is stored on a high performance stream storage called TNL (timeplus native log), similar to Apache Kafka which provides high-speed data storage and access with low latency, durability, and scalability, but it is highly optimized for stream data analysis with a super low footprint. Stream data is stored at different layers according to the data age, hot data is kept in memory to make fast access possible. Recent data is stored in TNL with high performance sequential reads. And the aged old data is written to disk partitioned by event time to make sure when it is required, it can still be accessed with relatively high performance.
Stream SQL Query is running in Timeplus’s query engine, with TFF’s column-based data format, streaming data is vectorized to make analytics workloads, such as aggregation, super fast by leveraging modern computing technology such as SIMD (Single Instruction Multiple Data).
Timeplus core engine is written in native C++ and has made a lot of optimization for incremental operations and memory management, since stream processing requires incremental computing and lots of memory operation to manage stream state, with these optimizations, there is no GC issue which means latency is guaranteed without fluctuation.
We will share more details on our platform's technical features in future articles, so stay tuned!
Have you been struggling to find the expertise, time and budget to build your team's real-time analytics? Don't worry! We're here to help. Try our fast, powerful, and intuitive streaming-data analytics platform by joining our beta. BTW, if you are a software developer and share our passion for solving interesting/complex streaming problems, come join us! For details, see: https://timeplus.com/careers.
Life is short. Make your data analytics faster!