The creators of Apache Kafka have enabled millions of data engineers to better ingest and manage their streaming data. As the streaming ecosystem evolves, new challenges have emerged. Today, data engineers face two major problems in real-time data integration:
Data movement between source and destination does not happen in real time. Many data integration systems pull data on a polling interval, which adds latency between the source and the destination system.
Data analytics platforms need to support a huge number of different data sources. Back when I was an engineer on Splunk's data platform team, one of our yearly goals was to develop 50+ different data sources for the Splunk enterprise platform. That was a lot of different data source types, and it covered only part of what customers required.
Kafka Connect is the answer to this real-time data integration problem. Kafka Connect, an open-source component of Apache Kafka, lets users integrate Kafka with external systems and build scalable, fault-tolerant data integration pipelines that stream data between Kafka topics and other data sources or sinks.
Leveraging the streaming design of Apache Kafka, Kafka Connect pushes streaming data into target sinks in real time, so there is no need to wait for the next polling cycle. Data is synchronized to the target as soon as a change happens.
The other benefit of Kafka Connect is its simplicity and ease of use for building and managing data pipelines. It provides a standardized, extensible framework for connecting to various data sources and sinks, eliminating the need for custom integration code. There are currently more than 160 connectors available on Confluent Cloud. The strong community support and seamless integration with various components of the data ecosystem play a key role in the success of Kafka.
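To illustrate how little code a Connect-based pipeline needs, the sketch below registers a sink connector through Kafka Connect's standard REST API. The connector class name and endpoint used here are placeholders for illustration, not the exact Timeplus values:

```python
import json
from urllib import request

def build_connector_payload(name: str, topics: str) -> dict:
    """Build the JSON body Kafka Connect expects at POST /connectors."""
    return {
        "name": name,
        "config": {
            # Placeholder class name: substitute your sink connector's real class.
            "connector.class": "com.example.TimeplusSinkConnector",
            "tasks.max": "1",
            "topics": topics,
        },
    }

def register_connector(connect_url: str, payload: dict) -> int:
    """POST the connector definition to a Connect worker's REST API."""
    req = request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status

payload = build_connector_payload("timeplus-sink", "orders")
# register_connector("http://localhost:8083", payload)  # needs a running worker
```

No custom integration code is written for the source or sink itself; the framework handles partitioning, offsets, and retries once the connector is registered.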
Kafka and Kafka Connect solved the problem of transporting data from source to destination in real time. Nothing now blocks the analytics system from answering the question of what's happening right now, since all fresh data changes are synchronized.
3 Options to Solve the Real-Time Analytics Problem
In most cases, the target of data integration is an analytics system. Not all analytics systems are designed to handle real-time analytics and process data as changes happen. Instead, they are designed to ingest, index, scan, and process large volumes of historical data.
There are some modern data analytic systems designed for real-time, including:
Real-time OLAP Databases
By optimizing data ingestion and data processing, these OLAP databases provide low-latency real-time queries. Typical products include: Apache Druid, Apache Pinot, ClickHouse, Apache Doris, and Rockset.
Pros: low-latency responses for interactive ad-hoc queries
Cons: no stream processing capabilities, so users cannot continuously receive updated query results
Streaming Processing Systems
These systems process data as soon as it enters the system, producing continuous streaming analytics results. Typical products include: Apache Flink, Apache Spark Streaming, Kafka Streams.
Pros: continuous, always-up-to-date analytics results as new data arrives
Cons: there is no data storage, so users still need other systems to store the analytic results. These systems are also not designed to run ad-hoc queries; a new data processing job or pipeline must be created whenever a different query is introduced.
Streaming Databases
In addition to real-time OLAP databases and stream processing systems, there are databases that use streaming processing as their core data processing method to support real-time analytics. Typical products include: Timeplus, Materialize, RisingWave, ksqlDB.
Pros: they support streaming processing and have their own data storage
Cons: their capabilities for historical data analysis are usually quite limited
There are some reasons why I believe streaming databases can be a great solution for real-time analytics combined with real-time data integration.
High performance and low-latency streaming queries. Most streaming databases are built on incremental stream processing: the query keeps its computation state internally, which avoids repeating work on data that has already been processed. With continuous query processing, the latest analytic results are pushed to consumers in real time with very low latency.
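The incremental idea can be sketched in a few lines: instead of rescanning all rows on every query, the processor keeps a small running state and updates the result per event. This is a simplified illustration of the technique, not Timeplus internals:

```python
class IncrementalAvg:
    """Maintain a running average with O(1) state per update,
    instead of rescanning the full history on every query."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> float:
        # Each new event touches only the stored state, not past rows.
        self.count += 1
        self.total += value
        return self.total / self.count  # latest result, pushed downstream

avg = IncrementalAvg()
results = [avg.update(v) for v in [10.0, 20.0, 30.0]]
# results -> [10.0, 15.0, 20.0]
```

The state here is two numbers regardless of how many events have arrived, which is where the computation savings come from.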
Simple and easy to use. Stream processing is usually hard to use and manage; there are many problems to solve, such as event ordering, time windows, state management, flow control, backfill, exactly-once semantics, and fault tolerance. Streaming databases handle these problems transparently, and the end user only needs to know how to write SQL, which makes them a good option for users who don't want to manage complex stream processing configurations.
Lower TCO. By processing data incrementally, streaming databases save significant computing resources compared to scanning all the data from scratch. Streaming databases also typically expose SQL, which is data engineer-friendly: no extra engineering resources are needed to develop complex data processing applications, another way to save on costs.
At Timeplus, we build and expand upon the core Streaming Database concept. Timeplus is a real-time data analytics platform with a streaming database as its core engine. Timeplus provides all the benefits of streaming databases, and also enables developers and data analysts to:
Effectively combine real-time streaming data and historical data processing. (Refer to: How Timeplus Unifies Streaming and Historical Data Processing)
Access a single platform that covers the whole analytic lifecycle from data ingestion to action, including alerts and dashboards
Leverage rich analytic features with streaming SQL, such as data revision semantics and mutability support (refer to Unlocking Real-time Post-trade Analytics with Streaming SQL), semi-structured data support with schema-on-query, and flexible user-defined functions for extensibility.
From an architecture point of view, there is another benefit to combining Kafka Connect and Timeplus. When your data lives in a private data center or VPC with no public access to your Kafka cluster, Kafka Connect can push data outbound via Timeplus's REST API, so there is no need to expose your cluster. By contrast, adding a Kafka source in Timeplus requires a public network connection between Timeplus Cloud and the Kafka cluster.
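Because the sink side only needs outbound HTTPS, a push looks like an ordinary REST call made from inside the VPC. The sketch below builds such a request; the endpoint URL and payload shape are illustrative assumptions, not Timeplus's documented API:

```python
import json
from urllib import request

def build_push_request(ingest_url: str, events: list) -> request.Request:
    """Build the outbound HTTPS POST that carries events to the sink.
    The Kafka cluster itself never needs inbound public access."""
    return request.Request(
        ingest_url,  # illustrative ingest endpoint, not the real API path
        data=json.dumps(events).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_push_request("https://example.timeplus.cloud/ingest",
                         [{"id": 1, "amount": 9.5}])
```

Only an egress rule is required on the firewall, which is usually far easier to approve than opening the Kafka brokers to inbound traffic.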
The combination of Kafka (event streaming) + Kafka Connect (real-time data integration) + Timeplus (streaming database) represents a new way to handle real-time data analytics. Built around a streaming-first design, these systems naturally complement each other. When building your own real-time data analytics system, this is the architecture to consider.
Organizations seek real-time insights into what's currently happening with their data. To achieve this, data engineering teams can leverage the power of a real-time data architecture with tools like Kafka, Kafka Connect, and analytics systems. Timeplus is a high-performance engine designed for this purpose.
We are thrilled to announce the Kafka Connect plugin for Timeplus is now available on both Confluent Hub and Confluent Cloud. In Part 2 of this blog, we’ll share how to easily configure a Timeplus Sink Connector on Confluent Cloud, and move your data from a Kafka topic to Timeplus.