Updated: Oct 5
Timeplus co-founders, Jove and Gang, delivered a talk at Current 2023, sharing why and how to query data in Kafka with SQL. Here is the text version for your reference. To see a recording of our talk and get a copy of the presentation slides, sign up here.
If you are a coffee lover like me, you probably know different kinds of coffee recipes are just different ratios of espresso, steamed milk, water, ice, etc.
Just like how there are so many different types of coffee, there are so many different ways to query Kafka data with SQL. Some can be challenging to master and are tricky to use, just like the art of perfecting and savoring espresso. Some are intuitive and easy to use, with beautiful web UI and built-in historical storage, just like drinking a beautifully-crafted and delicious coffee latte.
In this talk (blog), we will be using this analogy between coffee and SQL on Kafka. You can basically replace whenever I say "coffee" with "SQL on Kafka".
Let me show you a report I stumbled upon:
A few takeaways:
A lot of data is being generated.
30% of data is generated in real-time.
Kafka is the leading data streaming platform.
Even so, only 1% of real-time data is analyzed in a timely fashion.
Why SQL on Kafka?
Sure, you can process data in Kafka without SQL, just like you can have a productive morning without drinking coffee.
You just need to call the vanilla Kafka Java API (wow, this sounds like coffee), and handle state management and caching by yourself. To apply JOIN on multiple topics, you also need to have a good understanding of concurrency, memory management, etc.
Is it doable? Yes, but it can be challenging, and frankly, boring. Today, BIG data is rarely processed without SQL, so why should you process LIVE data without SQL?
Here's why you should process Kafka data with SQL:
Reliable: Worry less about out-of-memory, deadlock, or crashes when you choose a SQL engine to process data in Kafka. It’s no longer your problem – leave it to the open source or commercial vendor.
Faster: There is no index in Kafka topics, so even a simple filter or single data source aggregation can be much faster if we run it as SQL in a streaming database.
Easy: It's so easy, even a kid can analyze data in Kafka with SQL. I asked a high school student to write SQL to process Kafka data and she did well.
Powerful: You can run sophisticated processing logic, not just simple filtering or flat transformations. You can join data across multiple topics or even multiple clusters, you can handle late events easily, or even apply machine learning with SQL.
Descriptive: When you process Kafka data with SQL, you don’t need to learn the low-level Kafka API. You focus on “what to do”, rather than “how to do it”. SQL is already an industry standard, and there are efforts to refine the standard to better support stream processing. Your SQL on Kafka can probably be easily ported to run on other message buses, such as Apache Pulsar or Amazon Kinesis.
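To make the “what, not how” point concrete, here is a minimal sketch of a streaming filter. The `payments` topic and its columns are hypothetical, and the exact DDL for attaching SQL to a Kafka topic varies by engine:

```sql
-- Flag large payments as they arrive; the engine handles the consumer
-- loop, state management, and recovery behind the scenes.
SELECT user_id, amount, event_time
FROM payments
WHERE amount > 1000;
```

The same declarative intent stays largely portable across the engines discussed below; only the table/stream definition changes.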
Okay, hopefully now you've bought in on the idea of querying Kafka data with SQL. So, what can you do with SQL on Kafka?
Here are some sample use cases our team has implemented recently. Feel free to check out the showcases in our docs. In short, as long as you have fresh data, such as clickstream or IoT sensor data, you can build streaming data pipelines or real-time analytics applications, just with SQL.
There are dozens, maybe even hundreds of ways to make coffee, and there are quite a lot of tools to query Kafka data with SQL. Here is a diagram with a few logos, grouped by whether each is a stream processor, a streaming database, or a real-time database. We also grouped them by the primary programming language of those projects. I know it doesn’t mean a lot, but it should mean something, right?
Due to our time limit for this talk (blog), we couldn't cover all technologies. Only select open source projects are listed here.
Let’s go over them from top to bottom, from left to right, starting with systems written in Java and, of course, Flink is the first one.
This is my impression of FlinkSQL. Apache Flink is almost the de-facto standard to apply streaming processing on real-time data.
FlinkSQL is built on top of the low-level Flink APIs.
To connect to your Kafka data, use the CREATE TABLE DDL, followed by a bunch of settings in the WITH clause, such as broker, topic name, offset, and message format.
Once the tables are defined in FlinkSQL, you can query them, or run complex SQL or JOIN.
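As a hedged sketch (the topic name, broker address, and schema are hypothetical), mapping a Kafka topic to a FlinkSQL table and querying it could look like this:

```sql
-- Map a Kafka topic to a Flink table; connector options go in the WITH clause.
CREATE TABLE orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Once defined, query it like any table, e.g. a tumbling-window aggregation:
SELECT window_start, SUM(amount) AS total
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
GROUP BY window_start, window_end;
```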
I've included some coffee tasting notes, based on my opinion. It’s not official or anything, but the main goal is to provide a personal summary for you to compare different tools, to best fit your use cases. Let’s go over them quickly:
Flink's community is awesome, with over 3,000 members in Slack and a good process to share new releases and bug fixes.
It’s low latency, in seconds or sub-seconds.
Quite good streaming capabilities, but not as sophisticated as the low-level Flink APIs.
Flink’s a great stream processor, but it doesn't have long-term storage. This may change with the new Apache Paimon project, but that’s beyond Flink. Recently, the SQL gateway and JDBC driver were added, but the typical way to use Flink is to send processed data back to a Kafka topic, to S3 or Iceberg, or to an OLAP system, such as Apache Pinot or Druid.
Flink supports large-scale deployments very well. It is mature and battle-tested; some organizations run Flink clusters with hundreds of nodes.
Setting up a Flink cluster in production is not easy. It needs a lot of JVM tuning, and still comes with a considerable footprint.
You can get a very delicious espresso with that fancy machine, but there is no single button: you need to spend some time learning and maintaining it. Flink is not always your best starting point to query Kafka with SQL. Let’s be honest, Flink is hard to learn because of its complexity.
Now let’s talk about ksqlDB. As the name suggests, it’s a DB, a SQL database built on top of Kafka. If you set up Kafka locally or use Confluent Cloud or Platform, you get both the Kafka cluster and the ksqlDB server, just like you get a tin of Coca-Cola WITH Coffee.
ksqlDB is the computing engine and uses Kafka as the storage.
You can create a STREAM to connect to an existing Kafka topic or create a new one. You can run INSERT INTO SQL commands to add data. You can also create a TABLE as a Materialized View, to represent what is true as of "now".
There are two kinds of queries: pull queries and push queries.
You can run a pull query to get the current state of a database table or stream.
Add `EMIT CHANGES` to the SQL to make a “push query” that continuously emits new results based on incoming data.
Similar to FlinkSQL, ksqlDB supports many time window functions, such as tumbling, hopping, and session windows. You can also set a watermark to accept a certain delay for late events. By default, you get intermediate results during window aggregations, but you can use EMIT FINAL to get aggregation results only when the window closes.
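Putting these pieces together, a hedged ksqlDB sketch (the `orders` topic and its columns are hypothetical) might look like:

```sql
-- Attach a stream to an existing Kafka topic.
CREATE STREAM orders (order_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'orders', VALUE_FORMAT = 'JSON');

-- Push query: continuously emit matching rows as they arrive.
SELECT order_id, amount FROM orders WHERE amount > 1000 EMIT CHANGES;

-- Windowed aggregation that emits one result per window, on window close.
SELECT order_id, COUNT(*) AS cnt
FROM orders
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY order_id
EMIT FINAL;
```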
Because ksqlDB uses Kafka as its storage, there are some limitations:
If you have multiple Kafka clusters, a single ksqlDB server cannot consume data across clusters.
Kafka data is not just in memory, so in theory you can keep any amount of data in Kafka and make it accessible via ksqlDB. But in reality, keeping multiple months of data in Kafka won’t be cheap. You probably should not use ksqlDB for long-term storage.
On the other hand, there is no index on Kafka data, so querying large amounts of it will be expensive. ksqlDB is mainly for streaming ETL; it is not designed to be queried directly by third-party tools such as BI dashboards.
Using ksqlDB creates a few new topics in the Kafka cluster, adding some cost and management effort.
There are some limitations for JOIN: for example, the joined streams or tables need to be co-partitioned, with the same number of partitions and the same partitioning strategy.
Hazelcast is another open source streaming database. It is a distributed computing and storage platform, mainly doing in-memory processing, with cluster support.
Apache Pinot and Apache Druid
Let’s talk about real-time OLAP.
Please check the video recording (available soon) for more details.
Trino is not an OLAP database itself. A common use of Trino is federated queries: you run SQL on Trino, and it translates the query into the low-level API calls of each downstream system.
Manual configuration is somewhat complex.
There is no storage or index in Trino. It’s like decaf: you are drinking coffee, but with little or no caffeine.
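For illustration only, a federated Trino query might look like the sketch below. The catalog, schema, and table names are hypothetical, and the Kafka topic's columns are assumed to be mapped in the connector's topic description file:

```sql
-- Join live Kafka data (via Trino's Kafka connector) with a dimension
-- table living in PostgreSQL, all in one SQL statement.
SELECT u.name, count(*) AS clicks
FROM kafka.default.clickstream AS c
JOIN postgresql.public.users AS u
  ON c.user_id = u.id
GROUP BY u.name;
```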
At Timeplus, we open-sourced our core engine, Proton, under the Apache 2.0 License: https://github.com/timeplus-io/proton
Proton’s key design is the combination of streaming storage (Nativelog, like Apache Kafka) and historical storage (columnar OLAP storage, based on ClickHouse).
It supports rich capabilities for both streaming and batch queries, for example:
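As a hedged sketch (the `iot_events` stream is hypothetical), the same Proton SQL dialect can cover both modes: querying a stream directly is a streaming (unbounded) query, while wrapping it in `table()` scans the historical store:

```sql
-- Streaming query: unbounded, continuously emits per-window results.
SELECT window_start, device, count() AS cnt
FROM tumble(iot_events, 5s)
GROUP BY window_start, device;

-- Batch query: wrap the stream in table() to scan historical storage once.
SELECT count() FROM table(iot_events);
```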
Because Timeplus combines both streaming and batch analysis capabilities, it is as unique as a mocha – combining the best of coffee and hot chocolate.
Please check the video recording (available soon) for more details.
ClickHouse and StarRocks
The Rust-based OSS Tools for SQL on Kafka
Compared to C++, Rust has the following benefits:
Memory Safety. Rust has a strong emphasis on memory safety. It enforces strict rules at compile time, preventing common memory-related bugs like null pointer dereferencing, buffer overflows, and data races.
Concurrency. Rust's ownership system ensures safe concurrency by eliminating data races at compile time.
Syntax. Rust's syntax is modern for sure, with lessons learned from other languages.
Ecosystem. Rust's ecosystem is growing rapidly, but it may not be as mature or extensive as C++'s.
RisingWave is a distributed SQL database for stream processing. It is designed to reduce the complexity and cost of building real-time applications. RisingWave consumes streaming data, performs incremental computations when new data comes in, and updates results dynamically. As a database system, RisingWave maintains results in its own storage so that users can access data efficiently.
There are many sources and sinks provided out-of-box from RisingWave. By leveraging the PostgreSQL protocol, it can be integrated with many systems and tools, such as dbt, Grafana, Superset, DBeaver, etc. We see some similarities between RisingWave and Cappuccino.
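As a hedged sketch (the topic, broker address, and columns are hypothetical), ingesting a Kafka topic and maintaining an incrementally updated result in RisingWave could look like:

```sql
-- Ingest a Kafka topic as a source.
CREATE SOURCE clicks (user_id INT, url VARCHAR, ts TIMESTAMP)
WITH (
  connector = 'kafka',
  topic = 'clicks',
  properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

-- Incrementally maintained result, queryable over the PostgreSQL protocol,
-- e.g. from psql, Grafana, or Superset.
CREATE MATERIALIZED VIEW clicks_per_user AS
SELECT user_id, COUNT(*) AS cnt
FROM clicks
GROUP BY user_id;
```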
Databend is a high-performance OLAP database. Kafka ingestion is provided as a separate tool.
More tools available
We just quickly went over how to query Kafka data with SQL, with open source tools.
If we added other SQL-based tools, whether cloud offerings, source-available, or closed-source, certainly more logos would appear.
Databricks SQL provides streaming SQL on top of SparkSQL.
Materialize, one of the first streaming databases, now focuses on its cloud-only operational data warehouse.
Rockset and Tinybird provide cloud services for real-time data queries.
We probably missed a few other logos. This is not a definitive list.
How do you like your coffee?
Here are some of the key OSS projects mentioned in this talk.
There is no good or bad choice, and no one-size-fits-all solution. Choose the one that works best for you, or even combine them, based on the stage of your project, your team size, and your use cases.
If you’re already using some OLAP systems, check out how they can load real-time data from Kafka.
If complex JOINs or late events need to be properly handled, choose a stream processor or streaming database.
If you want to simplify your stack, make end-to-end troubleshooting easier, or save cost, consider using a streaming database to unify real-time and batch processing.
The main purpose of this talk (blog) is to help you explore various ways to make and drink coffee. We hope more of you will enjoy coffee, no matter which flavor you like most.
To see a recording of our talk and get a copy of the presentation slides, sign up here.