Gang Tao

Proton: An Open-Source Alternative to ksqlDB for Streaming Processing

Stream processors exist to extract perishable and valuable insights from streams of data. The digital world has long worked with Big Data, but Fast Data is becoming more of a concern as more devices come online and produce data, and as decision windows shrink. Smart watches detect cardiac arrhythmias and help prevent heart attacks; online shoppers expect highly relevant recommendations and promotions; fraud detection systems must be aware of buying experiences across devices and channels. The examples go on.


The common theme with all those use case patterns is simple: time to insight.


But systems supporting insight must not only be aware of and reactive to data streams: they must also be easy to set up, to use, and to maintain. Over roughly the last decade, two open-source stream processing frameworks have become popular thanks to their ease of use: ksqlDB and Apache Flink’s SQL API.


This blog focuses on ksqlDB, contrasts it with an open-source engine named Proton, and explains five reasons to use Proton.


(We are also working on a comparison with Apache Flink. If you want to read more on Flink and Proton, join our Slack Community and let us know!)


 

Here is a high-level comparison of ksqlDB and Proton:


| Feature | ksqlDB | Proton |
|---|---|---|
| License | Confluent Community License (CCL) | Apache 2.0 |
| Resource consumption | High | Low |
| Support for other messaging systems | Kafka only (dependency) | Pulsar, Kinesis, and more (supports but not dependent) |
| Performance | Good | Excellent |
| Stateful streaming processing | Yes | Yes |
| Bounded, pull-based query | Yes | Yes |
| Unbounded, continuous push-based query | Yes | Yes |
| Materialized view and table concept | Yes (on top of RocksDB) | Yes (on top of ClickHouse) |
| Language | Java (on Kafka Streams) | C++ |
| Kafka Connection | Yes (source and sink) | No (supported by Timeplus Cloud or Timeplus Platform) |
| Cluster and HA | Yes | Yes, based on ClickHouse |
| Security | Yes, role-based access control | Yes, based on ClickHouse |
| User Defined Function | Java | JavaScript |


Some history: ksqlDB (previously known as KSQL) is an open-source streaming SQL engine for Apache Kafka, a popular distributed streaming platform. It enables you to build stream processing applications using SQL syntax, making it accessible to any developer familiar with SQL. The goal was for ksqlDB to simplify the development of real-time data processing applications by abstracting away the complexities of low-level Kafka APIs.  


There is a famous paper, Streams and Tables: Two Sides of the Same Coin, which highlights the duality of stream processing, where data can be viewed and processed either as a continuous stream or as a series of structured events organized into tables. The ksqlDB engine implemented concepts presented in the paper. 
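
To make the stream/table duality concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how ksqlDB implements it: the names `stream_to_table` and `table_to_stream` are hypothetical, the records are made up, and deletions are ignored for brevity.

```python
from typing import Dict, Iterable, Iterator, Tuple

# One changelog record: (key, new_value). Hypothetical shape for illustration.
Change = Tuple[str, int]


def stream_to_table(changelog: Iterable[Change]) -> Dict[str, int]:
    """Fold a changelog stream into a table: the latest value per key wins."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        table[key] = value
    return table


def table_to_stream(before: Dict[str, int], after: Dict[str, int]) -> Iterator[Change]:
    """Emit the change records that turn the `before` table into `after`.

    Deleted keys are not handled in this sketch.
    """
    for key, value in after.items():
        if before.get(key) != value:
            yield (key, value)


changelog = [("alice", 1), ("bob", 5), ("alice", 3)]

# Stream -> table: replaying the changelog yields the current state.
table = stream_to_table(changelog)
print(table)  # {'alice': 3, 'bob': 5}

# Table -> stream -> table: the changes from an empty table rebuild the same state.
replayed = stream_to_table(table_to_stream({}, table))
print(replayed == table)  # True
```

The table is the stream folded up to a point in time, and the stream is the sequence of changes to the table, which is why an engine can expose both views over the same data.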


ksqlDB is built on top of Kafka Streams, a stream processing framework that is part of Apache Kafka. It leverages a lightweight plugin-oriented approach, and implements a stateful stream processing system on the existing Apache Kafka ecosystem.



ksqlDB offers several benefits:


  • SQL Interface: One of the significant advantages of ksqlDB is its SQL-like interface, which allows developers who are proficient in SQL to work with streaming data without having to learn new programming languages or frameworks.

  • Real-Time Streaming Processing: ksqlDB processes streaming data in real time, so developers can build applications that react to events as they happen. This makes it suitable for use cases like monitoring, fraud detection, and real-time analytics.

  • Integration with Kafka: ksqlDB integrates seamlessly with Apache Kafka, making it easy to build end-to-end streaming applications that leverage Kafka's scalability, fault tolerance, and durability.

  • Stateful Processing: ksqlDB supports stateful stream processing, allowing you to maintain state across streams and perform operations such as aggregations, joins, and windowing.

  • Scalability: Since ksqlDB is built on top of Kafka Streams, it inherits Kafka's scalability features, allowing you to scale your streaming applications horizontally by adding more instances as needed.

  • Security: ksqlDB supports role-based access control (RBAC), which lets you restrict access by role.
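
As a rough illustration of the stateful processing mentioned in the list above, the following Python sketch counts events per key in tumbling (fixed, non-overlapping) windows. It is a simplified model under stated assumptions, not ksqlDB's implementation: the function name `tumbling_count`, the event shape, and the sample data are all hypothetical, and late or out-of-order events are not treated specially.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One event: (timestamp in seconds, key). Hypothetical shape for illustration.
Event = Tuple[int, str]


def tumbling_count(events: Iterable[Event], window_size: int) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows.

    The state is one running count per (window_start, key). Each event
    updates exactly one counter, so results are maintained incrementally
    instead of being recomputed from scratch.
    """
    state: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of its window.
        window_start = timestamp - (timestamp % window_size)
        state[(window_start, key)] += 1
    return dict(state)


events = [(1, "login"), (4, "login"), (7, "purchase"), (11, "login"), (12, "purchase")]
counts = tumbling_count(events, window_size=10)
for (window_start, key), count in sorted(counts.items()):
    print(window_start, key, count)
```

A real engine has to keep this state durable and fault tolerant, which is where the storage choices discussed later in this post come in.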



However, there are also limitations with ksqlDB:


  • Deep Coupling with Kafka: ksqlDB is tightly coupled with Kafka at the deployment level. Each ksqlDB server is bound to a single Kafka cluster, and ksqlDB uses Kafka as storage for much of its internal state. As a consequence, there is no way to process streams from different clusters unless you route their data into the same Kafka cluster. Additionally, running ksqlDB adds load to the Kafka cluster, because it creates internal topics that bring extra reads and writes.

  • Heavy Resource Consumption: Every SQL query run on ksqlDB is a Kafka Streams application that creates its own worker threads, consumers, and producers, which adds overhead to every query. ksqlDB also stores state changelogs in Kafka topics and uses RocksDB to materialize these changelogs into tables, which means more resource consumption for state.

  • Not Designed for Analytics: Real-time analytics usually relies on two optimizations. The first is streaming processing, which processes data incrementally by keeping computation state, so the analytic result is emitted as soon as an event arrives. The second is the ability to scan huge amounts of data quickly by skipping irrelevant data. With its Kafka-powered streaming processing, ksqlDB supports the first, but the RocksDB-based key-value storage it uses for tables is a poor fit for the second. (Although some analytics databases are built on top of key-value storage, they usually add an extra layer to support analytics.)

  • Not True Open-Source: ksqlDB is licensed under the Confluent Community License, which is source-available rather than an open-source license. Its main restriction is that you may not offer the software as a hosted or managed service that competes with Confluent's own products and services.
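
The "Not Designed for Analytics" point above hinges on how much data a query must touch. The following Python sketch is a toy model of that difference, not a benchmark of RocksDB or ClickHouse: the sample records and the helper names `to_columns`, `sum_from_rows`, and `sum_from_columns` are hypothetical, and "values read" is just a counter standing in for I/O.

```python
from typing import Any, Dict, List, Tuple

# Row-oriented records, as a key-value store would hand them back.
rows: List[Dict[str, Any]] = [
    {"user": "alice", "country": "US", "amount": 30.0},
    {"user": "bob", "country": "DE", "amount": 12.5},
    {"user": "carol", "country": "US", "amount": 7.5},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def sum_from_rows(records: List[Dict[str, Any]], field: str) -> Tuple[float, int]:
    """Row layout: each full record is read just to reach one field."""
    total, values_read = 0.0, 0
    for record in records:
        values_read += len(record)  # every field of the record is touched
        total += float(record[field])
    return total, values_read


def sum_from_columns(columns: Dict[str, List[Any]], field: str) -> Tuple[float, int]:
    """Column layout: only the requested column is read; the rest is skipped."""
    column = columns[field]
    return float(sum(column)), len(column)


columns = to_columns(rows)
row_total, row_reads = sum_from_rows(rows, "amount")
col_total, col_reads = sum_from_columns(columns, "amount")
print(row_total, row_reads)  # 50.0 9
print(col_total, col_reads)  # 50.0 3
```

Both layouts return the same answer, but the columnar scan touches only the column the aggregation needs. With wide tables and many rows, that gap is what makes a column store attractive for analytic queries.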



Besides these limitations, Confluent's acquisition of Immerok, a company built around Apache Flink, raises doubts about whether Confluent will continue investing in ksqlDB, since it is counterintuitive to maintain two similar products within the same organization.


If you are in the market for a streaming processing tool, or you are currently using ksqlDB but share these concerns, here is an alternative for you to try.


 

What is Proton?


Proton is a streaming SQL engine, a fast and lightweight alternative to Apache Flink, powered by ClickHouse. It enables developers to solve streaming data processing, routing, and analytics challenges from Apache Kafka, Redpanda, and other sources, and send aggregated data to downstream systems. Proton is the core engine of Timeplus, a cloud native streaming analytics platform.



The diagram above shows the high-level architecture of Proton, with these core components:


  • Streaming storage: an append-only log, similar to Apache Kafka, that handles real-time streaming data with very low latency and high scalability

  • Historical storage: built on top of ClickHouse, providing high-performance queries over historical data by leveraging columnar storage

  • Unified query processing: runs SQL-based, incremental, stateful streaming processing just like Flink, while unifying the streaming and batch (historical) modes
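
The way these components fit together can be sketched in a few lines of Python. This is a conceptual toy under stated assumptions, not Proton's API or internals: `ToyEngine` and its methods are hypothetical, a plain list stands in for the stored data, and the choice to backfill from history before following live events is made here purely for illustration.

```python
from typing import Iterable, Iterator, List


class ToyEngine:
    """Hypothetical stand-in for an engine that serves both query modes."""

    def __init__(self) -> None:
        # Stands in for stored data; a real engine keeps an append-only
        # log plus a columnar historical store.
        self.historical: List[float] = []

    def ingest(self, value: float) -> None:
        """Append one event to storage."""
        self.historical.append(value)

    def bounded_sum(self) -> float:
        """Bounded (batch) query: scan what is stored now, return once."""
        return float(sum(self.historical))

    def streaming_sum(self, live: Iterable[float]) -> Iterator[float]:
        """Unbounded (streaming) query: start from the historical result,
        then emit an updated result for every new event."""
        total = float(sum(self.historical))
        yield total
        for value in live:
            self.ingest(value)
            total += value  # incremental update, no rescan of history
            yield total


engine = ToyEngine()
engine.ingest(1.0)
engine.ingest(2.0)

snapshot = engine.bounded_sum()
updates = list(engine.streaming_sum([4.0, 5.0]))
print(snapshot)  # 3.0
print(updates)   # [3.0, 7.0, 12.0]
```

The same aggregation is expressed once and evaluated either as a one-shot scan or as a continuously updated result, which is the essence of unifying the streaming and historical modes.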

 

Common features compared with ksqlDB


There are lots of similarities between Proton and ksqlDB. Both support:


  • Stateful streaming processing

  • Data persistence

  • Stream/table concept

  • Dual query modes: long-running, unbounded, push-based queries and bounded, pull-based queries

  • Query Kafka data and write analytic results back to Kafka

Generally speaking, most features of ksqlDB can also be found in Proton.


 

Why Proton is a better alternative to ksqlDB


Along with positive similarities, Proton offers additional benefits:


  • Proton demonstrates stronger performance: Proton is written in C++ and built on top of ClickHouse, which is known for its outstanding performance. Leveraging SIMD, a specially designed internal data format, and other optimization techniques, Proton can process over 1 million records per second on a commodity computer.

  • Proton is more flexible when consuming Kafka data: Other streaming processing systems offer Kafka only as a source or a sink. Proton instead treats Kafka as an external stream, which can be queried like any other stream while no data is persisted in Proton. An external stream supports both reads and writes, so no extra concept is needed beyond the stream itself, which is Proton's extension of a regular database table. Users who need the data persisted can still create a materialized view. Because there is no direct coupling between Proton and Kafka, users can query data from any Kafka cluster.

  • Proton is purpose-designed for analytic workloads: Similar to ksqlDB, Proton supports both bounded historical queries and unbounded streaming queries. As the architecture diagram shown earlier illustrates, there are two storage components: an append-only log for streaming storage, and a column store based on ClickHouse for historical data. This is why Proton offers strong support for analytic workloads. When joining real-time streams with historical data, Proton can quickly scan large amounts of historical data using the ClickHouse column store, which is something RocksDB cannot provide.

  • Proton supports complex computation with JavaScript UDFs (User Defined Functions): To support complex logic that SQL cannot express, Proton lets you write user defined functions in JavaScript, which run on an embedded JavaScript engine. Compared to the Java-based UDFs that ksqlDB uses, JavaScript UDFs are simpler to write and deploy because there is no build step. Developers don't need to worry about the JVM version, Kafka version, or dependency versions, and can instead focus on implementing business logic.

  • Proton is more developer friendly: Proton is licensed under the Apache License 2.0, which is more permissive than the CCL. Developers can use, modify, and redistribute it for free, including for commercial purposes, subject only to the license's attribution and notice conditions.



 

Summary


Both Proton and ksqlDB can be used to process your Kafka data in real-time. For an open-source option offering support for analytic streaming and historical data use cases, Proton can be a stronger choice for your needs. Stay tuned for our upcoming blogs which will share use cases and technical details!


To explore Proton yourself, visit the Proton GitHub repo or create your own workspace on Timeplus Cloud with a 30-day free trial.


 
