Native Stream Processing/ETL for ClickHouse: Now Available in Timeplus Proton

We’re thrilled to introduce a major upgrade of Timeplus Proton, our open-source core engine, which now integrates natively with ClickHouse. Proton enables developers to send streaming SQL results to ClickHouse, and to read data in ClickHouse or apply SQL JOINs without duplicating storage.


Whether you are using ClickHouse Cloud or a local/self-managed version of ClickHouse, Proton allows you to process live data in Apache Kafka or Redpanda and send it to ClickHouse for fast and reliable real-time streaming analytics.





If you are a visual learner, you can watch this demo by our Co-Founder and Head of Product, Jove Zhong, showing how to set up the integration between Proton and ClickHouse:




 

Love ClickHouse and Kafka? Let Proton take it to the next level.


Apache Kafka and ClickHouse are two of the most widely used and respected open-source projects, so it’s no surprise that developers often use the two together. 


The ClickHouse documentation notes: “Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. In most cases involving Kafka and ClickHouse, users will wish to insert Kafka-based data into ClickHouse.”


The above ClickHouse documentation lists several options for developers wanting to connect Kafka and ClickHouse, identifying the tradeoffs of each approach:


  • ClickPipes for Kafka (available in ClickHouse Cloud): an easy and intuitive way to ingest data into ClickHouse Cloud.

  • Kafka Connect (self-hosted, or hosted by Confluent Cloud, Amazon MSK, etc.): a free, open-source component of Apache Kafka that works as a centralized data hub for simple data integration between Kafka and other data systems.

  • Kafka table engine (self-hosted ClickHouse; not available on ClickHouse Cloud): pulls data from the source system. This requires ClickHouse to have direct access to Kafka.

  • Vector (self-hosted): a vendor-agnostic data pipeline.


Besides these options, other common tools include Apache Flink and Apache Spark, which enable you to read data from the Kafka API and send the processed data to ClickHouse.


Each approach has its own pros and cons. A few key decision points are:

  • Connectivity. Can your Kafka deployment and your ClickHouse deployment reach each other? You may have to set up VPC peering or firewall policies so that your ClickHouse server can talk to Apache Kafka, or so that Kafka Connect can send data to your ClickHouse server.

  • Processing Logic. Many tools can “forward” Kafka messages directly to ClickHouse, or apply simple transformation logic. Do you need stateful processing, such as tumble/hop/session window aggregation, late-event processing, or even CEP (complex event processing)? (See the sketch after this list.)

  • Deployment Footprint and Cost. Tools like Kafka Connect, Flink, or Spark require a heavy cluster that you set up and maintain. Cloud offerings exist, but they involve extra effort for network peering and add cost.
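As a concrete example of the stateful processing mentioned above, here is a minimal sketch in Proton streaming SQL (the page_views stream and its url column are hypothetical):

-- count page views per URL in 5-second tumbling windows
SELECT window_start, url, count() AS views
FROM tumble(page_views, _tp_time, 5s)
GROUP BY window_start, url;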



A Better Way to Stream: Introducing the Proton and ClickHouse integration


At Timeplus, we value simplicity and performance. That’s why we love ClickHouse, and it’s how we built Proton. With Proton 1.4 or above, we’re thrilled to provide a native Proton-ClickHouse integration that enables developers to read from Kafka, apply streaming SQL, and send data to ClickHouse.




For developers who love using Kafka and ClickHouse together, the Proton integration with ClickHouse provides a more resource-efficient way to manage complex stream processing and ETL.


Key benefits of using the Proton-ClickHouse integration:

  • Proton can be deployed anywhere, with a small footprint. You don’t have to worry about connectivity between your Kafka cluster and your choice of ClickHouse deployment; they don’t have to talk to each other directly. For example, if you are using ClickHouse Cloud and running a Kafka cluster in your own data center with low-latency requirements, Proton can be deployed close to your Kafka cluster behind the firewall, with no need to whitelist ClickHouse Cloud egress IP addresses in your Kafka setup. Proton is written in C++, with no dependencies on the JVM or any other components: just use the single binary, or deploy it via Docker or Kubernetes (see the one-liner after this list).

  • Proton is more powerful than Vector and more lightweight than Apache Flink or ksqlDB. You can use Vector to forward Kafka data to ClickHouse without much modification, but Proton provides streaming SQL functionality as powerful as FlinkSQL or ksqlDB in a faster, more lightweight package, so you can easily migrate FlinkSQL or ksqlDB workloads to Proton. (You can find a comparison of Proton and ksqlDB here.)

  • Proton efficiently offloads high-frequency streaming data. When there are thousands or even millions of new messages per second in your Kafka topics, Proton can absorb the load by batching and pre-aggregating the stream before sending the data to ClickHouse. This reduces merge load on the ClickHouse server and avoids write amplification and metadata pressure, making the Proton solution more efficient than writing Kafka data to ClickHouse event by event.
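For instance, deploying a single Proton node can be a one-line Docker command (the image name below is taken from the Proton README; double-check the repo for the current tag):

docker run -d --pull always --name proton ghcr.io/timeplus-io/proton:latest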



 

Proton + ClickHouse Use Cases: Better Together!


We are already seeing several exciting Proton + ClickHouse use cases, such as:

  • Data engineers who need a fast and lightweight solution to process streaming data in Kafka, send it to ClickHouse, and offload data to a data warehouse if needed

  • Network operations teams who need to monitor large-volume and high-frequency network events for real-time network analysis and cybersecurity detection from edge or cloud

  • Fintech trading teams who need to join disparate streams of data performantly, and easily access powerful unified analytics to comply with regulations, mitigate risk, and sharpen their alpha edge


You can find more Timeplus Proton use cases here.


 

Tutorial


Ready to get your hands dirty? Here is an example of how to read live data in Redpanda (a fast and lightweight alternative to Apache Kafka), apply streaming ETL, and send data to your local ClickHouse or to ClickHouse Cloud. A more advanced use case is also available at the end of the blog.


Here is a simple flow diagram to visualize the building blocks:


A Docker Compose stack, together with the sample SQL statements, is provided at https://github.com/timeplus-io/proton/tree/develop/examples/clickhouse. When you start the stack, the latest versions of Proton and ClickHouse, as well as Redpanda and a data generator, will be started automatically.
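To bring the stack up, assuming you have Docker Compose installed, something like the following should work:

git clone https://github.com/timeplus-io/proton.git
cd proton/examples/clickhouse
docker compose up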


If you want to install Proton without Docker, you can just run this command:
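At the time of writing, the Proton README suggests a one-line installer like the one below; please verify the exact command against the GitHub repo:

curl https://install.timeplus.com/oss | sh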


First, create a table with the regular MergeTree table engine in your ClickHouse.

CREATE TABLE events
(
    _tp_time DateTime64(3),
    url String,
    method String,
    ip String
)
ENGINE=MergeTree()
PRIMARY KEY (_tp_time, url);

This will serve as the destination for the Proton External Table for ClickHouse. Later on, you can also read this data back in Proton.


In the demo Docker Compose stack, a Redpanda container is started, together with a data generator and the Redpanda Console so you can easily explore live data. For example, go to http://localhost:8080 and you will see the live data in the owlshop-frontend-events topic.



The goal of this tutorial is to read these access logs, mask the sensitive IP addresses with MD5, and ingest them into ClickHouse for further business analysis.


To read data from Kafka or Redpanda, you just need to create an External Stream with the following DDL SQL:

CREATE EXTERNAL STREAM frontend_events(raw string)
SETTINGS type='kafka',
         brokers='redpanda:9092',
         topic='owlshop-frontend-events';
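To sanity-check the external stream, you can run a streaming query on it; this is an unbounded query, so cancel it with Ctrl+C when you are done:

-- tail the live messages from the Redpanda topic
SELECT raw FROM frontend_events;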

Then run the following DDL SQL to set up the connection between Proton and ClickHouse. For a local ClickHouse without security settings, it can be as simple as:

CREATE EXTERNAL TABLE ch_local
SETTINGS type='clickhouse',
         address='localhost:9000',
         table='events';

If you are using ClickHouse Cloud, make sure secure=true is set:

CREATE EXTERNAL TABLE ch_cloud
SETTINGS type='clickhouse',
         address='abc.aws-or-gcp.clickhouse.cloud:9440',
         user='default',
         password='..',
         secure=true,
         table='events';   

Then create a materialized view to read data from Redpanda, extract the values, mask the IP addresses with MD5, and send the data to the external table. By doing so, the transformed data will be written to ClickHouse continuously.

CREATE MATERIALIZED VIEW mv INTO ch_local AS
    SELECT now64() AS _tp_time,
           raw:requestedUrl AS url,
           raw:method AS method,
           lower(hex(md5(raw:ipAddress))) AS ip
    FROM frontend_events;

Once the materialized view is created, it works as a background ETL job in Proton, continuously reading data from Kafka/Redpanda, applying transformations or aggregations, and sending the results to ClickHouse. To learn more about Materialized Views in Proton, please refer to this documentation.


Now if you go back to ClickHouse and run select * from events, you will see new data arriving with sub-second latency.
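You can also run a quick sanity check on the ClickHouse side:

-- the count should keep growing as the materialized view writes continuously
SELECT count() FROM events;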




You can do more with streaming SQL in Proton, such as late-event processing, complex event processing, or leveraging thousands of ClickHouse functions to customize your transformation/enrichment logic. Many of Proton’s functions are powered by ClickHouse, so if you are already a ClickHouse user, you can use Proton in a similar way.
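For example, here is a small sketch that reuses familiar functions in a streaming query; the raw:field shorthand is the same JSON extraction used in the materialized view above:

-- normalize fields with ClickHouse-style functions, on the live stream
SELECT lower(raw:method) AS method, length(raw:requestedUrl) AS url_len
FROM frontend_events;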


As mentioned above, the External Table in Proton can also be used to read data from ClickHouse, or even to apply data lookups in a streaming JOIN. Simply run SELECT .. FROM external_table_name in Proton; it will read the selected columns from ClickHouse and apply the transformation or JOIN in Proton.
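For example, here is a quick bounded query over the events table we populated earlier, executed through the external table without copying any data into Proton:

-- read back from ClickHouse and aggregate in Proton
SELECT method, count() AS requests
FROM ch_local
GROUP BY method;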


A typical use case: if you have static or slowly changing dimensions (SCD) in ClickHouse, you don’t need to duplicate them in Proton. Just create an external table in Proton, enrich your live data by JOINing the stream with that external table, then send the high-quality data back to ClickHouse.


For example:

-- read the dimension table in ClickHouse without copying data to Proton
CREATE EXTERNAL TABLE dim_path_to_title
SETTINGS type='clickhouse', address='localhost:9000';

-- read Kafka data with subsecond latency
CREATE EXTERNAL STREAM clickstream(
  ts datetime64,
  path string,
  ip string
)
SETTINGS type='kafka', brokers='kafka:9092', topic='clickstream';

-- continuously write to ClickHouse
CREATE EXTERNAL TABLE target_table
SETTINGS type='clickhouse', address='localhost:9000', table='pageviews';

-- downsample the click events into 5-second windows and enrich URL paths with page titles
CREATE MATERIALIZED VIEW mv_pageviews INTO target_table AS
  WITH pv AS (
    SELECT window_start, path, count() AS views
    FROM tumble(clickstream, ts, 5s) GROUP BY window_start, path)
  SELECT window_start AS ts, path, title, views
  FROM pv JOIN dim_path_to_title USING (path);

 

Summary


ClickHouse is the fastest and most resource-efficient open-source database for real-time apps and analytics. Proton provides fast and lightweight streaming SQL capabilities. In this new version of Proton, we added a native integration between these two powerful engines. For the thousands of developers who love ClickHouse, we’re excited to provide a flexible solution that you can deploy wherever you want, reading and writing ClickHouse data with subsecond latency.


You can find the Proton GitHub repo at: https://github.com/timeplus-io/proton


Give it a try and let us know what you think!
