Originally posted on Data Streaming Journey by Yaroslav Tkachenko.
Gang Tao, our Co-Founder and CTO, joined Yaroslav Tkachenko for a discussion on Proton, our open source streaming and historical data processing engine, as well as trends in building streaming databases, and choosing between C++ and Rust.
Hi Gang, could you introduce Proton to the readers?
Recently, we have open-sourced Proton. Proton is our core streaming database that powers Timeplus to do the streaming analytics. At Timeplus, we want to make your analytics workloads as fast as possible. And, typically, you have two directions to solve this problem.
One is stream processing. With stream processing, you continuously monitor the data. As soon as a new event comes, you do some incremental processing with internal state so you don't have to recalculate things from scratch. This solves your problem with a very low latency because as soon as data comes, you get results immediately.
But there's a big problem with stream processing: a lot of time, you also need to backfill using historical data. And most of the stream processing technologies leverage something like Kafka, which is an append-only log. You can immediately get the latest data with sequential reads. But it's not good enough when you scan historical data because, with an append-only log, it's expensive to store a lot of data. Also, you cannot skip data because when you read from a specific point in time, you must read from that point and read all the data. So, that's not designed for historical data processing.
The other direction is real-time OLAP databases: Pinot, Druid, ClickHouse, Doris, etc. A lot of them have some kind of indexing, columnar store, and/or vectorized processing, which makes scanning a large amount of data super fast because it can skip a lot of data. Also, it leverages a compute engine to do some optimizations. So you can perform quick scans. However, the issue with real-time databases is that the data is not fresh. Because when you get new data, you have to build the index. This means, that before the index is there, you actually cannot query that data, which means high data latency.
So, there are two sides to the solution. This is what we want to solve. In most cases, a user doesn't just perform real-time streaming or historical processing. They actually have a hybrid workload. Typically people talk about the Lambda architecture or Kappa architecture, but they only solve part of the problem. What we do differently: we have unified storage: the data is stored with the distributed append-only log and also the historical part. And our engine knows where to read the data. So the user doesn't need to worry about whether it's real-time streaming or historical processing.
Proton is built on top of ClickHouse. Could you elaborate on why you chose ClickHouse specifically? And not any other technology, or not something you maybe build in-house?
As I mentioned, our target is to support analytical jobs. And we didn't want to build from scratch because we just wanted to start quickly and solve the problem. We actually checked all available solutions. For the analytics, we wanted to do something like Flink.
Flink has already solved the stream processing problem. And there are some real-time databases solving the historical scan issue. We wanted to have a solution that unifies them. Actually, what a lot of people are doing now is they have Flink to do real-time data transformations, and then they put results into ClickHouse or Druid.
So, to simplify this, we looked for the best option for us to build a unified engine. You can create something Java based, like Pinot or Druid. There are others, like ClickHouse, which is C++ based. At that time, we evaluated everything. And because a lot of us came from Splunk, we wanted to build something better than Splunk.
Splunk is written in C++. So many of our team members have a lot of experience with it. And when we looked at ClickHouse, we found it has a good foundation not only because it's fast but also because its architecture is very elegant. It has very good extensibility. E.g. table engines, table functions - it's very easy to add something to ClickHouse. So we looked for a solid foundation which we wanted to extend with streaming, and ClickHouse was a very good choice at that time.
In my opinion, when I look at the architecture of Proton, the unified query processing engine is one of the most interesting parts. How does it decide between doing streaming reads and historical reads?
We have a technology called the Timeplus data format. Timeplus data format is a unified data structure that defines how the data looks like. The append-only log and the historical data store actually share the same data structure. There is also metadata in this data structure that indicates if the data is currently a part of the streaming append-only log or the historical data store.
It's truly unified! So, when it scans the data, it can see how many records are on the streaming side vs historical. During the SQL query planning, we know where everything is located, so we can easily backfill the historical data into the stream query.
Does it look at some kind of timestamp?
Yes, timestamps are parts of the metadata.
Got it! If you look at the Proton architecture diagram, it does look like it has a lot of moving parts. What consistency guarantees do you have? And how do you test them?
Even though it looks like there are many parts (the SQL engine, ingestion, streaming and historical storage), there is actually a single process. Everything is built into a single binary. You don't have to worry too much about consistency because everything is in a single process. No need for the two-phase commit and such.
But to be honest, at this point, we only support at-least-once delivery, we don't support exactly-once. We can support it, but at this phase, we chose a simpler approach. We haven't spent enough time on a better checkpointing mechanism yet.
We do have checkpointing. But we don't perform it continuously, it’s interval-based. That means, if you have a materialized view and you need to rebuild it, it is possible that some of the data is duplicated. So, to summarize, the whole architecture is a single component. We can implement exactly-once if we spend some time on that (when the users ask).
But there are other parts: sources and sinks. They're actually different components. For example, regarding sources, Kafka is the most popular one. We can leverage consumer groups and the offsets to ensure exactly-once delivery. But for the sinks, it's something that the client should handle. We do provide an offset so the client knows which offset they've already consumed. And when something like a restart happens, a client can resume from a specific offset.
As far as I know, Proton, as well as Timeplus, only supports SQL as an interface. And in your journey so far, did you find that it's always enough? Or did you get requests for the programmatic APIs? If not, maybe users have to rely on some custom UDFs? What is your experience so far?
If you’re talking about the SQL language, there are actually several generations. The first generation is a machine code, nobody understands it. The second generation is Assembly; you can understand a little bit, but it is still hard to use. The third generation is languages like C++ , Python, Java. SQL is the fourth generation. It's descriptive; you describe what you want, and you get things magically done.
But it is good because more people understand SQL than Java or C++. However, SQL is not as flexible as just writing code, for example, when you want to loop over something.
SQL is actually Turing complete (thanks to recursive CTEs), but still, some things are really hard to describe using SQL. In the case of our customers, 80% of the jobs or even more can be described as SQL. But you always have something that cannot be done with SQL, especially AI-related.
Proton and ClickHouse are both implemented in C++. But historically, a lot of data processing engines actually used JVM languages like Java and Scala. Nowadays, personally, I see an explosion of interest in Rust.
You mentioned that many early employees in Timeplus came from Splunk, and you used C++, so it was a natural fit. But imagine for a second that you don't need to use ClickHouse. Would you still use C++ because of all the people's experience? Also, how hard is it to hire new people? Specifically with C++ and database and/or streaming engine experience.
Let me start with the latter question. The most challenging thing for us is definitely hiring people. Especially C++ engineers with distributed processing or database background. Talking about Rust and C++, I definitely see a trend in the open source or data community. It's just rewriting existing tools in Rust. And that sounds really cool. But to me, if you create something exactly the same, you don't create value, right? I don't know if you heard the story of RisingWave, they actually started with C++ and then rewrote everything in Rust.
I guess because they have resources, but also maybe because they have a vision that in the future, their Rust investment will pay off. I think the most important thing is not whether to use C++ or Rust; it's what extra value you're creating. That's more important. And maybe an engineer cares about using Rust or some other new programming language because engineers like something new and cool.
But the customers don’t care. They just want something to solve the problem. And if you solve the problem, they don't care if it's Java, C++ or Rust. But I would like to say that assuming you're starting from scratch, Rust is a good choice. However, my advice is that you build something valuable instead of just rewriting things. It makes no sense to me.
Finally, what was the most challenging part to implement in Proton?
So, we already had ClickHouse. We started adding stream processing to it. And we started with a single-node deployment. That's a very good starting point. We have low latency. But then we started adding support for clustering. And in this case, you have to deal with distributed state management, distributed consistency; things get complicated. I think that's why Flink is so good. After many years of development, it's quite stable. But clustering support also means some additional performance cost. So, I think for us, the most challenging part is still everything about building a distributed system.
Thank you! Any closing words?
Thank you for having me! I just wanted to say that we open-sourced Proton, and we would be very happy if it could bring some value to the community. Please let us know if you have any feedback, and we are very happy to work with the community to build something useful.
This interview was originally posted on Data Streaming Journey by Yaroslav Tkachenko.
Ready to get started? Proton is the open-source engine from Timeplus, a unified streaming and historical data processing engine in a single binary. Try it yourself by launching with Docker.