Ting Wang

The Timeplus Journey to Open Source: Introducing Proton

You have to be a little crazy to start a data infrastructure company.


It demands giving up so much in the pursuit of trying to build something great. Career security. Sleep. Weekends.

But as engineer entrepreneurs, we’re lucky to have the opportunity to build products that solve problems that drive us insane. That’s what led me to found Timeplus: I experienced firsthand how hard it is for developers to build powerful real-time streaming analytics and manage it efficiently. I knew solving this headache would be a huge challenge, but sometimes it takes just a little bit of crazy to solve a maddening problem.


Let’s step back for a bit - here’s the challenge for “simple” unified analytics. Imagine you’re a runner, training for a 5k, and you want to be faster than your friends, who are also going to run that race. The only way this will happen is if you run faster than you’ve ever run before. Wouldn’t it be great if your watch could continuously answer this simple question: “how close am I to my best run?” There are lots of ways to answer that: average speed over the last mile, average speed over the last few minutes, how fast you were during this part of your best run, etc. During training runs, you can experiment and decide if the best strategy for you is to slowly increase your pace; or start fast, settle into a rhythm, and sprint at the end; or something else perfectly tailored to your running style. But fundamentally all those questions compare history to what is happening right now - a unified blend of live and historical data.

Now say it’s race day: you line up with your friends, ready to go, and you look down at that watch. But now, imagine if you could expand those questions: “how close am I to all my friends’ best races, what did they do at this point, and what should I do next?” What if the dataset wasn’t limited to you and your buddies, but covered everyone who’s ever run that race, in every weather condition, with insights into every second-by-second decision the runners made, both in the past and in real time, with equal visibility into everything that’s ever happened and everything that’s happening now?


That is the promise of unified analytics: to extract winning patterns and strategies from everything known about moments similar to the one you must act on. You have to decide what to do, but now you can instantaneously decide using timely insights, based on the most complete foundation: historical + live data.


I’m an engineer at heart. When faced with the maddening challenge of building “simple” unified analytics, my instinct was to go head-down and build. And build we did. I’m so proud of the amazing work our team has accomplished over the last 18 months. It’s immensely satisfying to have built something that elegantly solves fellow engineers’ problems.


We’ve built a great foundation for unified analytics. But I’ve realized that if our goal is to have engineers rethink analytics entirely, we need to be more ambitious. We need to not fear experimentation; we need to encourage it. We need to go open source.


What’s next: open-source Proton


I’m excited to announce that Timeplus has licensed its core stream processing engine, “Proton” (yes, we love physics), as open source to the global community under the Apache 2.0 license.


Proton provides unified streaming analytic capabilities in a single binary. We strongly believe stream processing alone is far from enough for many analytic use cases. Solving “root cause” problems in real-time requires the ability to reference historical data to get the job done right. We chose ClickHouse's unparalleled columnar database engine for historical analytics, and complemented that with our own stream processing.


Leveraging Timeplus and ClickHouse, developers can now experiment with novel ways to solve their own analytics problems.


They can now natively and seamlessly connect streaming and historical data to solve complex real-time analytics use cases, such as online and offline correlation, backfill, and backtesting. Engineers can easily run unified analytics, solving challenging analytic problems across a number of industries, and they can do it faster and cheaper than before. In case studies, customer total cost of ownership with Timeplus was 10% of that of other streaming/real-time frameworks. All with one streaming SQL.
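
To make the idea of connecting streaming and historical data concrete, here is a small Python sketch built on the runner example from earlier. It is an illustration only and does not use Proton or its streaming SQL; the function name `compare_to_best`, the data shapes, and the numbers are all hypothetical.

```python
def compare_to_best(best_splits, live_splits):
    """Yield (km, gap_seconds) as each live split arrives.

    best_splits: historical data, seconds per km from the runner's best run.
    live_splits: an iterable of (km, seconds) produced during the current run.
    A negative gap means the runner is ahead of their best run so far.
    """
    best_total = 0.0
    live_total = 0.0
    for km, seconds in live_splits:
        # Join each live event with the matching historical record.
        best_total += best_splits[km]
        live_total += seconds
        yield km, live_total - best_total


# Hypothetical numbers for a 5k: historical best vs. splits arriving live.
best = {1: 300, 2: 310, 3: 315, 4: 320, 5: 305}
live = iter([(1, 295), (2, 312), (3, 310)])

for km, gap in compare_to_best(best, live):
    print(f"km {km}: {gap:+.0f}s vs. best run")
```

The answer is refreshed on every new event rather than computed once at the end, which is the difference between a streaming query and a batch query over the same data.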


Some specific technical details around our open-source codebase:

  • Our core streaming engine, Proton, is built on top of a single-instance ClickHouse codebase and has a substantially different business focus. We highly respect ClickHouse’s engineering perfection and the simplicity, performance, and efficiency of the project. We waited to release Proton until we had built a project worthy of being in conversation with the extraordinary work of the ClickHouse team.

  • Proton is purpose-built for stream processing and highly optimized for time-related data. Earlier academic work, such as Stanford STREAM and MIT AURORA, strongly influenced its system model, which also adopts the latest advances in stream processing.

  • To support stream processing, we designed and implemented our own streaming store, a distributed Write Ahead Log (internal code name ‘NativeLog’) that sequences Timeplus events. The Timeplus Data Format minimizes data serialization and deserialization among the streaming store, the historical data store, and the in-memory representation. We also optimized time-related query processing, with timestamp predicate push-down and projection push-down, while enabling vectorized stream processing.

  • For streaming query processing, we designed and implemented our own watermark system, which reasons about completeness in stream processing together with the sequencing of data in the streaming store. We also added other critical stream processing functionality, such as data shuffling, substreams, concurrent streaming joins, multi-shard processing, tumble, hop, and session windows, and data revision (a superset of CDC). Most importantly, we built checkpointing and recovery of streaming query processing state from scratch.

  • Streaming SQL can’t cover all use cases, so we extended the core engine to support JavaScript User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs). The latter enable powerful contextual, state-machine-based streaming analytics.

  • Last but not least, if users don’t want to index data in Timeplus, they can run a streaming query directly on an external stream, as a federated streaming query. For example, users can run a streaming query on an external Kafka or Redpanda topic.

  • For more technical, licensing, and usage details, take a look at the FAQ for the open source release of Proton.
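
As a rough illustration of two ideas from the list above, tumble windows and watermarks, here is a minimal Python sketch. It is not Proton’s implementation: the class name `TumbleCounter`, the `allowed_lateness` parameter, and the watermark rule (highest event time seen so far, minus the allowed lateness) are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


class TumbleCounter:
    """Count events per fixed-size (tumble) window and emit a window once
    the watermark passes its end. Hypothetical sketch, not Proton's API."""

    def __init__(self, window_size, allowed_lateness=0):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> event count

    def add(self, event_time):
        """Ingest one event; return the (start, end, count) windows it closes."""
        start = (event_time // self.window_size) * self.window_size
        # An event is too late if its window was already emitted.
        if start + self.window_size <= self.watermark:
            return []
        self.open_windows[start] += 1
        # Assumed watermark rule: max event time seen, minus allowed lateness.
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        closed = []
        for s in sorted(self.open_windows):
            if s + self.window_size <= self.watermark:
                closed.append((s, s + self.window_size, self.open_windows.pop(s)))
        return closed


counter = TumbleCounter(window_size=5, allowed_lateness=2)
emitted = []
for t in [1, 3, 6, 4, 8, 12]:  # event 4 arrives out of order but in time
    emitted.extend(counter.add(t))
print(emitted)  # [(0, 5, 3), (5, 10, 2)]
```

The out-of-order event at time 4 is still counted in the first window because the watermark has not yet passed that window’s end. An event arriving after the window was emitted would be dropped here; a real engine may handle such events differently.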

Proton is the heart of the heads-down building we’ve focused on for the last 18 months. Our hope is that Proton becomes the engine for novel experiments that push this community, collectively, toward solving our most maddening problems.


To get started, check out Proton on GitHub. You can be up and running in just a few minutes with a preconfigured Docker image, your mission-critical data, and a willingness to experiment.


Go crazy with Proton!


Read "Part II: How We Got Here" to learn more about the backstory of our journey.

