The Save Game Disaster: When Timeplus Proton Learned to Checkpoint Smarter
- Gang Tao

- Dec 8
RocksDB-Backed Incremental Checkpoints
Remember old video games where you hit "Save" and the whole game froze for a second? That was exactly what happened with Timeplus Proton's early checkpoint system.
When you're playing a long game, you periodically save your progress. If the game crashes or you make a mistake, you can reload from your last save instead of starting over from the beginning.
In stream processing systems, a checkpoint is exactly that: a snapshot of your processing state at a specific moment in time. It captures (see the sketch after this list):
- Where you are in the input stream (which messages you've processed)
- All the intermediate calculations and aggregations you've built up
- Any state you're maintaining (like counts, joins, windows, deduplication sets)
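As a rough mental model, you can think of one materialized view's checkpoint as a little bundle like this (illustrative Python only; the field names are made up, not Proton's internals):

from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    # Where we are in the input stream: source name -> last processed offset
    source_offsets: dict[str, int] = field(default_factory=dict)
    # Intermediate results we've built up, e.g. per-key running counts
    aggregation_state: dict[str, int] = field(default_factory=dict)
    # Any other operator state: join buffers, open windows, dedup sets, ...
    operator_state: dict[str, bytes] = field(default_factory=dict)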
When you only had a handful of materialized views (the streaming processors in Timeplus), the simple approach worked fine (a sketch follows this list):
- Fixed save interval timer
- Serialize all in-memory state
- Write everything to disk
- Resume playing
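In heavily simplified pseudocode, that old behavior boils down to a stop-the-world loop like the one below. None of this is Proton's real code; it just shows the shape of the problem:

import pickle
import time

def full_checkpoint_loop(view_state: dict, interval_s: float, path: str) -> None:
    while True:
        time.sleep(interval_s)             # fixed save interval timer
        blob = pickle.dumps(view_state)    # serialize ALL in-memory state
        with open(path, "wb") as f:        # rewrite the whole file on disk
            f.write(blob)
        # ...then resume processing; in the old engine, processing stalled
        # for the entire serialize-and-write step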
Then users started running hundreds of materialized views on a single server. Each one needed its own checkpoint. Suddenly, every save interval meant:
- Serializing hundreds of large in-memory structures
- Writing massive files to disk
- CPU stalls (the game freezes!)
- I/O spikes (the disk screams!)
Your smooth streaming pipeline turned into a stuttering mess. Players (operators) were not happy.
We recently released Proton 3.0, which introduces a game-changer for this situation: RocksDB-backed incremental checkpoints.
Instead of saving your entire game state every time, modern games only save what changed since the last save. RocksDB's LSM engine stores data in immutable SSTable files organized into levels.
A checkpoint is just:
- A manifest
- A list of SSTables "in scope" for this snapshot
To build an incremental checkpoint (sketched in code after this list):
- Track which SST files were already in the previous checkpoint
- Only copy/upload newly created SSTs
- Keep old ones by reference (often via hard links)
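Here's a minimal sketch of that bookkeeping. This is illustrative Python, not Proton's checkpoint coordinator; the directory layout and manifest format are assumptions made for this post:

import json
import os
import shutil

def incremental_checkpoint(live_ssts: set[str], prev_ssts: set[str],
                           db_dir: str, ckpt_dir: str) -> set[str]:
    os.makedirs(ckpt_dir, exist_ok=True)
    new_ssts = live_ssts - prev_ssts            # only these need real I/O
    for sst in sorted(live_ssts):
        src = os.path.join(db_dir, sst)
        dst = os.path.join(ckpt_dir, sst)
        if sst in new_ssts:
            shutil.copy2(src, dst)              # copy/upload newly created SSTs
        else:
            os.link(src, dst)                   # keep old ones by reference (hard link)
    with open(os.path.join(ckpt_dir, "MANIFEST.json"), "w") as f:
        json.dump({"ssts_in_scope": sorted(live_ssts)}, f)
    return live_ssts                            # becomes prev_ssts for the next round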
Here's what an incremental checkpoint for an aggregated count state can look like.
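The layout below is purely illustrative (file names and comments are invented for this post, not actual Proton output):

checkpoint_42/
  MANIFEST        # lists the three SSTs in scope for this snapshot
  000011.sst      # unchanged since checkpoint_41: kept by hard link, no copy
  000012.sst      # unchanged since checkpoint_41: kept by hard link, no copy
  000015.sst      # new SST with the latest count updates: the only file copied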

Proton's checkpoint coordinator treats the LSM as an append-only log of SSTables. Recovery just reconstructs the exact snapshot without rewriting unchanged data. For hybrid hash aggregations already living in RocksDB? We capture crash-consistent snapshots and only ship delta SST files between versions.
Combine this with an asynchronous checkpoint strategy, and the main execution threads aren't blocked anymore (a minimal sketch follows).
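The shape of that change looks roughly like this (illustrative Python reusing the incremental_checkpoint sketch from above; the queue, worker, and directory names are invented):

import queue
import threading

ckpt_requests = queue.Queue()          # snapshot requests from the hot path

def checkpoint_worker() -> None:
    prev_ssts: set[str] = set()
    while True:
        live_ssts = ckpt_requests.get()            # blocks until work arrives
        prev_ssts = incremental_checkpoint(live_ssts, prev_ssts,
                                           db_dir="db", ckpt_dir="ckpt")

threading.Thread(target=checkpoint_worker, daemon=True).start()

def on_checkpoint_timer(live_ssts: set[str]) -> None:
    ckpt_requests.put(live_ssts)       # cheap enqueue; processing never waits on disk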
But wait! The team still had to solve: How often should we save?
The old Proton system had a single checkpoint.interval configuration.
- Short intervals mean resuming closer to "now" after a crash, but frequent write bursts.
- Long intervals mean less overhead, but more data to reprocess on restart.
The first attempt was a hardcoded 5-second base interval. Tested fine in small ETL scenarios. And we shipped it.
Then the dashboards lit up like a Christmas tree.
Production instances with hundreds of MVs started reporting widespread ingest_timeout errors. What went wrong?
The Prometheus metrics told the truth:
- CPU was healthy
- Memory was healthy
- Disk I/O: alarm!
The 5-second checkpoint interval was creating relentless I/O pressure, especially on nodes hosting both historical storage AND the streaming log (refer to Timeplus architecture design). Every 5 seconds, hundreds of MVs tried to checkpoint simultaneously. The disk couldn't keep up. Game over!
The fix includes both operational and code changes:
Operational Fix: Split storage! Historical tables and nativelog should live on separate disks, reducing contention between large merges and small checkpoint writes.
Code Fix: Instead of one magic number, introduce a three-tier checkpoint system:
checkpoint_settings:
  min_interval: 60s          # Floor - auto can't go lower
  light_state_interval: 5s   # Lightweight ETL views
  heavy_state_interval: 15m  # Heavy aggregation state
The auto-interval logic now (see the sketch after this list):
- Separates "light" and "heavy" workloads
- Enforces a min_interval floor so auto mode can't become too aggressive
- Lets lightweight ETL views save frequently (5s) without dragging down heavy aggregations
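In pseudocode, the interval selection looks something like this. It's an illustrative sketch: how Proton actually classifies a view as light or heavy, and the exact clamping rules, are simplified away here:

from typing import Optional

def auto_checkpoint_interval(has_heavy_state: bool,
                             auto_suggested_s: Optional[float] = None,
                             min_interval_s: float = 60,
                             light_interval_s: float = 5,
                             heavy_interval_s: float = 15 * 60) -> float:
    if auto_suggested_s is not None:
        # auto mode may tune the interval, but never below the configured floor
        return max(auto_suggested_s, min_interval_s)
    # otherwise fall back to the per-tier defaults
    return heavy_interval_s if has_heavy_state else light_interval_s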
Combined with RocksDB incremental snapshots, single-instance deployments now stay stable as the number of materialized views grows, while preserving predictable recovery behavior.
All these improvements to the checkpoint mechanism came from real customer pain points. What we learned:
- Incremental is better than full saves (when you have RocksDB-backed state)
- Not all workloads are equal (light ETL ≠ heavy aggregations)
- Magic numbers hurt (one 5-second interval nearly broke production)
- Async is your friend (don't block the main thread)
- Storage matters (separate your hot write paths!)
Modern Proton checkpoints are like a well-designed save system in a AAA game: mostly invisible, highly efficient, and you only notice when something goes wrong. Which, now, it doesn't.
Want to try it? Check out Timeplus Proton here: https://github.com/timeplus-io/proton/


