A Unified Architecture: How Timeplus Bridges Historical Backfill and Real-Time Processing

Li Yang
May 22

By Li Yang, Senior Engineer at Timeplus

The hard part of backfill is not reading old data once, but switching to real time within the same query.

In a stream processing system, backfill means a streaming query first processes historical data, then seamlessly switches to real-time data and keeps running.

Some example use cases:

AI infrastructure and large model monitoring (LLMOps), large-scale log recomputation: When token billing models or safety guardrail rules change, the system needs to rapidly replay a massive store of historical conversation logs to update billing or audit results, then seamlessly connect to the live stream from the model gateway.

Real-time feature engineering (ML / Feature Store), state warm-up: When launching window features such as “user behavior statistics over the past 24 hours,” historical logs must first be replayed to fill the window state before connecting to the real-time clickstream. Otherwise, model predictions during cold start may fail because feature values are empty.

Quant finance and algorithmic trading (Quant Trading), same source for backtesting and live trading: When developing trading strategies, the same SQL can be used to quickly run through years of historical market data for backtesting. After the scan finishes, the system automatically and seamlessly switches to live market data streams for scoring.

Change data capture and real-time data warehouses (CDC / Real-time DWH), snapshot and incremental handoff: When creating a new dashboard or materialized view, the system first pulls a full historical database snapshot in batch mode, then locks onto the exact offset from which to consume incremental binlogs as a stream. During the handoff, it must neither lose data nor create primary key conflicts.

Observability and security risk control (Observability / SIEM), rule backtesting and tracing: After adding a new anti-fraud or alerting rule, the system first scans massive logs from the past 30 days to validate rule accuracy and trace missed events. The same logic then remains active to intercept new traffic.

The real difficulty is not whether the system can start reading from the earliest data, but rather:

The historical phase and the real-time phase must share the same execution pipeline, instead of being two separate jobs.
The switching point must be precise: no missed reads, no duplicate reads.
The historical phase must be fast enough; otherwise, the semantics may be correct, but the system is not practical.

Thus, backfill is not simply an offset rewind problem. It is an execution architecture problem for a unified streaming and batch system.

Technical Evolution: NativeLog -> HistoricalStore + NativeLog -> HistoricalStore with Parallel Scan + NativeLog

Timeplus’s backfill capability first ensured semantic continuity, then evolved to improve historical backfill efficiency.

Stage 1: NativeLog

Historical replay and real-time consumption both come from the same streaming log.

Stage 2: HistoricalStore + NativeLog

The historical phase switches to columnar scan, while the real-time phase is still provided by NativeLog.

Stage 3: HistoricalStore with parallel scan + NativeLog

The historical phase supports multi-threaded parallel scan, then seamlessly switches to NativeLog.

Stage One: Backfill Only on NativeLog

In the early stage, Timeplus implemented backfill by replaying directly from the earliest position in NativeLog, then continuing to consume the latest streaming data.

This stage first solved the most important semantic problem:

Historical and real-time data come from the same log sequence.
The switch is naturally continuous, and users only need one streaming query.
seek_to='earliest' can trigger replay from the earliest position.

But two bottlenecks quickly appeared. First, historical depth was limited by retention. NativeLog is a streaming storage layer. It is good at low-latency writes and real-time consumption, but it is not designed for unlimited long-term historical retention. If backfill relies only on NativeLog, the backfillable range is directly constrained by retention.

Second, the historical path was still replay, not scan. Even though NativeLog uses a native block format and is much faster than Avro/JSON decoding, its historical read path is still essentially replaying data in log order:

Throughput is mainly bounded by the replay path.
It is hard to split the work by part and run it in parallel, as a column store can.
It is hard to raise historical reads to batch-level scan efficiency.

In other words, the first stage solved “how to stay continuous,” but did not fully solve “how to support high throughput and long history.”

Stage Two: HistoricalStore + NativeLog

The second step was to separate “historical reads” and “real-time consumption” from the same physical storage:

NativeLog is responsible for real-time writes, low-latency streaming reads, and the tail phase after switching.
HistoricalStore is responsible for long-term historical retention and batch-style scans.

This step is critical, because backfill is no longer treated as “replaying the same log once.” Instead, it is explicitly modeled as two phases:

Historical phase: scan historical data from HistoricalStore.
Real-time phase: switch back to NativeLog and continue processing newly written data.

As a result, backfill evolves from replay mode into a unified execution model of historical batch scan plus real-time stream consumption.

Stage Three: HistoricalStore with Parallel Scan + NativeLog

The third step was to make the HistoricalStore phase support multi-threaded parallel scan.

This means the current architecture already has three layers of capability:

HistoricalStore handles high-throughput historical scans.
NativeLog handles low-latency real-time consumption.
The execution engine performs the switch within the same query.

The fundamental reason for the leap in backfill throughput is not “making replay faster.” It is changing the historical phase from replay to columnar scan, and then further making that scan parallel.
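To make the shape of a part-level parallel scan concrete, here is a minimal sketch in Python. It is not Timeplus code: the `parts` layout, `scan_part`, and `parallel_scan` are hypothetical names chosen for illustration, and the sketch only assumes that history is stored as independent parts that can each produce a partial aggregate.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "parts": each part stores its rows column by column.
parts = [
    {"sn": [1, 2, 3], "amount": [10, 20, 30]},
    {"sn": [4, 5], "amount": [5, 15]},
    {"sn": [6, 7, 8, 9], "amount": [1, 2, 3, 4]},
]


def scan_part(part):
    """Scan one part and return a partial aggregate (row count, sum of amount)."""
    amounts = part["amount"]
    return len(amounts), sum(amounts)


def parallel_scan(parts, threads=4):
    """Scan parts concurrently, then merge the partial aggregates."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(scan_part, parts))
    total_rows = sum(rows for rows, _ in partials)
    total_amount = sum(amount for _, amount in partials)
    return total_rows, total_amount


print(parallel_scan(parts))  # (9, 90)
```

The point of the sketch is the structure: parts are independent units of work, so adding threads does not change the result. In CPython, threads do not speed up pure-Python arithmetic, so this shows the decomposition rather than the throughput gain.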

Flexible Backfill Starting Points with seek_to

The `seek_to` setting of a Timeplus query supports not only `earliest`, but also a specific sequence number (SN) or offset, or an event time.

From the user’s perspective, triggering backfill only requires one SQL query. By adjusting the seek_to parameter, the backfill starting point can be controlled flexibly:

Full backfill: seek_to = 'earliest', recompute from the earliest available historical position.
Precise breakpoint recovery: seek_to = '1024', resume from a specified underlying sequence number (SN) or offset. This is commonly used for failure recovery or breakpoint continuation for external message streams such as Kafka.
Business-time slicing: seek_to = '2024-01-01 00:00:00' or a relative time such as '-1h'. In real business scenarios, defining the backfill range based on event time is often the most intuitive approach.
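The three kinds of starting point can be told apart from the value alone. The sketch below is an illustrative classifier in Python, not Timeplus's actual parser: the function names, the stream name `events`, and the set of relative-time units (`s`, `m`, `h`, `d`) are assumptions. The `SETTINGS seek_to='...'` placement follows the seek_to='earliest' form quoted earlier and should be checked against the Timeplus documentation.

```python
import re
from datetime import datetime


def classify_seek_to(value):
    """Classify a seek_to string into one of the starting-point kinds above.

    Illustrative only; this is not how Timeplus parses the setting.
    """
    if value == "earliest":
        return ("earliest", None)
    if re.fullmatch(r"\d+", value):
        return ("sequence_number", int(value))
    match = re.fullmatch(r"-(\d+)([smhd])", value)  # assumed unit set
    if match:
        return ("relative_time", (int(match.group(1)), match.group(2)))
    # Fall back to an absolute timestamp; raises ValueError if malformed.
    return ("absolute_time", datetime.strptime(value, "%Y-%m-%d %H:%M:%S"))


def backfill_query(stream, seek_to):
    """Build the query text; only the seek_to setting differs between forms."""
    classify_seek_to(seek_to)  # validate before building the query
    return f"SELECT * FROM {stream} SETTINGS seek_to='{seek_to}'"


print(classify_seek_to("1024"))  # ('sequence_number', 1024)
print(classify_seek_to("-1h"))   # ('relative_time', (1, 'h'))
print(backfill_query("events", "earliest"))
```

Whatever the form, the query text stays the same apart from the setting value, which is what lets one query serve full backfill, breakpoint recovery, and time-sliced repair.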

This turns backfill from a simple historical catch-up into a flexible tool that supports precise breakpoint recovery and time-sliced data repair.

Current Implementation: HistoricalStore Handles the Historical Phase, NativeLog Handles the Real-Time Phase

Timeplus now maintains two storage layers for each stream and places both layers into the same execution model.

For the Append Stream discussed in this article, HistoricalStore uses MergeTree columnar storage, which provides:

Columnar compression: significantly reduces I/O for historical backfill.
Vectorized execution: processes data by block/column, instead of decoding row by row.
Part-level parallelism: naturally supports multi-threaded historical scans.
Long-term retention: historical depth is no longer directly constrained by NativeLog retention.

In other words, writes still preserve the low-latency characteristics of a streaming system, while historical backfill is no longer limited by the throughput ceiling of log replay.

StreamingConcat: Switching Sources Within the Same SQL Query

When a query needs backfill, Timeplus enters StreamingConcat mode:

#include <cstdint>  // for uint8_t

// How a query reads its source stream.
enum class QueryMode : uint8_t
{
    Streaming,        // pure real-time stream
    StreamingConcat,  // historical backfill, then switch to real time
    Historical        // pure historical query, e.g. table(stream)
};

From the execution engine’s perspective, the process can be summarized in four steps:

Determine the switching upper bound when the query starts: record the latest sequence number (SN) in NativeLog.
Scan historical data from HistoricalStore: read from the starting point specified by seek_to up to this upper bound.
After the scan finishes, switch to NativeLog: continue consuming new data from the position after that upper bound.
Keep downstream operators unchanged: windows, aggregations, filters, and joins all continue to reuse the same pipeline.
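The four steps above can be sketched with two in-memory lists standing in for HistoricalStore and NativeLog. This is a simplified model, not the engine's implementation: `streaming_concat` and the `(sn, payload)` row shape are hypothetical, and the sketch assumes the historical store already contains every row up to the recorded upper bound.

```python
from itertools import chain


def streaming_concat(historical_store, native_log, seek_to_sn):
    """Return one iterator: historical rows first, then rows from the log.

    Rows are (sn, payload) tuples. Both containers are in-memory stand-ins.
    """
    # Step 1: fix the switching upper bound when the query starts.
    upper_bound = native_log[-1][0] if native_log else seek_to_sn - 1

    # Step 2: historical scan over [seek_to_sn, upper_bound].
    def historical_phase():
        for sn, payload in historical_store:
            if seek_to_sn <= sn <= upper_bound:
                yield sn, payload

    # Step 3: consume the log strictly after the upper bound.
    def realtime_phase():
        for sn, payload in native_log:
            if sn > upper_bound:
                yield sn, payload

    # Step 4: downstream sees a single iterator and never observes the switch.
    return chain(historical_phase(), realtime_phase())


historical_store = [(sn, f"row-{sn}") for sn in range(1, 6)]  # SN 1..5
native_log = [(4, "row-4"), (5, "row-5")]  # retention kept only the tail

query = streaming_concat(historical_store, native_log, seek_to_sn=1)

# Writes that arrive after the query started land in the log, and may
# later be flushed to the historical store as well.
native_log.append((6, "row-6"))
native_log.append((7, "row-7"))
historical_store.append((6, "row-6"))

sns = [sn for sn, _ in query]
print(sns)  # [1, 2, 3, 4, 5, 6, 7]
```

Because the bound is fixed once, row 6 is read only from the log even though it also reached the historical store, and rows 4 and 5 are read only from the historical store even though the log still holds them.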

The most important engineering implication is that the switch happens at the source layer, not the SQL layer.

Users only need one query, and can choose whether to start from earliest or a specified breakpoint:

Different forms only change the starting point. The switching mechanism is exactly the same.

What Makes the Switch Seamless

Seamless does not simply mean concatenating two result sets. It means satisfying all of the following:

No missed reads: new data written between the end of the historical phase and the start of the real-time phase must not be lost.
No duplicate reads: data before the historical upper bound must not be processed again after the switch.
Same semantic surface: both historical and real-time data enter the same windowing, aggregation, and state machine logic.
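The first two conditions can be stated as a check over the sequence numbers a query actually emitted. The Python sketch below is only an illustration: `check_handoff` is a hypothetical helper, and it assumes sequence numbers are contiguous integers, which a real log may not guarantee.

```python
from collections import Counter


def check_handoff(emitted_sns, first_sn, last_sn):
    """Return (missed, duplicated) sequence numbers for the expected range."""
    counts = Counter(emitted_sns)
    missed = [sn for sn in range(first_sn, last_sn + 1) if counts[sn] == 0]
    duplicated = sorted(sn for sn, n in counts.items() if n > 1)
    return missed, duplicated


# A stitched "offline job + online subscription" whose ranges overlap at
# SN 5 and skip SN 6.
stitched = [1, 2, 3, 4, 5] + [5, 7, 8]
print(check_handoff(stitched, 1, 8))  # ([6], [5])

# A single upper bound B: history covers SN <= B, real time covers SN > B.
B = 5
unified = [sn for sn in range(1, 9) if sn <= B] + [sn for sn in range(1, 9) if sn > B]
print(check_handoff(unified, 1, 8))  # ([], [])
```

With one shared bound, the two ranges partition the sequence space by construction, so neither list can be non-empty. Two independently configured jobs have no such guarantee.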

This is the essential difference between a unified streaming and batch system and a stitched “offline job + online subscription” approach. The former builds source switching into the execution engine. The latter usually requires the application layer to handle deduplication, alignment, and breakpoint continuation.

Ordered and Unordered Backfill

To maximize throughput, parallel backfill tends to output data out of order by default. But if the query semantics depend on event-time order, Timeplus chooses ordered mode.

Query Type	Requires Ordering	Reason
Global aggregations such as count() and sum()	No	Row order does not affect the final result.
tumble + emit_during_backfill	Yes	Window output needs to advance by event time.
lag() / lead() / time-series stateful operators	Yes	They depend on the order of previous and following events.
Pure detail backfill to downstream systems	Depends on downstream semantics	Unordered output is the default, prioritizing throughput.

To explicitly force ordered mode:

This shows that the current evolution is not merely “moving historical data to columnar storage.” It also makes a controlled tradeoff between parallel historical scan and ordered streaming semantics.
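Why ordering matters for some operators but not others can be seen in a small Python sketch. It is not how Timeplus implements ordered mode: the k-way merge via `heapq.merge` is a stand-in technique, and the part contents and function names are made up for illustration. The sketch assumes each part is already sorted by event time.

```python
import heapq

# Hypothetical output of two parallel part scans, as (event_time, value).
part_a = [(1, 10), (4, 40), (5, 50)]
part_b = [(2, 20), (3, 30), (6, 60)]


def unordered(parts):
    """Emit parts one after another: cheap, but event times are not sorted."""
    return [row for part in parts for row in part]


def ordered(parts):
    """K-way merge by event time; relies on each part being sorted already."""
    return list(heapq.merge(*parts, key=lambda row: row[0]))


def total(rows):
    """Order-insensitive aggregate, like sum()."""
    return sum(value for _, value in rows)


def lag_deltas(rows):
    """value minus previous value, the kind of result lag() feeds."""
    values = [value for _, value in rows]
    return [cur - prev for prev, cur in zip(values, values[1:])]


fast = unordered([part_a, part_b])
slow = ordered([part_a, part_b])

print(total(fast), total(slow))  # 210 210
print(lag_deltas(fast))          # [30, 10, -30, 10, 30]
print(lag_deltas(slow))          # [10, 10, 10, 10, 10]
```

The global sum is identical either way, so it can take the unordered path. The lag-style result is only correct after the merge, which is the cost that ordered mode pays.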

Real Performance Comparison: Redpanda Replay (Avro) vs NativeLog Replay vs HistoricalStore + NativeLog Seamless Switching

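Test method: local MacBook Pro M3 Max with 64 GB of memory, running a single-node Timeplus Proton v3.1.3. The bench_append stream was preloaded with 100 million rows. Queries used FORMAT Null to avoid interference from network output, so the measurements focus on read-path throughput. The 1-billion-row times in the table are extrapolated linearly from the measured speeds.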

Approach	Historical Read Path	Measured Speed	Estimated Time to Backfill 1 Billion Rows
Redpanda replay (Avro)	Kafka API + Avro decode + sequential replay	~0.5 M rows/s	~33 min
Timeplus NativeLog replay	Native block sequential replay	~50 M rows/s	~20 sec
HistoricalStore + NativeLog (1 thread)	Columnar historical scan + seamless switch to real time	80 M rows/s	~12.5 sec
HistoricalStore + NativeLog (2 threads)	Columnar historical scan + seamless switch to real time	117 M rows/s	~8.5 sec
HistoricalStore + NativeLog (4 threads)	Columnar historical scan + seamless switch to real time	193 M rows/s	~5.2 sec
HistoricalStore + NativeLog (automatic parallelism / aggregation)	Columnar historical scan + seamless switch to real time	457 M rows/s	~2.2 sec

This data shows a very clear technical evolution curve:

Redpanda replay (Avro) -> NativeLog replay: from protocol message replay to native block replay, improving throughput by about 100x.
NativeLog replay -> HistoricalStore + NativeLog: from log replay to columnar scan, still improving by about 1.6x with a single thread.
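HistoricalStore -> HistoricalStore with parallel scan: scaling the historical scan to multiple threads raises throughput to about 2.4x the single-threaded scan with 4 threads and about 5.7x with automatic parallelism, which is 3.9x to 9.1x the NativeLog replay baseline.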
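The ratios and estimated times can be rechecked directly from the table. The short Python calculation below uses only the measured speeds listed there; the dictionary keys are shorthand labels invented for this sketch. It shows that the 3.9x and 9.1x figures correspond to the 4-thread and automatic-parallelism runs measured against NativeLog replay.

```python
# Measured throughput in million rows per second, copied from the table.
throughput = {
    "redpanda_avro": 0.5,
    "nativelog": 50,
    "historical_1t": 80,
    "historical_2t": 117,
    "historical_4t": 193,
    "historical_auto": 457,
}

ROWS = 1_000_000_000  # backfill size used for the estimates


def seconds_for(rows, mrows_per_s):
    """Linear extrapolation: rows divided by rows per second."""
    return rows / (mrows_per_s * 1_000_000)


def speedup(new, base):
    """Throughput ratio between two approaches, rounded to one decimal."""
    return round(throughput[new] / throughput[base], 1)


for name, rate in throughput.items():
    print(f"{name:16s} {seconds_for(ROWS, rate):8.1f} s")

print(speedup("nativelog", "redpanda_avro"))        # 100.0
print(speedup("historical_1t", "nativelog"))        # 1.6
print(speedup("historical_4t", "nativelog"))        # 3.9
print(speedup("historical_auto", "nativelog"))      # 9.1
print(speedup("historical_4t", "historical_1t"))    # 2.4
print(speedup("historical_auto", "historical_1t"))  # 5.7
```

Measured against the single-threaded columnar scan rather than NativeLog replay, the parallel gain is 2.4x at 4 threads and 5.7x with automatic parallelism.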

Why HistoricalStore + NativeLog Is Up to Another Order of Magnitude Faster

The key is not just “more threads.” The read model has changed.

Redpanda / Kafka replay: Read message -> parse schema -> decode field by field -> construct row object -> send to executor

NativeLog replay: Read Native block -> replay blocks in log order -> send to executor

HistoricalStore backfill: Read parts by column -> decompress column segments -> vectorized computation -> scan multiple parts in parallel -> send to the same executor

The core differences are:

The historical phase is no longer replay, but scan.
The scan is no longer row-wise, but columnar.
Columnar scan can be naturally parallelized, while replay is hard to split at the same level.
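The difference between the row-wise and columnar read models can be illustrated with plain Python lists. This is a conceptual sketch only: the sample rows and helper names are invented, and it says nothing about the actual on-disk formats of Kafka, NativeLog, or MergeTree.

```python
# Row-oriented: each record is decoded into its own object before use.
rows = [
    {"user": "a", "latency_ms": 12, "status": 200},
    {"user": "b", "latency_ms": 30, "status": 500},
    {"user": "c", "latency_ms": 18, "status": 200},
]


def to_columns(rows):
    """Pivot row objects into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_rowwise(rows, column):
    """Visit every row object and pick one field out of each."""
    total = 0
    for row in rows:
        total += row[column]
    return total


def sum_columnar(columns, column):
    """Read one column as a whole and ignore the others."""
    return sum(columns[column])


columns = to_columns(rows)
print(sum_rowwise(rows, "latency_ms"))      # 60
print(sum_columnar(columns, "latency_ms"))  # 60
```

Both functions return the same answer, but the columnar version never touches `user` or `status`. In a real column store that translates into less I/O and into column segments that can be processed in blocks and in parallel.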

So the performance improvement at this stage is not a small protocol-level optimization. It comes from changing both the storage format and the execution model.

What This Data Really Shows

If we only look at “how many rows can be read at peak speed,” it is easy to misread the conclusion as “columnar storage is faster than Kafka.”

A more accurate conclusion is:

NativeLog guarantees low latency and switching continuity for the real-time phase.
HistoricalStore guarantees high throughput and long retention for the historical phase.
The execution engine combines both into one query path.

In other words, performance does not come only from HistoricalStore, nor only from NativeLog. It comes from the combination of HistoricalStore for history, NativeLog for real time, and the execution engine for the switch between them.

The Real Value of Unified Streaming and Batch Backfill

From an engineering implementation perspective, Timeplus does not solve just one isolated performance problem with backfill. It moves a problem that previously had to be stitched together at the application layer down into the database execution layer.

The direct results are:

The historical phase uses a batch-style high-throughput path: HistoricalStore provides columnar scan and parallelism.
The real-time phase uses a streaming low-latency path: NativeLog provides continuous consumption and real-time append.
The switch happens inside the engine: users maintain only one SQL query, one set of state semantics, and one downstream interface.

As a result, application teams no longer need to handle:

Aligning results between offline and online jobs.
Deduplicating and stitching historical and real-time results.
Preventing missed data and recomputation during the switching window.
Maintaining two pipelines over the long term.

This is why, once backfill becomes an execution-layer capability, its value is not only “faster,” but also “more usable, more stable, and easier to maintain.”

Summary

Timeplus first used NativeLog to solve how to continuously replay and catch up to the real-time stream, then used HistoricalStore to solve how to make historical backfill deep enough, fast enough, and scalable enough.

Architecturally, this maps to three steps:

NativeLog: first solve unified streaming semantics and seamless catch-up.
HistoricalStore + NativeLog: upgrade the historical phase from log replay to columnar scan.
HistoricalStore with parallel scan + NativeLog: scale the historical phase to multi-threaded parallelism while preserving streaming semantics.

At the same time, seek_to is no longer just a switch for “start reading from earliest.” It is a mechanism for precisely controlling the backfill starting point:

seek_to='earliest': start from the earliest available history.
seek_to='<sn-or-offset>': resume from a specified underlying breakpoint.
seek_to='<timestamp>': perform sliced data repair based on business time ranges, supporting both relative and absolute time.

Only when all of these things are true does backfill truly move from “can replay” to “production-ready.”

Appendix: Reproducible Timeplus Test Commands

Note: The Redpanda replay (Avro) data in the article is used to illustrate the historical replay baseline on an external messaging system. The appendix focuses on Timeplus-side commands that can be reproduced directly.

Get started with Timeplus Enterprise: timeplus.com/download

Join our Slack: timeplus.com/slack
