
Detecting Anomalous Metrics Using Timeplus

  • Writer: Junyue (Carol) Xue
  • Oct 3
  • 9 min read

My name is Junyue (Carol) Xue, and I'm a third-year student at the University of British Columbia (UBC). I'm currently interning at Timeplus, where I work with Gang Tao on a real-time anomaly detection project. This blog combines technical details with personal reflections on applying classroom concepts to real trading systems.



Why System Stability Matters


During my quantitative trading internship, I learned firsthand how critical system stability is. Trading strategies must respond to millisecond-level signals. When CPU or memory indicators show anomalies, the entire strategy can experience delays or miss trading opportunities entirely.

Initially, when trades were delayed or rejected, I instinctively debugged the strategy itself. Over time, however, I realized that many issues weren't caused by strategic errors at all—they stemmed from dirty data and infrastructure anomalies: memory usage spikes, slow responses from the order management engine, or hidden queue buildups deep in the DevOps stack.



Understanding System-Level Anomalies


Many system-level anomalies are actually driven by spikes in business activity. For example:

  • When users place a large number of orders (tracked as openCount), the system initiates order processing logic, directly increasing CPU and GPU usage

  • During predictable busy periods—such as 9:00 AM market open—user activity surges before infrastructure metrics spike, sometimes with a 5–10 second delay

If we examine DevOps metrics in isolation, we might misclassify a healthy traffic spike as an anomaly. By incorporating business context, we can distinguish between expected high load and genuinely abnormal behavior.



Choosing the Right Approach: Rule-Based vs. Data-Driven


When building an anomaly detection system for production environments, one fundamental decision is whether to use rule-based or data-driven methods.


Rule-based systems are highly interpretable—each rule is grounded in well-understood business logic. However, when clear rules don't exist and data carries inherent uncertainty, we must turn to data-driven approaches for anomaly detection.



How to Detect Anomalies


Rather than treating anomaly detection purely as a statistical outlier problem, we reframe it as a regression-based prediction task.


Traditionally, many teams use tools like Prometheus + Grafana, ELK, or AWS CloudWatch for anomaly detection. While these solutions offer flexible visualization and rule-based configuration, they rely heavily on static thresholds or simple statistical rules. Prometheus lacks built-in machine learning capabilities, and AWS CloudWatch is limited to the AWS ecosystem. Most importantly, these traditional tools struggle to model the complex relationships between business behavior and system performance.


The Timeplus Advantage


Timeplus offers a unified real-time data streaming platform designed for modern anomaly detection:

  • Unified real-time data streams – seamlessly integrates both business and DevOps metrics without additional data pipelines

  • Built-in machine learning – supports regression modeling, residual-based detection, and more

  • Low-latency alerts – combines PyCaret-based regression predictions with residual detection to flag anomalies before they impact critical systems


This blog explores how I built a real-time anomaly detection system with Timeplus to identify potential issues before they affect trading strategies. The platform enables both traditional methods (detecting deviations from historical patterns) and advanced regression-based approaches that integrate business and DevOps metrics in real time, moving us beyond simple rule-based alerts.



An End-to-End Real-Time Machine Learning Flow


In this guide, we build a complete end-to-end example: predicting per-host CPU usage from the business activity metric openCount at 10-second intervals, learning a regression baseline, and flagging residual spikes as anomalies.

Unlike traditional batch machine learning, every step runs in-stream on Timeplus:

  • Data collection

  • Temporal alignment and windowing

  • Feature creation

  • Cleaning and normalization

  • Training handoff

  • Live inference


Data Collection


We work with two primary data types:

| Data Type | Raw Examples | Source | Generation Process | Final Metric | Usage |
|---|---|---|---|---|---|
| Business Activity | request_log, api_call | Application server logs, API tracing | Aggregated over 10s windows, counting requests per host | openCount | Main independent variable: predicts CPU usage based on request volume |
| System Performance | cpu_usage_raw, mem_raw | System monitoring agents (Prometheus, Datadog, etc.) | Within 10s windows, calculates average or peak CPU usage, memory, etc. | cpu_usage | Prediction target: predicted from openCount; residuals flag anomalies |


Understanding the Schema


When integrating production data, you can retrieve and understand stream schemas in two ways:

  1. View schema directly from the data source: If your data source is Kafka, Pulsar, or another platform supporting schemas, create an External Stream in Timeplus to automatically fetch field definitions and data types

  2. Use SQL to describe the stream: In Timeplus, use the DESCRIBE STREAM command to quickly view field names, data types, and comments
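
For example, for the stream used in this post (the exact output columns depend on your Timeplus version):

DESCRIBE STREAM default.t_metrics;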


For this anomaly detection use case, we focus on three core fields:


CREATE STREAM default.t_metrics
(
    `event_ts`  uint64,   -- event timestamp in ms
    `metric`    string,   -- metric name (e.g. openCount, cpu_usage)
    `value`     float64   -- metric value
);


Feature Engineering


Feature engineering transforms raw data into meaningful features that help models learn relationships between inputs and outputs. At its core, a machine learning model fits a function. If we feed scattered, noisy raw data directly to the model, it often struggles to identify patterns.

Feature engineering bridges this gap by converting unstructured data into structured, informative inputs—for example, aggregating related metrics into a shared time context or creating temporal features like lags and rolling statistics.
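
As a quick illustration of the second kind of feature, here is a pandas sketch of lag and rolling-mean features over a wide table (the column names and values are hypothetical; in this project the equivalent work happens in-stream):

import pandas as pd

# Hypothetical wide table: one row per 10-second window.
df = pd.DataFrame({
    "openCount": [123, 140, 155, 150],
    "cpu_usage": [0.75, 0.78, 0.83, 0.81],
})

# Lag feature: business activity from the previous window.
df["openCount_lag1"] = df["openCount"].shift(1)

# Rolling statistic: mean CPU usage over the last three windows.
df["cpu_usage_roll3"] = df["cpu_usage"].rolling(window=3).mean()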



From Narrow to Wide Tables


In Timeplus, production metric data is typically stored in a narrow table format, where each row represents a combination of timestamp + metric name + value. This design makes adding new metrics easy—simply append new rows without modifying the table schema.


Example narrow table:

| event_ts | metric | value |
|---|---|---|
| 1723852200000 | cpu | 0.75 |
| 1723852200000 | openCount | 123 |
| 1723852210000 | cpu | 0.78 |
| 1723852210000 | openCount | 140 |

During feature engineering, we need to transform this into a wide table where each timestamp corresponds to a single row, with each metric as a separate column (one column per feature). This format is essential for model training and residual calculation.


Example wide table:

| _tp_time | cpu | openCount |
|---|---|---|
| 1723852200000 | 0.75 | 123 |
| 1723852210000 | 0.78 | 140 |


The Transformation Process

A materialized view (sketched after the steps below) aggregates the real-time narrow table into fixed 10-second tumbling windows using tumble(..., 10s). Within each window, all (metric, value) records are packed into an array using group_array, and array_first then extracts each value by metric name, converting the narrow vertical table into a wide horizontal table.


The process works as follows:

  1. group_array((metric, value)) – Combines all (metric, value) pairs within the same 10-second window into a single array

  2. array_first(...) – Finds the value corresponding to a specific metric name and assigns it to the appropriate column. For example, the value of cpu_usage is extracted and placed into the cpu_usage column
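
A minimal sketch of such a view, assuming it is named mv_metrics_features_10s (the name queried later in the prediction step), that windowing runs on the stream's default _tp_time rather than the raw event_ts, and that only the two metrics discussed here are extracted (the production view also carries other business metrics such as closeCount). Exact lambda and tuple-access syntax may differ across Timeplus versions:

CREATE MATERIALIZED VIEW mv_metrics_features_10s AS
SELECT
    window_start AS _tp_time,
    -- pack every (metric, value) pair in the 10s window into one array,
    -- then pull each metric out by name into its own column
    array_first(x -> x.1 = 'openCount', group_array((metric, value))).2 AS openCount,
    array_first(x -> x.1 = 'cpu_usage', group_array((metric, value))).2 AS cpu_usage
FROM tumble(default.t_metrics, 10s)
GROUP BY window_start;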


Each row now represents the complete set of features for a given host within a 10-second window, such as openCount and cpu_usage. This completes the core feature engineering: aligning metrics along the time dimension and expanding feature columns, providing directly usable data inputs for model training and real-time prediction.



Model Training


We assume a linear relationship between CPU usage and openCount. In Timeplus, we leverage a Python UDF (User-Defined Function) to train a regression model. The model predicts the expected CPU value under a given business load in real time. After generating predictions, we compare them with actual CPU usage to calculate residuals. If residuals significantly deviate from expectations, we flag that time point as an anomaly.
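
In symbols, the baseline and the residual rule are (a sketch; the multiplier k on the residual standard deviation is an assumption about how the threshold would be set):

\hat{y}_t = \alpha + \beta \cdot \mathrm{openCount}_t, \qquad r_t = y_t - \hat{y}_t, \qquad |r_t| > k \cdot \sigma_r \;\Rightarrow\; \text{window } t \text{ is flagged as anomalous}

where y_t is the observed cpu_usage in window t and sigma_r is the standard deviation of historical residuals.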


We start with a simple linear model for initial evaluation, only considering more advanced methods (Isolation Forest, kNN, PCA) if residuals exhibit clear nonlinear patterns or frequent false positives.



Understanding Python UDFs


Timeplus introduces Python UDFs, enabling users to seamlessly embed Python functions within SQL queries (see official documentation).



Batch Training Approach


In this end-to-end example, regression model training uses historical data in a single batch call, not streaming replay. Since linear regression depends only on the overall distribution of all historical samples and not on temporal event order, chronological replay isn't necessary.


The training process:

  • Uses table(...) to enforce bounded batch reads instead of unbounded streaming

  • Converts all numerical features to float64 type, ensuring consistent column order and data types across training and inference

  • Loads necessary dependencies (PyCaret regression modules, pandas, numpy, etc.)

  • Dynamically names feature columns as f0, f1, ... to guarantee consistency between training and prediction phases

  • For supervised learning, attaches an additional target column

  • Uses save_model(model_name) to persist the trained model, binding it to the provided string ID (here, 'test_regressor') for later loading by the prediction UDF


We extract sufficient samples from the historical wide table to compute regression coefficients α and β in batch mode. Once training completes, model parameters are saved for streaming predictions and residual-based anomaly detection. During real-time inference, these precomputed values can be directly referenced without retraining, avoiding unnecessary overhead from replaying historical data.
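
A plain-Python sketch of the training logic such a UDF wraps, following the steps above (the UDF registration wrapper is omitted, and the function name, argument shapes, and the choice of PyCaret's 'lr' linear-regression estimator are assumptions):

import pandas as pd
from pycaret.regression import setup, create_model, save_model

def train_regressor(features, target, model_name):
    # Name feature columns f0, f1, ... so training and inference agree.
    cols = [f"f{i}" for i in range(len(features[0]))]
    df = pd.DataFrame(features, columns=cols).astype("float64")
    df["target"] = pd.Series(target, dtype="float64")

    # Fit a linear regression on the bounded historical batch.
    setup(data=df, target="target", session_id=42, verbose=False)
    model = create_model("lr")

    # Persist under the given ID (e.g. 'test_regressor') for the prediction UDF to load.
    save_model(model, model_name)
    return "trained: " + model_name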



Model Deployment

After completing model training, we need to deploy the model into production to generate real-time predictions from streaming data. In Timeplus, there are two common deployment approaches:


1. API Integration (Model Deployed Outside Timeplus)


The model is deployed as an independent inference service (e.g., Flask, FastAPI, SageMaker). Timeplus retrieves predictions by calling this external API through stream queries.


Advantages:

  • Supports arbitrarily complex models (deep learning, large-scale regression, ensemble models)

  • Works with models implemented in Python, R, TensorFlow/ONNX, etc.


Disadvantages:

  • Network latency—each prediction requires a remote call

  • Requires maintaining API and supporting infrastructure

  • API failures directly affect prediction results



2. UDF Integration (Model Deployed Inside Timeplus)

The model is directly packaged as a Timeplus Python UDF and invoked locally within queries, with both prediction logic and data processing executed on the same platform.


Advantages:

  • Low latency

  • No external dependencies

  • Simplified deployment


Disadvantages:

  • Not suitable for computationally intensive or highly complex models

  • Model logic or parameter updates require updating the UDF


Since our use case involves a simple linear regression model requiring low-latency predictions in a real-time streaming environment, UDF integration is the most appropriate choice.
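
For reference, a plain-Python sketch of what the prediction side of the UDF does (registration wrapper again omitted; loading the model once at start-up and the 'prediction_label' output column name, which PyCaret 3.x uses for regression, are assumptions):

import pandas as pd
from pycaret.regression import load_model, predict_model

# Load the persisted model once, not on every row.
model = load_model("test_regressor")

def predict_cpu(features):
    # Rebuild the f0, f1, ... column layout used at training time.
    cols = [f"f{i}" for i in range(len(features))]
    df = pd.DataFrame([features], columns=cols).astype("float64")
    # predict_model appends the regression output as 'prediction_label'.
    return float(predict_model(model, data=df)["prediction_label"].iloc[0])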



Prediction


Once model deployment completes, the prediction phase becomes straightforward. Using the trained regression parameters α and β combined with real-time openCount features, we calculate expected CPU usage. For each new incoming row in the real-time stream, the platform instantly generates predicted CPU values, easily retrieved using a simple SELECT statement:

SELECT
    predict_pycaret_regressor(
        [to_float64(openCount), to_float64(closeCount)],
        'test_regressor'
    )
FROM
    mv_metrics_features_10s;
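
Extending that query into the residual check described earlier is a small step; a hedged sketch follows (the 0.2 threshold is a placeholder and would normally come from the historical residual distribution):

SELECT
    _tp_time,
    cpu_usage,
    predict_pycaret_regressor(
        [to_float64(openCount), to_float64(closeCount)],
        'test_regressor'
    ) AS predicted_cpu,
    cpu_usage - predicted_cpu AS residual
FROM
    mv_metrics_features_10s
WHERE
    abs(residual) > 0.2;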

From Prediction to Action


In a machine learning workflow, prediction is the model's final step, but for the overall business system, it's only the starting point of automated decision-making.

In this project, the model predicts each host's CPU usage for the next few seconds. These predictions can directly drive various business operations:


Automated actions:

  • Scaling up resources when high load is predicted

  • Restricting high-risk transactions

  • Blocking abnormal accounts

  • Restarting faulty nodes


Manual interventions:

  • Sending alerts to engineers or traders

  • Generating risk reports for management decision-making


Through this approach, prediction becomes more than numerical output—it becomes a core driver of real-time business decisions.



Key Takeaways


Initially, I thought anomaly detection was simply about catching outliers. This project taught me it's fundamentally about understanding behavior. By combining business metrics with system signals in a regression-driven, real-time pipeline, I gained visibility into how trading activity shapes infrastructure load—and learned to predict issues before they surface.

Timeplus made it possible to engineer features, pivot data, and detect anomalies within sub-second latency, turning raw data into actionable intelligence.



Why Real-Time Streaming Transforms Anomaly Detection


Real-time streaming shifts ML-based anomaly detection from a reactive process to a proactive defence:

  1. Low-latency detection – Analyzes data the moment it arrives, identifying anomalies within seconds of their occurrence rather than minutes or hours later

  2. Preserves temporal context – Captures the full sequence of events—spikes, drifts, and dynamic correlations—that batch processing often misses

  3. Eliminates architectural complexity – Embedding model inference directly into the streaming pipeline removes the need for separate ETL jobs, data lakes, and batch processing schedules

  4. Enables immediate action – Predictions trigger real-time responses: automated scaling, intelligent circuit breakers, targeted alerts, or graceful degradation strategies


This creates a complete detection-to-response feedback loop that transforms system operations from fire-fighting to proactive management.



Further Explorations


This project represents just the beginning. Several areas warrant deeper exploration:


Model sophistication: While linear regression provides a solid baseline, non-linear relationships between business metrics and system performance may benefit from gradient boosting or neural network approaches—particularly when modeling multiple interdependent services.


Multi-metric fusion: Current predictions use openCount alone. Incorporating additional business signals (order values, user cohorts, market volatility) and system metrics (network I/O, disk latency, queue depths) could significantly improve prediction accuracy.


Adaptive thresholds: Static residual thresholds work well in stable environments but struggle during traffic pattern changes. Implementing dynamic thresholds that adapt to recent baseline shifts would reduce false positives during legitimate load changes.


Causal analysis: When anomalies occur, understanding why matters as much as detecting that they happened. Integrating causal inference techniques could help distinguish between correlated metrics and true causal relationships, accelerating root cause analysis.



Conclusion


Building this system taught me that effective anomaly detection isn't just about sophisticated algorithms—it's about bridging the gap between business operations and infrastructure performance. The most valuable insights came not from the model itself, but from understanding how user behavior propagates through system layers.


For anyone building similar systems, I'd emphasize: start simple, measure rigorously, and iterate based on real operational feedback. A basic regression model that runs reliably in production beats a complex neural network that never ships. Focus first on the data pipeline, feature quality, and operational integration—the model can always be upgraded later.


The future of system reliability lies in these proactive, context-aware approaches. As systems grow more complex and user expectations for uptime increase, the ability to predict and prevent issues before they impact users will separate robust platforms from fragile ones.


Real-time anomaly detection isn't just a technical capability—it's a competitive advantage.

 
 