Detecting Anomalous Metrics Using Timeplus
- Junyue (Carol) Xue

- Oct 3
My name is Junyue (Carol) Xue, and I'm a third-year student at the University of British Columbia (UBC). I'm currently interning at Timeplus, where I work with Gang Tao on a real-time anomaly detection project. This blog combines technical details with personal reflections on applying classroom concepts to real trading systems.
Why System Stability Matters
During my quantitative trading internship, I learned firsthand how critical system stability is. Trading strategies must respond to millisecond-level signals. When CPU or memory indicators show anomalies, the entire strategy can experience delays or miss trading opportunities entirely.
Initially, when trades were delayed or rejected, I instinctively debugged the strategy itself. Over time, however, I realized that many issues weren't caused by strategic errors at all—they stemmed from dirty data and infrastructure anomalies: memory usage spikes, slow responses from the order management engine, or hidden queue buildups deep in the DevOps stack.
Understanding System-Level Anomalies
Many system-level anomalies are actually driven by spikes in business activity. For example:
When users place a large number of orders (tracked as openCount), the system initiates order processing logic, directly increasing CPU and GPU usage
During predictable busy periods—such as 9:00 AM market open—user activity surges before infrastructure metrics spike, sometimes with a 5–10 second delay
If we examine DevOps metrics in isolation, we might misclassify a healthy traffic spike as an anomaly. By incorporating business context, we can distinguish between expected high load and genuinely abnormal behavior.
Choosing the Right Approach: Rule-Based vs. Data-Driven
When building an anomaly detection system for production environments, one fundamental decision is whether to use rule-based or data-driven methods.
Rule-based systems are highly interpretable—each rule is grounded in well-understood business logic. However, when clear rules don't exist and data carries inherent uncertainty, we must turn to data-driven approaches for anomaly detection.
How to Detect Anomalies
Rather than treating anomaly detection purely as a statistical outlier problem, we reframe it as a regression-based prediction task.
Traditionally, many teams use tools like Prometheus + Grafana, ELK, or AWS CloudWatch for anomaly detection. While these solutions offer flexible visualization and rule-based configuration, they rely heavily on static thresholds or simple statistical rules. Prometheus lacks built-in machine learning capabilities, and AWS CloudWatch is limited to the AWS ecosystem. Most importantly, these traditional tools struggle to model the complex relationships between business behavior and system performance.
The Timeplus Advantage
Timeplus offers a unified real-time data streaming platform designed for modern anomaly detection:
Unified real-time data streams – seamlessly integrates both business and DevOps metrics without additional data pipelines
Built-in machine learning – supports regression modeling, residual-based detection, and more
Low-latency alerts – combines PyCaret-based regression predictions with residual detection to flag anomalies before they impact critical systems
This blog explores how I built a real-time anomaly detection system with Timeplus to identify potential issues before they affect trading strategies. The platform enables both traditional methods (detecting deviations from historical patterns) and advanced regression-based approaches that integrate business and DevOps metrics in real time, moving us beyond simple rule-based alerts.
An End-to-End Real-Time Machine Learning Flow
In this guide, we build a complete end-to-end example: predicting per-host CPU usage from the business activity metric openCount at 10-second intervals, learning a regression baseline, and flagging residual spikes as anomalies.
Unlike traditional batch machine learning, every step runs in-stream on Timeplus:
Data collection
Temporal alignment and windowing
Feature creation
Cleaning and normalization
Training handoff
Live inference
Data Collection
We work with two primary data types: business metrics, such as openCount (the number of orders users open), and DevOps metrics, such as per-host cpu_usage.
Understanding the Schema
When integrating production data, you can retrieve and understand stream schemas in two ways:
View schema directly from the data source: If your data source is Kafka, Pulsar, or another platform supporting schemas, create an External Stream in Timeplus to automatically fetch field definitions and data types
Use SQL to describe the stream: In Timeplus, use the DESCRIBE STREAM command to quickly view field names, data types, and comments
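For example, for the metrics stream defined just below, the second option is a one-liner that returns each field's name, type, and comment:

DESCRIBE STREAM default.t_metrics;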
For this anomaly detection use case, we focus on three core fields:
CREATE STREAM default.t_metrics
(
  `event_ts` uint64,   -- event timestamp in ms
  `metric`   string,   -- metric name (e.g. openCount, cpu_usage)
  `value`    float64   -- metric value
);

Feature Engineering
Feature engineering transforms raw data into meaningful features that help models learn relationships between inputs and outputs. At its core, a machine learning model fits a function. If we feed scattered, noisy raw data directly to the model, it often struggles to identify patterns.
Feature engineering bridges this gap by converting unstructured data into structured, informative inputs—for example, aggregating related metrics into a shared time context or creating temporal features like lags and rolling statistics.
From Narrow to Wide Tables
In Timeplus, production metric data is typically stored in a narrow table format, where each row represents a combination of timestamp + metric name + value. This design makes adding new metrics easy—simply append new rows without modifying the table schema.
Example narrow table (values below are illustrative):
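event_ts        metric       value
1759482000000   openCount    120
1759482000000   cpu_usage    42.5
1759482010000   openCount    135
1759482010000   cpu_usage    47.8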
During feature engineering, we need to transform this into a wide table where each timestamp corresponds to a single row, with each metric as a separate column (one column per feature). This format is essential for model training and residual calculation.
Example wide table (the same illustrative data, pivoted into one row per 10-second window):
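window_start          openCount   cpu_usage
2025-10-03 09:00:00   120         42.5
2025-10-03 09:00:10   135         47.8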
The Transformation Process
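The original post presents the pivoting materialized view at this point; below is a minimal sketch of what it could look like, assuming the t_metrics schema above, the view name mv_metrics_features_10s used in later queries, and Timeplus's snake_case function spellings (check the official docs for exact syntax). The per-host dimension is omitted for brevity.

-- Sketch only: pivot the narrow stream into one wide row per 10-second window.
CREATE MATERIALIZED VIEW mv_metrics_features_10s AS
SELECT
  window_start,
  tuple_element(array_first(m -> m.1 = 'openCount',  group_array((metric, value))), 2) AS openCount,
  tuple_element(array_first(m -> m.1 = 'closeCount', group_array((metric, value))), 2) AS closeCount,
  tuple_element(array_first(m -> m.1 = 'cpu_usage',  group_array((metric, value))), 2) AS cpu_usage
FROM tumble(default.t_metrics, 10s)   -- 10-second tumbling windows on the stream's event time
GROUP BY window_start;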
This materialized view aggregates the real-time narrow table into fixed 10-second tumbling windows using tumble(..., 10s). Within each window, all (metric, value) records are packed into an array using group_array, then array_first extracts values by metric name to convert the narrow vertical table into a wide horizontal table.
The process works as follows:
group_array((metric, value)) – Combines all (metric, value) pairs within the same 10-second window into a single array
array_first(...) – Finds the value corresponding to a specific metric name and assigns it to the appropriate column. For example, the value of cpu_usage is extracted and placed into the cpu_usage column
Each row now represents the complete set of features for a given host within a 10-second window, such as openCount and cpu_usage. This completes the core feature engineering: aligning metrics along the time dimension and expanding feature columns, providing directly usable data inputs for model training and real-time prediction.
Model Training
We assume a linear relationship between CPU usage and openCount. In Timeplus, we leverage a Python UDF (User-Defined Function) to train a regression model. The model predicts the expected CPU value under a given business load in real time. After generating predictions, we compare them with actual CPU usage to calculate residuals. If residuals significantly deviate from expectations, we flag that time point as an anomaly.
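In symbols, the model assumes predicted_cpu ≈ α + β * openCount for each 10-second window, with residual = actual_cpu - predicted_cpu; a residual far outside its typical range flags that window as anomalous.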
We start with a simple linear model for initial evaluation, only considering more advanced methods (Isolation Forest, kNN, PCA) if residuals exhibit clear nonlinear patterns or frequent false positives.
Understanding Python UDFs
Timeplus introduces Python UDFs, enabling users to seamlessly embed Python functions within SQL queries (see official documentation).
Batch Training Approach
In this end-to-end example, regression model training uses historical data in a single batch call, not streaming replay. Since linear regression depends only on the overall distribution of all historical samples and not on temporal event order, chronological replay isn't necessary.
The training process:
Uses table(...) to enforce bounded batch reads instead of unbounded streaming
Converts all numerical features to float64 type, ensuring consistent column order and data types across training and inference
Loads necessary dependencies (PyCaret regression modules, pandas, numpy, etc.)
Dynamically names feature columns as f0, f1, ... to guarantee consistency between training and prediction phases
For supervised learning, attaches an additional target column
Uses save_model(model_name) to persist the trained model, binding it to the provided string ID (here, 'test_regressor') for later loading by the prediction UDF
We extract sufficient samples from the historical wide table to compute regression coefficients α and β in batch mode. Once training completes, model parameters are saved for streaming predictions and residual-based anomaly detection. During real-time inference, these precomputed values can be directly referenced without retraining, avoiding unnecessary overhead from replaying historical data.
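The training call itself isn't reproduced in this post; as a rough sketch (the UDF name train_pycaret_regressor and its argument list are assumptions used purely for illustration), a bounded batch-training query could look like this:

-- Hypothetical batch-training call over the historical wide table.
-- table(...) makes the read bounded; inside the UDF, features become columns
-- f0, f1, ..., the target column is attached, and save_model persists the
-- model under the string ID 'test_regressor' for the prediction UDF to load.
SELECT
  train_pycaret_regressor(
    [to_float64(openCount), to_float64(closeCount)],   -- feature vector
    to_float64(cpu_usage),                             -- supervised target
    'test_regressor')                                  -- model ID reused at inference
FROM table(mv_metrics_features_10s);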
Model Deployment
After completing model training, we need to deploy the model into production to generate real-time predictions from streaming data. In Timeplus, there are two common deployment approaches:
1. API Integration (Model Deployed Outside Timeplus)
The model is deployed as an independent inference service (e.g., Flask, FastAPI, SageMaker). Timeplus retrieves predictions by calling this external API through stream queries.
Advantages:
Supports arbitrarily complex models (deep learning, large-scale regression, ensemble models)
Works with models implemented in Python, R, TensorFlow/ONNX, etc.
Disadvantages:
Network latency—each prediction requires a remote call
Requires maintaining API and supporting infrastructure
API failures directly affect prediction results
2. UDF Integration (Model Deployed Inside Timeplus)
The model is directly packaged as a Timeplus Python UDF and invoked locally within queries, with both prediction logic and data processing executed on the same platform.
Advantages:
Low latency
No external dependencies
Simplified deployment
Disadvantages:
Not suitable for computationally intensive or highly complex models
Model logic or parameter updates require updating the UDF
Since our use case involves a simple linear regression model requiring low-latency predictions in a real-time streaming environment, UDF integration is the most appropriate choice.
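To make the UDF route concrete, here is a rough sketch of what registering such a prediction UDF could look like. The registration grammar, the batched (list-per-column) calling convention, and the PyCaret calls are assumptions meant to illustrate the pattern, not the project's exact code; consult the Timeplus Python UDF documentation for the authoritative syntax.

-- Sketch only: a Python UDF that loads the saved PyCaret model and scores
-- each incoming feature vector. Arguments arrive as per-column batches (lists).
CREATE OR REPLACE FUNCTION predict_pycaret_regressor(features array(float64), model_name string)
RETURNS float64
LANGUAGE PYTHON AS $$
import pandas as pd

def predict_pycaret_regressor(features, model_name):
    from pycaret.regression import load_model, predict_model
    model = load_model(model_name[0])   # e.g. 'test_regressor'; a real UDF would cache this
    # Rebuild the f0, f1, ... feature frame so column names match training exactly.
    df = pd.DataFrame([list(row) for row in features],
                      columns=[f'f{i}' for i in range(len(features[0]))])
    preds = predict_model(model, data=df)
    return preds['prediction_label'].astype(float).tolist()   # one prediction per input row
$$;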
Prediction
Once model deployment completes, the prediction phase becomes straightforward. Using the trained regression parameters α and β together with the real-time business features (openCount and closeCount), we calculate the expected CPU usage. For each new row arriving in the real-time stream, the platform instantly generates a predicted CPU value, retrieved with a simple SELECT statement:
SELECT
  predict_pycaret_regressor([to_float64(openCount), to_float64(closeCount)],
                            'test_regressor')
FROM mv_metrics_features_10s;

From Prediction to Action
In a machine learning workflow, prediction is the model's final step, but for the overall business system, it's only the starting point of automated decision-making.
In this project, the model predicts each host's CPU usage for the next few seconds. These predictions can directly drive various business operations:
Automated actions:
Scaling up resources when high load is predicted
Restricting high-risk transactions
Blocking abnormal accounts
Restarting faulty nodes
Manual interventions:
Sending alerts to engineers or traders
Generating risk reports for management decision-making
Through this approach, prediction becomes more than numerical output—it becomes a core driver of real-time business decisions.
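As a concrete example of wiring prediction into action, the residual check can be expressed as a single streaming query (a sketch that reuses the view and UDF above; the threshold is illustrative and would be tuned, or made adaptive, in practice):

-- Flag windows where observed CPU deviates strongly from the predicted baseline.
SELECT
  window_start,
  cpu_usage,
  predict_pycaret_regressor([to_float64(openCount), to_float64(closeCount)],
                            'test_regressor') AS predicted_cpu,
  cpu_usage - predicted_cpu AS residual
FROM mv_metrics_features_10s
WHERE abs(cpu_usage - predicted_cpu) > 10;   -- illustrative residual threshold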
Key Takeaways
Initially, I thought anomaly detection was simply about catching outliers. This project taught me it's fundamentally about understanding behavior. By combining business metrics with system signals in a regression-driven, real-time pipeline, I gained visibility into how trading activity shapes infrastructure load—and learned to predict issues before they surface.
Timeplus made it possible to engineer features, pivot data, and detect anomalies within sub-second latency, turning raw data into actionable intelligence.
Why Real-Time Streaming Transforms Anomaly Detection
Real-time streaming shifts ML-based anomaly detection from a reactive process to a proactive defense:
Low-latency detection – Analyzes data the moment it arrives, identifying anomalies within seconds of their occurrence rather than minutes or hours later
Preserves temporal context – Captures the full sequence of events—spikes, drifts, and dynamic correlations—that batch processing often misses
Eliminates architectural complexity – Embedding model inference directly into the streaming pipeline removes the need for separate ETL jobs, data lakes, and batch processing schedules
Enables immediate action – Predictions trigger real-time responses: automated scaling, intelligent circuit breakers, targeted alerts, or graceful degradation strategies
This creates a complete detection-to-response feedback loop that transforms system operations from fire-fighting to proactive management.
Further Explorations
This project represents just the beginning. Several areas warrant deeper exploration:
Model sophistication: While linear regression provides a solid baseline, non-linear relationships between business metrics and system performance may benefit from gradient boosting or neural network approaches—particularly when modeling multiple interdependent services.
Multi-metric fusion: Current predictions use openCount alone. Incorporating additional business signals (order values, user cohorts, market volatility) and system metrics (network I/O, disk latency, queue depths) could significantly improve prediction accuracy.
Adaptive thresholds: Static residual thresholds work well in stable environments but struggle during traffic pattern changes. Implementing dynamic thresholds that adapt to recent baseline shifts would reduce false positives during legitimate load changes.
Causal analysis: When anomalies occur, understanding why matters as much as detecting that they happened. Integrating causal inference techniques could help distinguish between correlated metrics and true causal relationships, accelerating root cause analysis.
Conclusion
Building this system taught me that effective anomaly detection isn't just about sophisticated algorithms—it's about bridging the gap between business operations and infrastructure performance. The most valuable insights came not from the model itself, but from understanding how user behavior propagates through system layers.
For anyone building similar systems, I'd emphasize: start simple, measure rigorously, and iterate based on real operational feedback. A basic regression model that runs reliably in production beats a complex neural network that never ships. Focus first on the data pipeline, feature quality, and operational integration—the model can always be upgraded later.
The future of system reliability lies in these proactive, context-aware approaches. As systems grow more complex and user expectations for uptime increase, the ability to predict and prevent issues before they impact users will separate robust platforms from fragile ones.
Real-time anomaly detection isn't just a technical capability—it's a competitive advantage.


