
Detecting Anomalous Metrics Using Timeplus

  • Writer: Junyue (Carol) Xue
  • Oct 3
  • 9 min read

My name is Junyue (Carol) Xue, and I'm a third-year student at the University of British Columbia (UBC). I'm currently interning at Timeplus, where I work with Gang Tao on a real-time anomaly detection project. This blog combines technical details with personal reflections on applying classroom concepts to real trading systems.



Why System Stability Matters


During my quantitative trading internship, I learned firsthand how critical system stability is. Trading strategies must respond to millisecond-level signals. When CPU or memory indicators show anomalies, the entire strategy can experience delays or miss trading opportunities entirely.

Initially, when trades were delayed or rejected, I instinctively debugged the strategy itself. Over time, however, I realized that many issues weren't caused by strategic errors at all—they stemmed from dirty data and infrastructure anomalies: memory usage spikes, slow responses from the order management engine, or hidden queue buildups deep in the DevOps stack.



Understanding System-Level Anomalies


Many system-level anomalies are actually driven by spikes in business activity. For example:

  • When users place a large number of orders (tracked as openCount), the system initiates order processing logic, directly increasing CPU and GPU usage

  • During predictable busy periods—such as 9:00 AM market open—user activity surges before infrastructure metrics spike, sometimes with a 5–10 second delay

If we examine DevOps metrics in isolation, we might misclassify a healthy traffic spike as an anomaly. By incorporating business context, we can distinguish between expected high load and genuinely abnormal behavior.



Choosing the Right Approach: Rule-Based vs. Data-Driven


When building an anomaly detection system for production environments, one fundamental decision is whether to use rule-based or data-driven methods.


Rule-based systems are highly interpretable—each rule is grounded in well-understood business logic. However, when clear rules don't exist and data carries inherent uncertainty, we must turn to data-driven approaches for anomaly detection.



How to Detect Anomalies


Rather than treating anomaly detection purely as a statistical outlier problem, we reframe it as a regression-based prediction task.


Traditionally, many teams use tools like Prometheus + Grafana, ELK, or AWS CloudWatch for anomaly detection. While these solutions offer flexible visualization and rule-based configuration, they rely heavily on static thresholds or simple statistical rules. Prometheus lacks built-in machine learning capabilities, and AWS CloudWatch is limited to the AWS ecosystem. Most importantly, these traditional tools struggle to model the complex relationships between business behavior and system performance.


The Timeplus Advantage


Timeplus offers a unified real-time data streaming platform designed for modern anomaly detection:

  • Unified real-time data streams – seamlessly integrates both business and DevOps metrics without additional data pipelines

  • Built-in machine learning – supports regression modeling, residual-based detection, and more

  • Low-latency alerts – combines PyCaret-based regression predictions with residual detection to flag anomalies before they impact critical systems


This blog explores how I built a real-time anomaly detection system with Timeplus to identify potential issues before they affect trading strategies. The platform enables both traditional methods (detecting deviations from historical patterns) and advanced regression-based approaches that integrate business and DevOps metrics in real time, moving us beyond simple rule-based alerts.



An End-to-End Real-Time Machine Learning Flow


In this guide, we build a complete end-to-end example: predicting per-host CPU usage from the business activity metric openCount at 10-second intervals, learning a regression baseline, and flagging residual spikes as anomalies.

Unlike traditional batch machine learning, every step runs in-stream on Timeplus:

  • Data collection

  • Temporal alignment and windowing

  • Feature creation

  • Cleaning and normalization

  • Training handoff

  • Live inference


Data Collection


We work with two primary data types:

| Data Type | Raw Examples | Source | Generation Process | Final Metric | Usage |
|---|---|---|---|---|---|
| Business Activity | request_log, api_call | Application server logs, API tracing | Aggregated over 10s windows, counting requests per host | openCount | Main independent variable: predicts CPU usage based on request volume |
| System Performance | cpu_usage_raw, mem_raw | System monitoring agents (Prometheus, Datadog, etc.) | Within 10s windows, calculates average or peak CPU usage, memory, etc. | cpu_usage | Prediction target: predicted from openCount; residuals flag anomalies |


Understanding the Schema


When integrating production data, you can retrieve and understand stream schemas in two ways:

  1. View schema directly from the data source: If your data source is Kafka, Pulsar, or another platform supporting schemas, create an External Stream in Timeplus to automatically fetch field definitions and data types

  2. Use SQL to describe the stream: In Timeplus, use the DESCRIBE STREAM command to quickly view field names, data types, and comments
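
For example, for the stream used in this post (the exact output columns depend on your Timeplus version):

DESCRIBE STREAM default.t_metrics;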


For this anomaly detection use case, we focus on three core fields:


CREATE STREAM default.t_metrics
(
    `event_ts`  uint64,   -- event timestamp in ms
    `metric`    string,   -- metric name (e.g. openCount, cpu_usage)
    `value`     float64   -- metric value
);


Feature Engineering


Feature engineering transforms raw data into meaningful features that help models learn relationships between inputs and outputs. At its core, a machine learning model fits a function. If we feed scattered, noisy raw data directly to the model, it often struggles to identify patterns.

Feature engineering bridges this gap by converting unstructured data into structured, informative inputs—for example, aggregating related metrics into a shared time context or creating temporal features like lags and rolling statistics.
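
As a quick illustration of the second kind of feature, here is a pandas sketch of lag and rolling-mean features over a wide table (the column names and values are hypothetical; in this project the equivalent work happens in-stream):

import pandas as pd

# Hypothetical wide table: one row per 10-second window.
df = pd.DataFrame({
    "openCount": [123, 140, 155, 150],
    "cpu_usage": [0.75, 0.78, 0.83, 0.81],
})

# Lag feature: business activity from the previous window.
df["openCount_lag1"] = df["openCount"].shift(1)

# Rolling statistic: mean CPU usage over the last three windows.
df["cpu_usage_roll3"] = df["cpu_usage"].rolling(window=3).mean()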



From Narrow to Wide Tables


In Timeplus, production metric data is typically stored in a narrow table format, where each row represents a combination of timestamp + metric name + value. This design makes adding new metrics easy—simply append new rows without modifying the table schema.


Example narrow table:

| event_ts | metric | value |
|---|---|---|
| 1723852200000 | cpu | 0.75 |
| 1723852200000 | openCount | 123 |
| 1723852210000 | cpu | 0.78 |
| 1723852210000 | openCount | 140 |

During feature engineering, we need to transform this into a wide table where each timestamp corresponds to a single row, with each metric as a separate column (one column per feature). This format is essential for model training and residual calculation.


Example wide table:

| _tp_time | cpu | openCount |
|---|---|---|
| 1723852200000 | 0.75 | 123 |
| 1723852210000 | 0.78 | 140 |


The Transformation Process

A materialized view (sketched after the steps below) aggregates the real-time narrow table into fixed 10-second tumbling windows using tumble(..., 10s). Within each window, all (metric, value) records are packed into an array using group_array, and array_first then extracts each value by metric name, converting the narrow vertical table into a wide horizontal table.


The process works as follows:

  1. group_array((metric, value)) – Combines all (metric, value) pairs within the same 10-second window into a single array

  2. array_first(...) – Finds the value corresponding to a specific metric name and assigns it to the appropriate column. For example, the value of cpu_usage is extracted and placed into the cpu_usage column
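
A minimal sketch of such a view, assuming it is named mv_metrics_features_10s (the name queried later in the prediction step), that windowing runs on the stream's default _tp_time rather than the raw event_ts, and that only the two metrics discussed here are extracted (the production view also carries other business metrics such as closeCount). Exact lambda and tuple-access syntax may differ across Timeplus versions:

CREATE MATERIALIZED VIEW mv_metrics_features_10s AS
SELECT
    window_start AS _tp_time,
    -- pack every (metric, value) pair in the 10s window into one array,
    -- then pull each metric out by name into its own column
    array_first(x -> x.1 = 'openCount', group_array((metric, value))).2 AS openCount,
    array_first(x -> x.1 = 'cpu_usage', group_array((metric, value))).2 AS cpu_usage
FROM tumble(default.t_metrics, 10s)
GROUP BY window_start;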


Each row now represents the complete set of features for a given host within a 10-second window, such as openCount and cpu_usage. This completes the core feature engineering: aligning metrics along the time dimension and expanding feature columns, providing directly usable data inputs for model training and real-time prediction.



Model Training


We assume a linear relationship between CPU usage and openCount. In Timeplus, we leverage a Python UDF (User-Defined Function) to train a regression model. The model predicts the expected CPU value under a given business load in real time. After generating predictions, we compare them with actual CPU usage to calculate residuals. If residuals significantly deviate from expectations, we flag that time point as an anomaly.
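
In symbols, the baseline and the residual rule are (a sketch; the multiplier k on the residual standard deviation is an assumption about how the threshold would be set):

\hat{y}_t = \alpha + \beta \cdot \mathrm{openCount}_t, \qquad r_t = y_t - \hat{y}_t, \qquad |r_t| > k \cdot \sigma_r \;\Rightarrow\; \text{window } t \text{ is flagged as anomalous}

where y_t is the observed cpu_usage in window t and sigma_r is the standard deviation of historical residuals.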


We start with a simple linear model for initial evaluation, only considering more advanced methods (Isolation Forest, kNN, PCA) if residuals exhibit clear nonlinear patterns or frequent false positives.



Understanding Python UDFs


Timeplus introduces Python UDFs, enabling users to seamlessly embed Python functions within SQL queries (see official documentation).



Batch Training Approach


In this end-to-end example, regression model training uses historical data in a single batch call, not streaming replay. Since linear regression depends only on the overall distribution of all historical samples and not on temporal event order, chronological replay isn't necessary.


The training process:

  • Uses table(...) to enforce bounded batch reads instead of unbounded streaming

  • Converts all numerical features to float64 type, ensuring consistent column order and data types across training and inference

  • Loads necessary dependencies (PyCaret regression modules, pandas, numpy, etc.)

  • Dynamically names feature columns as f0, f1, ... to guarantee consistency between training and prediction phases

  • For supervised learning, attaches an additional target column

  • Uses save_model(model_name) to persist the trained model, binding it to the provided string ID (here, 'test_regressor') for later loading by the prediction UDF


We extract sufficient samples from the historical wide table to compute regression coefficients α and β in batch mode. Once training completes, model parameters are saved for streaming predictions and residual-based anomaly detection. During real-time inference, these precomputed values can be directly referenced without retraining, avoiding unnecessary overhead from replaying historical data.
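
A plain-Python sketch of the training logic such a UDF wraps, following the steps above (the UDF registration wrapper is omitted, and the function name, argument shapes, and the choice of PyCaret's 'lr' linear-regression estimator are assumptions):

import pandas as pd
from pycaret.regression import setup, create_model, save_model

def train_regressor(features, target, model_name):
    # Name feature columns f0, f1, ... so training and inference agree.
    cols = [f"f{i}" for i in range(len(features[0]))]
    df = pd.DataFrame(features, columns=cols).astype("float64")
    df["target"] = pd.Series(target, dtype="float64")

    # Fit a linear regression on the bounded historical batch.
    setup(data=df, target="target", session_id=42, verbose=False)
    model = create_model("lr")

    # Persist under the given ID (e.g. 'test_regressor') for the prediction UDF to load.
    save_model(model, model_name)
    return "trained: " + model_name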



Model Deployment

After completing model training, we need to deploy the model into production to generate real-time predictions from streaming data. In Timeplus, there are two common deployment approaches:


1. API Integration (Model Deployed Outside Timeplus)


The model is deployed as an independent inference service (e.g., Flask, FastAPI, SageMaker). Timeplus retrieves predictions by calling this external API through stream queries.


Advantages:

  • Supports arbitrarily complex models (deep learning, large-scale regression, ensemble models)

  • Works with models implemented in Python, R, TensorFlow/ONNX, etc.


Disadvantages:

  • Network latency—each prediction requires a remote call

  • Requires maintaining API and supporting infrastructure

  • API failures directly affect prediction results



2. UDF Integration (Model Deployed Inside Timeplus)

The model is directly packaged as a Timeplus Python UDF and invoked locally within queries, with both prediction logic and data processing executed on the same platform.


Advantages:

  • Low latency

  • No external dependencies

  • Simplified deployment


Disadvantages:

  • Not suitable for computationally intensive or highly complex models

  • Model logic or parameter updates require updating the UDF


Since our use case involves a simple linear regression model requiring low-latency predictions in a real-time streaming environment, UDF integration is the most appropriate choice.
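
For reference, a plain-Python sketch of what the prediction side of the UDF does (registration wrapper again omitted; loading the model once at start-up and the 'prediction_label' output column name, which PyCaret 3.x uses for regression, are assumptions):

import pandas as pd
from pycaret.regression import load_model, predict_model

# Load the persisted model once, not on every row.
model = load_model("test_regressor")

def predict_cpu(features):
    # Rebuild the f0, f1, ... column layout used at training time.
    cols = [f"f{i}" for i in range(len(features))]
    df = pd.DataFrame([features], columns=cols).astype("float64")
    # predict_model appends the regression output as 'prediction_label'.
    return float(predict_model(model, data=df)["prediction_label"].iloc[0])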



Prediction


Once model deployment completes, the prediction phase becomes straightforward. Using the trained regression parameters α and β combined with real-time openCount features, we calculate expected CPU usage. For each new incoming row in the real-time stream, the platform instantly generates predicted CPU values, easily retrieved using a simple SELECT statement:

SELECT
    predict_pycaret_regressor(
        [to_float64(openCount), to_float64(closeCount)],
        'test_regressor'
    )
FROM
    mv_metrics_features_10s;
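
Extending that query into the residual check described earlier is a small step; a hedged sketch follows (the 0.2 threshold is a placeholder and would normally come from the historical residual distribution):

SELECT
    _tp_time,
    cpu_usage,
    predict_pycaret_regressor(
        [to_float64(openCount), to_float64(closeCount)],
        'test_regressor'
    ) AS predicted_cpu,
    cpu_usage - predicted_cpu AS residual
FROM
    mv_metrics_features_10s
WHERE
    abs(residual) > 0.2;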

From Prediction to Action


In a machine learning workflow, prediction is the model's final step, but for the overall business system, it's only the starting point of automated decision-making.

In this project, the model predicts each host's CPU usage for the next few seconds. These predictions can directly drive various business operations:


Automated actions:

  • Scaling up resources when high load is predicted

  • Restricting high-risk transactions

  • Blocking abnormal accounts

  • Restarting faulty nodes


Manual interventions:

  • Sending alerts to engineers or traders

  • Generating risk reports for management decision-making


Through this approach, prediction becomes more than numerical output—it becomes a core driver of real-time business decisions.



Key Takeaways


Initially, I thought anomaly detection was simply about catching outliers. This project taught me it's fundamentally about understanding behavior. By combining business metrics with system signals in a regression-driven, real-time pipeline, I gained visibility into how trading activity shapes infrastructure load—and learned to predict issues before they surface.

Timeplus made it possible to engineer features, pivot data, and detect anomalies within sub-second latency, turning raw data into actionable intelligence.



Why Real-Time Streaming Transforms Anomaly Detection


Real-time streaming shifts ML-based anomaly detection from a reactive process to a proactive defence:

  1. Low-latency detection – Analyzes data the moment it arrives, identifying anomalies within seconds of their occurrence rather than minutes or hours later

  2. Preserves temporal context – Captures the full sequence of events—spikes, drifts, and dynamic correlations—that batch processing often misses

  3. Eliminates architectural complexity – Embedding model inference directly into the streaming pipeline removes the need for separate ETL jobs, data lakes, and batch processing schedules

  4. Enables immediate action – Predictions trigger real-time responses: automated scaling, intelligent circuit breakers, targeted alerts, or graceful degradation strategies


This creates a complete detection-to-response feedback loop that transforms system operations from fire-fighting to proactive management.



Further Explorations


This project represents just the beginning. Several areas warrant deeper exploration:


Model sophistication: While linear regression provides a solid baseline, non-linear relationships between business metrics and system performance may benefit from gradient boosting or neural network approaches—particularly when modeling multiple interdependent services.


Multi-metric fusion: Current predictions use openCount alone. Incorporating additional business signals (order values, user cohorts, market volatility) and system metrics (network I/O, disk latency, queue depths) could significantly improve prediction accuracy.


Adaptive thresholds: Static residual thresholds work well in stable environments but struggle during traffic pattern changes. Implementing dynamic thresholds that adapt to recent baseline shifts would reduce false positives during legitimate load changes.


Causal analysis: When anomalies occur, understanding why matters as much as detecting that they happened. Integrating causal inference techniques could help distinguish between correlated metrics and true causal relationships, accelerating root cause analysis.



Conclusion


Building this system taught me that effective anomaly detection isn't just about sophisticated algorithms—it's about bridging the gap between business operations and infrastructure performance. The most valuable insights came not from the model itself, but from understanding how user behavior propagates through system layers.


For anyone building similar systems, I'd emphasize: start simple, measure rigorously, and iterate based on real operational feedback. A basic regression model that runs reliably in production beats a complex neural network that never ships. Focus first on the data pipeline, feature quality, and operational integration—the model can always be upgraded later.


The future of system reliability lies in these proactive, context-aware approaches. As systems grow more complex and user expectations for uptime increase, the ability to predict and prevent issues before they impact users will separate robust platforms from fragile ones.


Real-time anomaly detection isn't just a technical capability—it's a competitive advantage.

 
 