
Real-Time GPU Monitoring for AI Workloads Using Timeplus

  • Writer: Gang Tao
  • May 8
  • 5 min read

When OpenAI's chief executive Sam Altman joked on Twitter about their "melting GPUs," he was playfully acknowledging the extreme processing requirements of artificial intelligence development. Fortunately, we can create a monitoring framework to track these powerful processors before they reach critical temperature thresholds.




In today's AI-driven world, GPUs are the workhorses powering everything from large language models to generative image systems. As models grow larger and more complex, the computational demands on these specialized processors continue to increase exponentially. Without proper monitoring, this can lead to thermal throttling, reduced performance, or even hardware damage.


In this blog post, we'll explore how to build a real-time GPU monitoring system using Timeplus, a streaming analytics platform, alongside NVIDIA's DCGM-Exporter, Vector, and Redpanda. This stack delivers streaming analytics with millisecond-level latency, helping you detect issues before they affect your AI workloads or damage your expensive hardware.



The Challenge of GPU Monitoring for AI Workloads


Modern AI workloads push GPUs to their absolute limits. Training large language models can utilize 100% of GPU resources for days or even weeks. Inference workloads may have different patterns, with frequent spikes in utilization as requests come in.

Traditional monitoring solutions that poll metrics every few minutes simply can't provide the granularity needed to detect short-lived but potentially harmful events. Additionally, the volume of metrics generated by a GPU cluster can be overwhelming for conventional time-series databases.


What's needed is a solution that can:

  • Collect detailed GPU metrics with high frequency

  • Process and analyze this data in real-time

  • Trigger alerts or actions based on conditions

  • Store historical data for performance analysis and capacity planning



Technical Architecture Overview


Our solution uses a modern streaming architecture to collect, process, and analyze GPU metrics in real-time:




  • NVIDIA DCGM-Exporter: Collects comprehensive GPU metrics from NVIDIA Data Center GPU Manager (DCGM)

  • Vector: Processes and transforms metrics before forwarding them

  • Redpanda: A Kafka-compatible streaming data platform that serves as our message bus

  • Timeplus: Performs real-time analytics and visualization on the streaming data


This architecture provides the foundation for monitoring any GPU-intensive workload, from AI model training to inference services.



Setting Up the Environment


I use a Docker Compose stack to demonstrate how this solution works; you can adapt it to a bare-metal or Kubernetes environment accordingly.


The demo solution can be deployed using Docker Compose, making it easy to set up on any machine with NVIDIA GPUs. The full docker-compose.yml ships with the example repository linked at the end of this post.
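
A trimmed sketch of the stack is shown below; the image tags, ports, and GPU reservation details are assumptions that you should adjust to your environment (and to the repository's actual file).

services:
  ollama:                      # example AI workload that exercises the GPU
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:                  # chat UI in front of Ollama
    image: ghcr.io/open-webui/open-webui:main
    depends_on:
      - ollama
  redpanda:                    # Kafka-compatible message bus
    image: redpandadata/redpanda:latest
  dcgm-exporter:               # exposes GPU metrics in Prometheus format
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # pin a concrete tag in practice
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  vector:                      # scrapes dcgm-exporter and forwards to Redpanda
    image: timberio/vector:latest-alpine
    volumes:
      - ./vector.yaml:/etc/vector/vector.yaml:ro
  timeplus:                    # streaming SQL analytics and dashboards
    image: timeplus/timeplusd:latest                 # see the repository for the exact image and ports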


This setup includes:

  • Ollama and OpenWebUI as example AI workloads that will generate GPU usage

  • Redpanda as our streaming platform

  • DCGM-Exporter to collect GPU metrics

  • Vector to process and forward metrics

  • Timeplus for real-time monitoring, analytics, and visualization



Data Pipeline 


GPU Metrics Collector with DCGM-Exporter


NVIDIA's DCGM-Exporter exposes a wealth of GPU metrics in Prometheus format. Some of the key metrics include:

  • DCGM_FI_DEV_GPU_TEMP: GPU temperature

  • DCGM_FI_DEV_POWER_USAGE: Power consumption in watts

  • DCGM_FI_DEV_GPU_UTIL: GPU utilization percentage

  • DCGM_FI_DEV_MEM_COPY_UTIL: Memory utilization percentage

  • DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED: Free and used GPU memory


These metrics are exposed via an HTTP endpoint which Vector will scrape at regular intervals.
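
For reference, a scrape of the exporter's metrics endpoint returns standard Prometheus exposition lines. The labels and values below are illustrative rather than captured output:

# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA GeForce RTX 4090"} 46
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA GeForce RTX 4090"} 93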



Data Collection Pipeline with Vector


Vector acts as our data collection and transformation layer: it scrapes the metrics, reshapes them, and sends them on to downstream sinks.


Here's the configuration we're using; the exact file is in the example repository.
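
The sketch below captures its structure in Vector's YAML format. The DCGM-Exporter hostname and port and the exact VRL expressions are assumptions; the three stages mirror the steps listed after the snippet.

sources:
  dcgm:
    type: prometheus_scrape          # pull metrics from the DCGM-Exporter endpoint
    endpoints:
      - http://dcgm-exporter:9400/metrics
    scrape_interval_secs: 15         # lower this value for more frequent samples

transforms:
  gpu_only:
    type: filter                     # keep only DCGM_* metric families
    inputs:
      - dcgm
    condition: 'starts_with(string!(.name), "DCGM_")'
  stamped:
    type: remap                      # make sure every sample carries an event time
    inputs:
      - gpu_only
    source: |
      .timestamp = now()

sinks:
  redpanda:
    type: kafka                      # Redpanda speaks the Kafka protocol
    inputs:
      - stamped
    bootstrap_servers: redpanda:9092
    topic: gpu-metrics
    encoding:
      codec: json                    # one JSON document per metric sample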


This configuration:

  1. Scrapes metrics from DCGM-Exporter every 15 seconds; if users want more frequent updates, a smaller interval can be set here.

  2. Filters for DCGM metrics and adds timestamps

  3. Forwards the processed metrics to a Redpanda topic called "gpu-metrics"


The transformation step ensures we're only forwarding relevant GPU metrics and that each metric has a proper timestamp.


With this collection pipeline in place, users can also add other sinks if different target systems need to consume these metrics.



Analyze and Monitor GPU Metrics with Timeplus


Now that we have our metrics flowing into Redpanda (or Kafka, depending on your setup), we can use Timeplus to analyze and monitor them in real-time.



1. Create External Stream


First, we'll create an external stream in Timeplus, which lets us run real-time queries against the GPU metrics on the Kafka topic 'gpu-metrics':

CREATE EXTERNAL STREAM default.gpu_metrics
(
  raw string
)
SETTINGS
  type = 'kafka',
  brokers = 'redpanda:9092',
  topic = 'gpu-metrics',
  security_protocol = 'PLAINTEXT',
  data_format = 'RawBLOB',
  skip_ssl_cert_check = false,
  one_message_per_row = true
COMMENT 'nvidia gpu metrics'
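
Each message on the topic is the JSON document Vector emits for a metric sample, which is why the queries that follow can unpack the raw column with JSON path shortcuts such as raw:name, raw:tags:device, and raw:gauge:value. A representative payload (values are illustrative) looks roughly like this:

{
  "name": "DCGM_FI_DEV_GPU_TEMP",
  "kind": "absolute",
  "tags": {
    "gpu": "0",
    "device": "nvidia0",
    "modelName": "NVIDIA GeForce RTX 4090"
  },
  "timestamp": "2025-05-08T10:15:30Z",
  "gauge": {
    "value": 46.0
  }
}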


2. List All Available Metrics


Run the following query to get all GPU metric names:

SELECT DISTINCT
 raw:name
FROM
 gpu_metrics
WHERE
 _tp_time > earliest_ts()

Among the names returned, you will see metrics such as:

  • DCGM_FI_DEV_SM_CLOCK: The current clock speed of the GPU's Streaming Multiprocessors (SMs), measured in MHz. This indicates how fast the computational cores of the GPU are running.

  • DCGM_FI_DEV_MEM_CLOCK: The current memory clock speed, measured in MHz. This shows how quickly the GPU can access its dedicated memory.

  • DCGM_FI_DEV_GPU_TEMP: The current temperature of the GPU die, measured in degrees Celsius. Critical for monitoring thermal conditions to prevent overheating.

  • DCGM_FI_DEV_POWER_USAGE: The current power consumption of the GPU, measured in watts. Helps track energy efficiency and ensure power delivery is within safe limits.

  • DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION: The cumulative energy used by the GPU since monitoring began, measured in millijoules. Useful for calculating total power costs over time.

  • Etc...



3. Monitor GPU Temperature Across All GPUs


The following query returns the GPU temperature in real time, together with the past day's data as a reference. By monitoring the temperature, users can take action well before a GPU gets anywhere near its melting point.

SELECT
 _tp_time, cast(raw:gauge:value, 'float') AS temperature, raw:tags:device AS device, raw:tags:gpu AS gpu, raw:tags:modelName AS model
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_TEMP') AND (_tp_time > (now() - 1d))
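
Because this is a streaming query, it is easy to turn into an alert. The variation below, with an illustrative 85°C threshold, only emits rows when a GPU crosses the limit, and its output can feed a Timeplus alert or any downstream sink:

SELECT
 _tp_time, raw:tags:gpu AS gpu, raw:tags:modelName AS model, cast(raw:gauge:value, 'float') AS temperature
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_TEMP') AND (cast(raw:gauge:value, 'float') > 85)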


4. Track GPU Utilization in Real-Time


The following query gives users the real-time GPU utilization:

SELECT
 _tp_time, cast(raw:gauge:value, 'float') AS util, raw:tags:device AS device, raw:tags:gpu AS gpu, raw:tags:modelName AS model
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_UTIL') AND (_tp_time > (now() - 1d))

When I ask my Ollama instance a question, you can see from the trend-line chart that GPU utilization jumps from 0 to 93%.
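
For longer-term analysis such as capacity planning, a windowed aggregate is often more useful than the raw samples. As a sketch (the one-minute window is an arbitrary choice), the average utilization per GPU can be computed with a tumbling window:

SELECT
 window_start, raw:tags:gpu AS gpu, avg(cast(raw:gauge:value, 'float')) AS avg_util
FROM
 tumble(gpu_metrics, 1m)
WHERE
 raw:name = 'DCGM_FI_DEV_GPU_UTIL'
GROUP BY
 window_start, gpu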





Conclusion


This real-time monitoring setup powered by Timeplus enables several important use cases for AI workloads:

  • Preventing Thermal Throttling: By monitoring GPU temperatures in real time, you can detect when a GPU is approaching its thermal limits and act before thermal throttling kicks in and reduces performance.

  • Optimizing Resource Allocation: By analyzing GPU utilization patterns, you can identify underutilized resources and optimize your workload distribution for better efficiency.

  • Correlating Model Performance with Resource Usage: By combining GPU metrics with application metrics, you can understand how changes in your AI models affect resource consumption and identify optimization opportunities.

  • Capacity Planning: Historical GPU usage data can inform decisions about when to scale up your infrastructure to accommodate growing workloads.


Real-time GPU monitoring is essential for organizations running demanding AI workloads. With the solution outlined in this blog post, you can gain deep visibility into your GPU performance and health, helping you prevent issues before they impact your applications or damage your hardware.


The combination of DCGM-Exporter, Vector, Redpanda, and Timeplus provides a flexible and scalable architecture that can grow with your needs. Whether you're running a single GPU for development or a cluster of GPUs in production, this monitoring stack gives you the insights you need to operate efficiently and reliably.


Get started today by cloning our example repository here: https://github.com/timeplus-io/examples/blob/main/gpu-monitor. Try deploying the monitoring stack alongside your AI workloads. Your GPUs (and your budget) will thank you!

