
Real-Time GPU Monitoring for AI Workloads Using Timeplus

  • Writer: Gang Tao
  • May 8
  • 5 min read

When OpenAI's chief executive Sam Altman joked on Twitter about their "melting GPUs," he was playfully acknowledging the extreme processing requirements of artificial intelligence development. Fortunately, we can create a monitoring framework to track these powerful processors before they reach critical temperature thresholds.




In today's AI-driven world, GPUs are the workhorses powering everything from large language models to generative image systems. As models grow larger and more complex, the computational demands on these specialized processors continue to increase exponentially. Without proper monitoring, this can lead to thermal throttling, reduced performance, or even hardware damage.


In this blog post, we'll explore how to build a real-time GPU monitoring system using Timeplus, a streaming analytics platform, alongside NVIDIA's DCGM-Exporter, Vector, and Redpanda. This stack delivers streaming analytics with millisecond-level latency, helping you detect issues before they affect your AI workloads or damage your expensive hardware.



The Challenge of GPU Monitoring for AI Workloads


Modern AI workloads push GPUs to their absolute limits. Training large language models can utilize 100% of GPU resources for days or even weeks. Inference workloads may have different patterns, with frequent spikes in utilization as requests come in.

Traditional monitoring solutions that poll metrics every few minutes simply can't provide the granularity needed to detect short-lived but potentially harmful events. Additionally, the volume of metrics generated by a GPU cluster can be overwhelming for conventional time-series databases.


What's needed is a solution that can:

  • Collect detailed GPU metrics with high frequency

  • Process and analyze this data in real-time

  • Trigger alerts or actions based on conditions

  • Store historical data for performance analysis and capacity planning



Technical Architecture Overview


Our solution uses a modern streaming architecture to collect, process, and analyze GPU metrics in real-time:




  • NVIDIA DCGM-Exporter: Collects comprehensive GPU metrics from NVIDIA Data Center GPU Manager (DCGM)

  • Vector: Processes and transforms metrics before forwarding them

  • Redpanda: A Kafka-compatible streaming data platform that serves as our message bus

  • Timeplus: Performs real-time analytics and visualization on the streaming data


This architecture provides the foundation for monitoring any GPU-intensive workload, from AI model training to inference services.



Setting Up the Environment


I use a Docker Compose stack to demonstrate how this solution works; you can adapt it to a bare-metal or Kubernetes environment accordingly.


The demo solution can be deployed using Docker Compose, making it easy to set up on any machine with NVIDIA GPUs. The full docker-compose.yml ships with the example repository linked at the end of this post.
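
A trimmed sketch of the stack is shown below; the image tags, ports, and GPU reservation details are assumptions that you should adjust to your environment (and to the repository's actual file).

services:
  ollama:                      # example AI workload that exercises the GPU
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:                  # chat UI in front of Ollama
    image: ghcr.io/open-webui/open-webui:main
    depends_on:
      - ollama
  redpanda:                    # Kafka-compatible message bus
    image: redpandadata/redpanda:latest
  dcgm-exporter:               # exposes GPU metrics in Prometheus format
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # pin a concrete tag in practice
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  vector:                      # scrapes dcgm-exporter and forwards to Redpanda
    image: timberio/vector:latest-alpine
    volumes:
      - ./vector.yaml:/etc/vector/vector.yaml:ro
  timeplus:                    # streaming SQL analytics and dashboards
    image: timeplus/timeplusd:latest                 # see the repository for the exact image and ports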


This setup includes:

  • Ollama and OpenWebUI as example AI workloads that will generate GPU usage

  • Redpanda as our streaming platform

  • DCGM-Exporter to collect GPU metrics

  • Vector to process and forward metrics

  • Timeplus for real-time monitoring, analytics, and visualization



Data Pipeline 


GPU Metrics Collector with DCGM-Exporter


NVIDIA's DCGM-Exporter exposes a wealth of GPU metrics in Prometheus format. Some of the key metrics include:

  • DCGM_FI_DEV_GPU_TEMP: GPU temperature

  • DCGM_FI_DEV_POWER_USAGE: Power consumption in watts

  • DCGM_FI_DEV_GPU_UTIL: GPU utilization percentage

  • DCGM_FI_DEV_MEM_COPY_UTIL: Memory utilization percentage

  • DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED: Free and used GPU memory


These metrics are exposed via an HTTP endpoint which Vector will scrape at regular intervals.
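
For reference, a scrape of the exporter's metrics endpoint returns standard Prometheus exposition lines. The labels and values below are illustrative rather than captured output:

# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA GeForce RTX 4090"} 46
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",device="nvidia0",modelName="NVIDIA GeForce RTX 4090"} 93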



Data Collection Pipeline with Vector


Vector acts as our data collection and transformation layer: it scrapes the metrics, reshapes them, and sends them on to downstream sinks.


Here's the configuration we're using; the exact file is in the example repository.
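
The sketch below captures its structure in Vector's YAML format. The DCGM-Exporter hostname and port and the exact VRL expressions are assumptions; the three stages mirror the steps listed after the snippet.

sources:
  dcgm:
    type: prometheus_scrape          # pull metrics from the DCGM-Exporter endpoint
    endpoints:
      - http://dcgm-exporter:9400/metrics
    scrape_interval_secs: 15         # lower this value for more frequent samples

transforms:
  gpu_only:
    type: filter                     # keep only DCGM_* metric families
    inputs:
      - dcgm
    condition: 'starts_with(string!(.name), "DCGM_")'
  stamped:
    type: remap                      # make sure every sample carries an event time
    inputs:
      - gpu_only
    source: |
      .timestamp = now()

sinks:
  redpanda:
    type: kafka                      # Redpanda speaks the Kafka protocol
    inputs:
      - stamped
    bootstrap_servers: redpanda:9092
    topic: gpu-metrics
    encoding:
      codec: json                    # one JSON document per metric sample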


This configuration:

  1. Scrapes metrics from DCGM-Exporter every 15 seconds; if users want more frequent updates, a smaller interval can be set here.

  2. Filters for DCGM metrics and adds timestamps

  3. Forwards the processed metrics to a Redpanda topic called "gpu-metrics"


The transformation step ensures we're only forwarding relevant GPU metrics and that each metric has a proper timestamp.


With this collection pipeline in place, users can also add other sinks if different target systems need to consume these metrics.



Analyze and Monitor GPU Metrics with Timeplus


Now that we have our metrics flowing into Redpanda (or Kafka, depending on your setup), we can use Timeplus to analyze and monitor them in real-time.



1. Create External Stream


First, we'll create an external stream in Timeplus, which lets us run real-time queries against the GPU metrics on the Kafka topic 'gpu-metrics':

CREATE EXTERNAL STREAM default.gpu_metrics
(
  raw string
)
SETTINGS
  type = 'kafka',
  brokers = 'redpanda:9092',
  topic = 'gpu-metrics',
  security_protocol = 'PLAINTEXT',
  data_format = 'RawBLOB',
  skip_ssl_cert_check = false,
  one_message_per_row = true
COMMENT 'nvidia gpu metrics'
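
Each message on the topic is the JSON document Vector emits for a metric sample, which is why the queries that follow can unpack the raw column with JSON path shortcuts such as raw:name, raw:tags:device, and raw:gauge:value. A representative payload (values are illustrative) looks roughly like this:

{
  "name": "DCGM_FI_DEV_GPU_TEMP",
  "kind": "absolute",
  "tags": {
    "gpu": "0",
    "device": "nvidia0",
    "modelName": "NVIDIA GeForce RTX 4090"
  },
  "timestamp": "2025-05-08T10:15:30Z",
  "gauge": {
    "value": 46.0
  }
}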


2. List All Available Metrics


Run the following query to get all GPU metric names:

SELECT DISTINCT
 raw:name
FROM
 gpu_metrics
WHERE
 _tp_time > earliest_ts()

Among the names returned, you will see metrics such as:

  • DCGM_FI_DEV_SM_CLOCK: The current clock speed of the GPU's Streaming Multiprocessors (SMs), measured in MHz. This indicates how fast the computational cores of the GPU are running.

  • DCGM_FI_DEV_MEM_CLOCK: The current memory clock speed, measured in MHz. This shows how quickly the GPU can access its dedicated memory.

  • DCGM_FI_DEV_GPU_TEMP: The current temperature of the GPU die, measured in degrees Celsius. Critical for monitoring thermal conditions to prevent overheating.

  • DCGM_FI_DEV_POWER_USAGE: The current power consumption of the GPU, measured in watts. Helps track energy efficiency and ensure power delivery is within safe limits.

  • DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION: The cumulative energy used by the GPU since monitoring began, measured in millijoules. Useful for calculating total power costs over time.

  • Etc...



3. Monitor GPU Temperature Across All GPUs


The following query returns the GPU temperature in real time, together with the past day's data as a reference. By monitoring the temperature, users can take action well before a GPU gets anywhere near its melting point.

SELECT
 _tp_time, cast(raw:gauge:value, 'float') AS temperature, raw:tags:device AS device, raw:tags:gpu AS gpu, raw:tags:modelName AS model
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_TEMP') AND (_tp_time > (now() - 1d))
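
Because this is a streaming query, it is easy to turn into an alert. The variation below, with an illustrative 85°C threshold, only emits rows when a GPU crosses the limit, and its output can feed a Timeplus alert or any downstream sink:

SELECT
 _tp_time, raw:tags:gpu AS gpu, raw:tags:modelName AS model, cast(raw:gauge:value, 'float') AS temperature
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_TEMP') AND (cast(raw:gauge:value, 'float') > 85)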


4. Track GPU Utilization in Real-Time


The following query gives users the real-time GPU utilization:

SELECT
 _tp_time, cast(raw:gauge:value, 'float') AS util, raw:tags:device AS device, raw:tags:gpu AS gpu, raw:tags:modelName AS model
FROM
 gpu_metrics
WHERE
 (raw:name = 'DCGM_FI_DEV_GPU_UTIL') AND (_tp_time > (now() - 1d))

When I ask my Ollama instance a question, you can see from the trend-line chart that GPU utilization jumps from 0 to 93%.
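
For longer-term analysis such as capacity planning, a windowed aggregate is often more useful than the raw samples. As a sketch (the one-minute window is an arbitrary choice), the average utilization per GPU can be computed with a tumbling window:

SELECT
 window_start, raw:tags:gpu AS gpu, avg(cast(raw:gauge:value, 'float')) AS avg_util
FROM
 tumble(gpu_metrics, 1m)
WHERE
 raw:name = 'DCGM_FI_DEV_GPU_UTIL'
GROUP BY
 window_start, gpu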





Conclusion


This real-time monitoring setup powered by Timeplus enables several important use cases for AI workloads:

  • Preventing Thermal Throttling: By monitoring GPU temperatures in real time, you can detect when a GPU is approaching its thermal limits and act before thermal throttling kicks in and reduces performance.

  • Optimizing Resource Allocation: By analyzing GPU utilization patterns, you can identify underutilized resources and optimize your workload distribution for better efficiency.

  • Correlating Model Performance with Resource Usage: By combining GPU metrics with application metrics, you can understand how changes in your AI models affect resource consumption and identify optimization opportunities.

  • Capacity Planning: Historical GPU usage data can inform decisions about when to scale up your infrastructure to accommodate growing workloads.


Real-time GPU monitoring is essential for organizations running demanding AI workloads. With the solution outlined in this blog post, you can gain deep visibility into your GPU performance and health, helping you prevent issues before they impact your applications or damage your hardware.


The combination of DCGM-Exporter, Vector, Redpanda, and Timeplus provides a flexible and scalable architecture that can grow with your needs. Whether you're running a single GPU for development or a cluster of GPUs in production, this monitoring stack gives you the insights you need to operate efficiently and reliably.


Get started today by cloning our example repository here: https://github.com/timeplus-io/examples/blob/main/gpu-monitor. Try deploying the monitoring stack alongside your AI workloads. Your GPUs (and your budget) will thank you!

