
Real-Time AI Hallucination Detection with Timeplus: A Chess Example

  • Writer: Gang Tao
  • Jun 12
  • 6 min read

Updated: Jun 13

AI agents are becoming the new workforce. The market for AI agents was valued at $3.7 billion in 2023 and is expected to reach $150 billion by 2025 (LitsLink, 2025). In McKinsey's latest survey, 78% of organizations now use AI in at least one business function, up from just 55% a year earlier (McKinsey, 2025). A recent IBM survey found that 99% of developers building AI applications are exploring or developing AI agents (IBM, 2025).


But as AI agents get smarter and more independent, what happens when they make mistakes? It’s important that we have a way to watch them closely and catch errors as they happen.



A Simple Test: AI Agents Playing Chess



I recently built an application where two AI agents play chess against each other (based on the autogen core samples). Each agent uses a language model to think about the game and make moves. It looks simple, but it's actually a great way to test and learn how AI agents behave.



Here's how each AI agent works in this case:

  • Observe: Check the current chess board and legal moves

  • Think: Decide what move to make

  • Act: Make the move


This is called the ReAct pattern – a method first described in the research paper "ReAct: Synergizing Reasoning and Acting in Language Models." It's now widely used as the go-to design for building AI agents.
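
To make the loop concrete, here is a minimal Python sketch of the Observe/Think/Act cycle using the python-chess library. The choose_move function is just a placeholder for the LLM call that the real autogen-based agents make; everything else mirrors the three steps above.

import random
import chess  # python-chess provides the board state and legal-move generation


def choose_move(board: chess.Board, legal_moves: list[str]) -> str:
    # Placeholder for the "Think" step: the real agent prompts a language
    # model with the board and legal moves and gets back a move plus reasoning.
    return random.choice(legal_moves)


def play_one_turn(board: chess.Board) -> None:
    # Observe: check the current board and the legal moves
    legal_moves = [m.uci() for m in board.legal_moves]

    # Think: decide what move to make
    move = choose_move(board, legal_moves)

    # Act: make the move, exactly once
    board.push_uci(move)


board = chess.Board()
while not board.is_game_over():
    play_one_turn(board)  # white and black alternate turns on the shared board
print(board.result())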


The agents talk to each other through messages, sharing their moves and thoughts through a communication channel. Everything works well... until it doesn't.



When AI Agents Make Mistakes


Each AI agent works independently, deciding which functions to call based on the task instructions. Normally, it should call the functions in order: get the board, check the legal moves, select one move, and play it only once. Most of the time, language models follow this pattern correctly. When they don't, that's a hallucination. Weaker models like GPT-4.1-nano are more likely to hallucinate than stronger models like GPT-4o.


Our chess agents make two main types of errors:

  1. Illegal Moves: The agent tries to make a move that's not allowed in chess

  2. Moving Twice: The agent tries to move twice in a row


In a chess game, you might think it’s not a big deal if AI agents make mistakes. But imagine if these AI agents were handling money, making medical decisions, or controlling self-driving cars. Mistakes like these could be dangerous.



Using Timeplus to Watch for Hallucinations



In this sample application, I replaced the agents' internal message queue with a Timeplus stream. As a result, all agent communication is now logged to the Timeplus stream, and I can use streaming queries to monitor what these agents are doing in real time and get notified as soon as a hallucination happens.
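
As a rough sketch of the publishing side, the Python snippet below shows the idea: every agent message becomes one JSON event appended to the stream. It assumes the proton-driver Python client, a hypothetical stream named chess_game (the extraction view below refers to it as {{channel}}), and that the stream already exists with a single string column, _value, holding each serialized message.

import json
import uuid
from proton_driver import client  # Timeplus Proton Python driver (assumed available)

# Hypothetical stream name; assumed to already exist with one string column, _value.
STREAM = "chess_game"

c = client.Client(host="127.0.0.1", port=8463)


def publish(message_type: str, sender: str, recipient: str, payload: dict) -> None:
    # Serialize one agent message and append it to the Timeplus stream.
    envelope = {
        "message_type": message_type,
        "message_id": str(uuid.uuid4()),
        "sender": sender,
        "recipient": recipient,
        "message_payload": json.dumps(payload),  # payload is double-encoded, as in the sample below
        "is_error": False,
    }
    c.execute(f"INSERT INTO {STREAM} (_value) VALUES", [[json.dumps(envelope)]])


publish(
    "send",
    "PlayerWhite/default",
    "PlayerWhiteToolAgent/default",
    {"_class": "FunctionCall",
     "_data": {"name": "make_move", "arguments": json.dumps({"move": "e2e4"})}},
)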



Extract Information from JSON messages


Here is a sample message in which the white player makes a function call to the tool agent to make a move:


{
   "message_type": "send",
   "message_id": "e76f2da4-b4c8-4896-a05b-787afc1c47a9",
   "sender": "PlayerWhite/default",
   "message_payload": "{\"_autogen_type\": \"object\", \"_class\": \"FunctionCall\", \"_module\": \"autogen_core._types\", \"_data\": {\"id\": \"call_bP7z0mTCCDpLwKnTtcFELDcz\", \"arguments\": \"{\\\"thinking\\\":\\\"Move king from a6 to a5 to approach the opponent's king and support pawn advancement.\\\",\\\"move\\\":\\\"a6a5\\\"}\", \"name\": \"make_move\"}}",
   "metadata": "{\"traceparent\": null, \"tracestate\": null, \"links\": null}",
   "recipient": "PlayerWhiteToolAgent/default",
   "topic_id": null,
   "is_error": false
}

There are three different types of messages: 

  • Publish 

  • Send 

  • Response 


First, let's extract those fields from the JSON using a Timeplus SQL query:

CREATE VIEW messages AS
SELECT
  _tp_time AS time, 
  _value:message_type AS message_type, 
  _value:message_id AS message_id, 
  _value:sender AS sender, 
  _value:message_payload AS message_payload, 
  _value:recipient AS recipient
FROM
  {{channel}}
WHERE
  _tp_time > earliest_ts()
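
One detail worth noting: message_payload is itself a JSON string stored inside the message, and its arguments field is yet another JSON string, so there are three layers of encoding. The SQL above unwraps them with JSON path syntax; this small, illustrative Python snippet does the same thing with json.loads so you can see which layer each field lives in.

import json

# Build the same nested structure as the sample message: the arguments field
# is a JSON string inside the payload, and the payload is a JSON string
# inside the message, so there are three layers of encoding.
arguments = json.dumps({"thinking": "Advance the king.", "move": "a6a5"})
payload = json.dumps({"_class": "FunctionCall", "_data": {"name": "make_move", "arguments": arguments}})
event = json.dumps({"message_type": "send", "sender": "PlayerWhite/default", "message_payload": payload})

# Unwrapping mirrors _value:sender, message_payload:_class and
# message_payload:_data.arguments.move in the Timeplus queries.
msg = json.loads(event)
fc = json.loads(msg["message_payload"])
args = json.loads(fc["_data"]["arguments"])
print(msg["sender"], fc["_data"]["name"], args["move"])
# PlayerWhite/default make_move a6a5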


Finding Agents That Move Twice


Using a Timeplus streaming query, we can monitor the agents' hallucination behavior in real time: the query runs continuously and emits a result as soon as a new event arrives.

We wrote a simple query to catch when the same player tries to move twice:

WITH moves AS
(
  SELECT
    sender,
    message_payload:_data.arguments.thinking AS thinking,
    message_payload:_data.arguments.move AS move
  FROM
    messages
  WHERE
    message_type = 'send' AND message_payload:_class = 'FunctionCall' AND message_payload:_data.name = 'make_move'
)
SELECT
  sender AS current_player, lag(sender) AS previous_player
FROM
  moves
WHERE
  current_player = previous_player

This query looks at who made each move and finds cases where the same agent moved twice in a row.


Step 1: Extract Move Data (WITH moves AS...)

  • Looks at all records from messages view

  • Filters for only "send" messages (when agents are making requests)

  • Finds messages that are function calls specifically for "make_move"

  • Extracts who sent it (sender), what they were thinking, and what move they made

  • Creates a temporary table called "moves" with this data


Step 2: Find Consecutive Moves

  • Uses lag(sender) to get the previous row's sender (who moved before)

  • Compares the current player with the previous player

  • Only shows results where they're the same player


If the query finds a problem, you might see:

current_player | previous_player
PlayerBlack   | PlayerBlack


Catching Illegal Moves


We also check if agents try to make moves that aren't legal:

WITH request_response AS
(
  SELECT
    time, sender, message_payload, message_type
  FROM
    messages
  WHERE
    message_type IN ('send', 'response')
), consecutive_calls AS
(
  SELECT
    time, sender, lag(sender) AS previous_sender,
    message_payload, lag(message_payload) AS previous_payload, message_type
  FROM
    request_response
)
SELECT
  time, sender, previous_sender,
  message_payload:_data:arguments:move AS move,
  previous_payload:_data:content AS legal_moves,
  position(legal_moves, move) > 0 AS legal
FROM
  consecutive_calls
WHERE
  message_payload:_data:name = 'make_move' AND message_type = 'send'
  AND previous_payload:_data:name = 'get_legal_moves' AND NOT legal

This helps us spot when an agent invents a move that doesn't follow chess rules.


Step 1: Get Request and Response Messages

This collects all messages that are either:

  • 'send' - when an agent makes a request (like "get legal moves" or "make move")

  • 'response' - when the system responds back (like returning the list of legal moves)


Step 2: Pair Each Message with the Previous One

This creates pairs of consecutive messages using lag():

  • Current message and the message that came right before it

  • This lets us see the sequence: "get legal moves" → response → "make move"


Step 3: Find Illegal Moves

This final step:

  • Looks for "make_move" requests

  • Checks if the previous message was a "get_legal_moves" response

  • Extracts the attempted move and the list of legal moves

  • Uses position(legal_moves, move) > 0 to check if the move is in the legal moves list

  • Only shows results where NOT legal (the move is illegal)


A sample illegal move check result is:

time | sender | move | legal_moves | legal 
10:30:15 | Agent_White | e2e8 | e2e4,d2d4,g1f3 | false
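
For comparison, the same legality check is easy to express at the application layer with the python-chess library, which is one way a tool agent can validate a proposed move before playing it. This is just a sketch, not the sample app's exact code:

import chess

board = chess.Board()                    # starting position, matching the legal moves in the sample above
attempted = chess.Move.from_uci("e2e8")  # the hallucinated move from the sample result

print(attempted in board.legal_moves)    # False: e2e8 is not a legal move from this position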

Now we have real-time queries that monitor these hallucinations as they happen, so you know whether the agents are working as expected and get notified as soon as something goes wrong.
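
One way to turn a detection query into a notification is a small watcher process that runs it as an unbounded streaming query and reacts to every row it emits. The sketch below assumes the proton-driver Python client; swap the print for a webhook, Slack message, or pager call.

from proton_driver import client  # Timeplus Proton Python driver (assumed)

c = client.Client(host="127.0.0.1", port=8463)

# The double-move query from above, run as an unbounded streaming query.
# execute_iter blocks and yields a new row each time a violation is detected.
DOUBLE_MOVE_SQL = """
WITH moves AS
(
  SELECT sender
  FROM messages
  WHERE message_type = 'send'
    AND message_payload:_class = 'FunctionCall'
    AND message_payload:_data.name = 'make_move'
)
SELECT sender AS current_player, lag(sender) AS previous_player
FROM moves
WHERE current_player = previous_player
"""

for current_player, previous_player in c.execute_iter(DOUBLE_MOVE_SQL):
    # Replace the print with a real notification channel.
    print(f"ALERT: {current_player} attempted to move twice in a row")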


Chess is just a demo case. The real value comes when we apply this monitoring to important systems:

  • Banking: Stop AI agents from making wrong financial decisions

  • Healthcare: Make sure medical AI doesn't give dangerous advice

  • Customer Service: Catch when chatbots give wrong information




So, why Timeplus?


In this sample application, Timeplus plays two roles:

  • The persistent communication channel for the agents

  • The observability and instrumentation layer for the agents


By replacing the in-memory message queue with a persistent stream, the following capabilities become possible:

  • Distributed agents: agents can run anywhere and simply send and consume messages through the Timeplus stream

  • Durable agents: an agent can recover the conversation, because all messages are persisted in the Timeplus stream

  • Time travel: with the Timeplus stream holding the full conversation history, we can replay the agents' work from any point in time, as shown in the sketch below
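
As a sketch of that replay capability: Timeplus lets you wrap a stream in table() to run a bounded historical query over everything that has been persisted. The snippet below assumes the proton-driver client and the hypothetical chess_game stream from the earlier sketch.

from proton_driver import client  # Timeplus Proton Python driver (assumed)

c = client.Client(host="127.0.0.1", port=8463)

# table() turns the stream into a bounded historical scan, so we can rebuild
# the whole conversation (or any slice of it) from what has been persisted.
rows = c.execute(
    "SELECT _tp_time, _value:sender, _value:message_type "
    "FROM table(chess_game) "
    "WHERE _tp_time >= '2025-06-12 00:00:00' "
    "ORDER BY _tp_time"
)
for ts, sender, message_type in rows:
    print(ts, sender, message_type)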


And with all messages persisted in the Timeplus stream, we get better, real-time observability:

  • Monitor agent behavior in real time

  • Secure agent actions by preventing agents from calling illegal functions or accessing unsecured data

  • Analyze agent operations and make improvements


AI agents will make mistakes. That's normal. The important thing is catching those mistakes quickly before they cause problems.


Our chess-playing agents taught us that even simple AI systems need careful monitoring. As AI agents become more powerful and handle more important tasks, watching them closely becomes even more critical.


With Timeplus, we can see what our AI agents are doing, catch their mistakes, and fix problems before they get serious. This helps us build AI systems that people can trust.


The chess game continues, but now we're watching every move in real-time. We know when our agents think clearly and when they get confused. Most importantly, we can step in before small mistakes become big problems. 



