
Unlocking Cloud Observability with Confluent and Timeplus Cloud

Updated: Sep 13, 2023


Observability is a hot topic and it’s hard to do.


As software architecture has evolved from monolithic applications to layered client/server, SOA, and cloud-native microservice architectures, the goals of modern observability have changed, creating new challenges:

  • Respond in real time: In rapidly evolving systems, real-time monitoring is critical. Processing and analyzing data in real time while maintaining performance can be demanding.

  • Handle massive data from any source: Different components of a system produce different types of data (logs, performance metrics, traces, etc.). Correlating these various data types to gain a comprehensive view can be a challenge.

  • A single pane of glass: As systems evolve, monitoring strategies need to adapt, and observability practices have to keep up with changes in infrastructure and software. This means observability tools need to be easy to learn and use when developing new monitoring queries, dashboards, or alerts.

Our core engineering team at Timeplus came from Splunk, so we spend a lot of time thinking about how to use streaming to solve observability challenges. We found a great way to monitor our Timeplus Cloud multi-tenant infrastructure with our own Timeplus product and Confluent Cloud. Here’s how we did it.


 

About Timeplus Cloud


Timeplus is a cloud-native, real-time streaming data analytics platform. It runs on AWS EKS, the AWS-managed Kubernetes service. There are several Timeplus deployments on EKS for different stages and purposes. Each deployment contains its own API services and separate workspaces for every tenant. The heart of each workspace is a core streaming database, the main engine driving all the analytics. These workspaces are fully isolated from each other, ensuring security and performance.


To keep a close watch on the Timeplus cluster, we gather all Kubernetes container logs and metrics. This data is sent to two Kafka topics in Confluent Cloud, and we then use Timeplus to read the Kafka data and build various analytics and monitors. This includes:

  • Infrastructure metrics like CPU, Memory, and Disk usage.

  • Application Metrics – things like user activities, application errors, and the number of queries.

  • Business Metrics such as workspace costs and profits.

 

Onboarding Kafka Data


Timeplus Cloud’s metrics and logs are collected using Vector and then sent to Kafka topics in Confluent Cloud.




Two Kafka topics, `k8s_logs` and `k8s_metrics`, store the log and metric data in JSON format. Here is some sample data:


k8s_logs
{
  "container_image": "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.27.4-minimal-eksbuild.2",
  "container_name": "kube-proxy",
  "message": "I0819 00:26:01.611779       1 proxier.go:1573] \"Reloading service iptables data\" numServices=128 numEndpoints=145 numFilterChains=6 numFilterRules=9 numNATChains=4 numNATRules=132",
  "namespace": "kube-system",
  "node_name": "ip-10-0-144-12.us-west-2.compute.internal",
  "pod_ip": "10.0.144.12",
  "pod_name": "kube-proxy-f72xr",
  "ts": "2023-08-19T00:26:01.611Z"
}

k8s_metrics
{
  "name": "memory_swap_used_bytes",
  "tags": {
    "collector": "memory",
    "host": "vector-agent-gml9w",
    "namespace": "host",
    "node": "ip-10-0-129-6.us-west-2.compute.internal"
  },
  "ts": "2023-08-19T00:28:32.19Z",
  "type": "gauge",
  "value": 0
}

There are different ways to onboard data from Kafka into Timeplus.

  • via the Apache Kafka or Confluent Cloud source: Timeplus can pull data from Kafka continuously and save it into Timeplus streams. In most cases, keeping data in Kafka is expensive, so users usually won’t retain it for long. By saving the data into a Timeplus stream, the user can configure a shorter retention time in Kafka/Confluent while keeping more data available in Timeplus for future analytics.

  • via the Kafka Connect Timeplus sink or the Airbyte connector: Refer to my previous blogs 1 2 on using Kafka Connect or other third-party data integration tools such as Airbyte. In this approach, data is pushed into Timeplus, and users have more flexibility in how Kafka brokers are exposed.

  • via Timeplus external streams: Another option is external streams, where queries read data directly from Kafka without persisting anything in Timeplus. When there is no need to keep the original data, such as when building a data pipeline, an external stream is more cost effective; a sketch follows below.
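To make the last option concrete, here is a minimal sketch of what an external stream over the `k8s_logs` topic might look like; the stream name, broker address, and credentials are placeholders, and the exact settings depend on your Confluent Cloud cluster.

-- Hypothetical external stream reading raw JSON messages from the k8s_logs topic
-- (broker address and credentials below are placeholders)
CREATE EXTERNAL STREAM k8s_logs_ext (raw string)
SETTINGS type = 'kafka',
         brokers = 'pkc-xxxxx.us-west-2.aws.confluent.cloud:9092',
         topic = 'k8s_logs',
         security_protocol = 'SASL_SSL',
         sasl_mechanism = 'PLAIN',
         username = '<confluent-api-key>',
         password = '<confluent-api-secret>'

Queries can then read from k8s_logs_ext directly, with nothing persisted in Timeplus.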



 

Explore Streams


Two Timeplus streams have been created through the Kafka source: the logs stream `k8s_logs` and the metrics stream `k8s_metrics`. Let's start our very first analysis and take a look at what those streams provide.


SELECT * FROM STREAM_NAME

The above query is the starting point for every data engineer exploring a data table. We call it a “streaming tail” query. There are some differences compared to a traditional SQL query:

  • The query is unbounded and its results are pushed to the user; it emits data as soon as new events come in.

  • The query does not include historical data; it only emits results in real time, based on new data arriving after the query starts.




Since Timeplus combines real-time data with historical data (check out How Timeplus Unifies Streaming and Historical Data Processing), users can turn the query into a bounded query, just like in a traditional database, by applying the table function to the stream:


SELECT * FROM table(stream_name)

Here is what the events from k8s_logs and k8s_metrics contain:


k8s_logs
  • message contains the container logs

  • ts is the event time when Vector collected the log

  • namespace/node_name/… are metadata about the Kubernetes container



k8s_metrics
  • value is the metric value

  • name is the metric name

  • tags are the metric labels as a JSON string; different metrics may contain different tags

  • ts is the event time when Vector collected the metric

  • type is the metric type, which can be counter or gauge
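As a quick exploration sketch (metric names and counts will differ in your environment), a bounded query over the metrics stream shows which metrics exist and how many data points each has collected so far:

SELECT
 name, count() AS cnt
FROM
 table(k8s_metrics)
GROUP BY
 name
ORDER BY
 cnt DESC
LIMIT 10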




 

Schema-on-read, from ETL to ELT


Logs, metrics, and traces are the three pillars of observability. Since logs are usually unstructured text data, there is traditionally an ETL process that transforms the unstructured data into a table with a schema before loading it into an OLAP system for analysis.


A key challenge of ETL is schema drift. As new applications come online or are upgraded to newer versions, new logs appear in the system that require extraction rules different from the previous ones. Sometimes new fields that were not extracted before need to be added due to new business requirements.


When a schema change happens, the old extraction rule no longer works and the data engineer has to build new rules and redo the ETL.


This is why ELT was introduced. Instead of extracting and transforming before loading, just keep the raw data in the analytics system and transform it when you need it.


Here is one example of how we turn the raw logs into structured web access logs:

SELECT
  _tp_time, namespace, pod_name, grok(message, 'time=("?)%{DATA:time}("?) level=%{LOGLEVEL:level} msg=("?)%{DATA:msg}("?) clientIP=%{IP:clientIP} hostname=%{HOSTNAME:hostname} latency=%{INT:latency} method=%{DATA:method} path=%{DATA:path} referer=("?)%{DATA:referer}("?) reqID=%{DATA:reqID} size=%{INT:size} status=%{INT:status} userAgent=("?)%{DATA:userAgent}("?)') AS event
FROM
  k8s_logs
WHERE
  (length(event) > 0) AND (namespace ILIKE 'timeplus%')

One sample event from the result of the above query is:




The grok function is similar to regular expressions, but more intuitive. It checks each log event against the specified pattern and extracts all the fields defined by that pattern. Now we have a structured event as a map whose keys and values are both strings.


The term "grok" was popularized by Robert A. Heinlein's science fiction novel "Stranger in a Strange Land," where it was used to describe a deep and intuitive understanding of something. In the context of log analysis, "grok" implies gaining an understanding of the log data by breaking it down into its constituent parts and making sense of it.

WITH logs AS
 (
   SELECT
     _tp_time, namespace, pod_name, grok(message, 'time=("?)%{DATA:time}("?) level=%{LOGLEVEL:level} msg=("?)%{DATA:msg}("?) clientIP=%{IP:clientIP} hostname=%{HOSTNAME:hostname} latency=%{INT:latency} method=%{DATA:method} path=%{DATA:path} referer=("?)%{DATA:referer}("?) reqID=%{DATA:reqID} size=%{INT:size} status=%{INT:status} userAgent=("?)%{DATA:userAgent}("?)') AS event
   FROM
     k8s_logs
   WHERE
     (length(event) > 0) AND (namespace ILIKE 'timeplus%') AND (_tp_time > (now() - 1h))
 )
SELECT
 window_start, p90(to_int(event['latency'])) AS p90_latency
FROM
 tumble(logs, 5m)
GROUP BY
window_start

The query above builds some structured analysis based on the extracted event:

  • Using a CTE to define an intermediate result from the grok extraction, which can be referenced later in the query

  • Using _tp_time > now() - 1h to include only events from the past hour

  • Using event['field'] to refer to a field of the extracted event

  • Using to_int() to convert a string to an integer

  • Using tumble to group events into fixed-size time windows, a common operation in stream processing

  • Using p90 to get the 90th percentile of web access latency


Without any external ETL process, Timeplus can build a structured query at run time. This is what we call schema-on-read, and it gives your business maximum flexibility.


 

View and Materialized View


The previous grok-based query works well, but it is too long to write every time. Don't worry: you can create a view or materialized view based on it.


A view is like a macro: a logical representation of a query. It is similar to the CTE in the previous query, which creates a temporary query instance that can be referenced later, except that a view creates a permanent reference to a query and can be used in any other query.


A materialized view is a physical instance of the query: it not only runs the query in the background, but also saves the query result as a stream. When you query the materialized view, the results have already been computed by the background query, so it is very fast, at the cost of background compute and the storage needed to save those results.
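Here is a sketch of what creating such a view might look like; view_web_access_log matches the name used in the query below, '<grok pattern>' stands in for the full grok pattern shown earlier, and mv_web_access_log is just a hypothetical name for the materialized variant.

-- Sketch: a view over the grok extraction ('<grok pattern>' abbreviates the full pattern above)
CREATE VIEW view_web_access_log AS
SELECT
 _tp_time, namespace, pod_name, grok(message, '<grok pattern>') AS event
FROM
 k8s_logs
WHERE
 (length(event) > 0) AND (namespace ILIKE 'timeplus%')

-- Replacing CREATE VIEW with CREATE MATERIALIZED VIEW (for example, mv_web_access_log)
-- would run the same query in the background and persist its results as a stream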


By creating a view over that grok query, our analysis of web access latency can be as simple as:

SELECT
 window_start, p90(to_int(event['latency'])) AS p90_latency
FROM
 tumble(view_web_access_log, 5m)
GROUP BY
window_start

 

Dashboard


Here are some of the dashboards we built to monitor the Kubernetes infrastructure, containers, applications, user activities, and business usage. We have obtained SOC 2 Type 1 for Timeplus Cloud. The following data is from our testing server, with live or mocked data.






 

Alerts


Utilizing streaming queries for monitoring offers a significant advantage: the ability to initiate actions promptly when an event occurs. An alert is most valuable when anomalies arise and measures need to be taken as soon as possible.


When working with streaming data, one of the key concerns is data quality: how much data is being ingested right now, or over the past week? Are all data sources working well? Did any source stop working unexpectedly?


Streaming queries can make such monitoring simple and efficient.


Assume there are two IoT sensors; data is collected from both sensors and sent to a Kafka topic called `iot`.


The data collected from sensors looks like this:

{
  "_tp_time": "2023-08-21T17:49:01.932Z",
  "device": "device_1",
  "number": 20,
  "time": "2023-08-21 17:49:01.000"
}

Each sensor will generate 5 events per second.



To monitor the Kafka topic, an external stream is a good choice. No data is persisted when using an external stream; queries read directly from the Kafka topic without saving anything.
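As with the earlier external stream sketch, the `es_sample_iot` stream used below might be defined along these lines (setting values are placeholders; the same kind of security settings as before would apply):

-- Hypothetical definition of the es_sample_iot external stream queried below
CREATE EXTERNAL STREAM es_sample_iot (raw string)
SETTINGS type = 'kafka',
         brokers = '<confluent-bootstrap-server>',
         topic = 'iot'

The single raw column holds the original JSON message, which is why the queries below use raw:_tp_time to extract fields from it.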



With the following SQL, we can monitor the throughput of that Kafka topic.


WITH iot AS
 (
   SELECT
     to_time(raw:_tp_time) AS _tp_time, *
   FROM
     es_sample_iot
 )
SELECT
 window_start, count() AS eps
FROM
 tumble(iot, 1s)
GROUP BY
window_start

The above query counts how many events arrive in each 1-second tumble window on that Kafka topic; it should return 10 if both sensors are working.



If one of the sensors breaks, the eps will drop to 5, and if both are down, the eps will be 0.


But since the eps most likely won't stay at these exact values, what we really care about is the change in throughput.


It is easy to monitor the throughput change using streaming SQL:


WITH iot AS
 (
   SELECT
     to_time(raw:_tp_time) AS _tp_time, *
   FROM
     es_sample_iot
 ), iot_agg_5s AS
 (
   SELECT
     window_start, count()  AS eps
   FROM
     tumble(iot, 1s)
   GROUP BY
     window_start
 )
SELECT
 eps, lag(eps) AS prev_eps, (prev_eps - eps) / prev_eps AS eps_drop
FROM
 iot_agg_5s
WHERE eps_drop > 0.2

The above query calculates the throughput change between two consecutive 1-second windows. If the eps drops more than 20%, something is wrong. We can create a sink from this query and send the result to a notification channel such as email, SMS, or Slack.


Note: the lag function returns the previous value of the specified field.



A notification will be triggered as soon as one of the sensors stops working.



 

Summary


Using Confluent and Timeplus together is a winning combination for cloud observability. Timeplus uses its own streaming queries, views, dashboards, and alerts to solve its observability challenges.


As a streaming data analytics platform, Timeplus solves many key observability challenges:

  • Timeplus provides very low latency, with continuously pushing queries that respond as soon as new events happen;

  • Timeplus’s streaming SQL is very close to standard SQL with some streaming feature extensions. It is easy to learn and use;

  • Timeplus supports powerful processing functions for analyzing unstructured log data. No ETL is required, which simplifies the overall observability stack;

  • As a streaming data platform, Timeplus lets users import other business and application data, correlate data from different domains, and gain insight into the whole system.


Let us know how we can help support your own cloud observability efforts. Get in touch with us for more information: gang@timeplus.com.




