Streaming Database: 9 Things You Need To Know
Welcome to a New Era of Streaming Databases 🎉
For data engineers and analysts, you should have “Streaming Database” in your toolbox, to simplify the data stack, reduce end-to-end latency, reduce costs, and deliver faster. I listed 9 things you need to know about streaming databases.
1. Streaming Database
A streaming database is a type of database that can handle and process real time, continuous streams of data. It’s designed to handle high volumes of data with low latency and high throughput, and is often used in applications that require real-time data analysis, such as financial trading, fraud detection, and IoT.
A key difference between a streaming database and an OLAP database is whether it can run stateful streaming SQL. For example, to get the latest conditional count, is it necessary to run the query every X seconds, or just run a query and keep getting new results?
2. Streaming SQL
Also known as Continuous SQL, this is a programming language used for querying and processing real-time data streams in a streaming database or stream processing platform. It is a variation of SQL and is being standardized with the INCITS/Data Management Technical Committee.
Streaming SQL can be used to filter, aggregate, transform, and join unbounded streaming data from multiple sources and push new results to the clients (via WebSocket/SSE), or save internally as a Materialized View.
3. Materialized View
In the context of a streaming database, a materialized view is a pre-computed summary of a real-time data stream that is continuously updated as new data arrives.
It is a database object that stores the results of a streaming SQL, allowing for faster access and analysis of the data. It can be used to reduce the time and resources needed to process large amounts of streaming data.
Common data sources for streaming databases are: Apache Kafka, Redpanda, Apache Pulsar, Amazon Kinesis, Google Cloud Pub/Sub, etc.
Streaming databases can continuously load streaming data from such sources, and save them in the database.
Alternatively, real-time data can be pushed to streaming databases with REST API or other protocols.
Files and records in traditional databases, SaaS, object storage, or data lake can be imported to streaming databases with built-in or 3rd-party connectors.
Also known as Destination, or Downstream, the results of streaming SQL can be sent to:
Message bus: Apache Kafka, Redpanda, Apache Pulsar, Amazon Kinesis, Google Cloud Pub/Sub
Database, data warehouse, data lake
S3 and other object storage, file system
Notification systems, e.g. Slack, Teams, PagerDuty, or any Webhook
API consumer or BI system, via WebSocket or SSE
6. Stream, Event Time, and Substream
A streaming database creates and manages data streams, in the similar way as a OLAP database creates and manages tables. One stream represents one type of real-time data, usually in the append-only mode.
All events in the streams have a timestamp associated with them. It should reflect when the event happened. This event time is critical to determine whether the event is too late to be processed (under the Watermark) or out-of-order.
Some streaming databases also support sub-stream, a way of partitioning a data stream into smaller, more manageable pieces that can handle watermark and out-of-order independently. For example, in the use case of Fleet Management, sensors on different trucks may have different time skew (the gap between the event time and when it is processed or analyzed).
7. Late Event and Watermark
There could be many reasons why the streaming data comes late. Such as network latency, or the clock on the device is later than the wall clock. If we are looking for the total number of orders for each minute, we don’t want to keep waiting for the late events and delay the aggregation. That’s why a “watermark” is introduced.
A watermark is a timestamp that represents the “current” point in a streaming SQL, for example a watermark might represent the time at which the most recent sensor reading was received. As new sensor readings arrive, the watermark is updated to reflect the current position. Late events or out-of-order events which are below the watermark won’t be included in the streaming computation.
8. Time Windows: Tumble, Hop, and Session
It’s common to aggregate the streaming data with a certain time window.
9. Change Log Capture (CDC)
CDC captures and stores changes made to data as it is updated, inserted or deleted in a database or other data sources. These changes are then made available to the streaming databases for real-time processing or analysis.
Without CDC, data engineers have to wait for a batch process to run and analyze the data at a later time.
Popular open source CDC tools are Debezium and Maxwell’s daemon.
Welcome to a New Era of Streaming Databases 🎉
For data engineers and analysts, you should have “Streaming Database” in your toolbox, to simplify the data stack, reduce end-to-end latency, reduce costs, and deliver faster.