Optimizing ultra-low latency for your data visualization
At Timeplus, we are building a streaming-first real-time analytics platform. Our frontend needs to visualize a large amount of real-time data with ultra-low latency (you can try it here!). The traditional pull-based fetch API doesn’t fit our needs if we want to provide real-time updates to our users. A better solution is to adopt the push model: when there is new data on the server side, the server pushes it to the frontend. This is how we can update what the user sees continuously and instantaneously.
The main technologies that support this kind of push model are WebSocket (WS) and Server-sent events (SSE). There are already a lot of great articles comparing the features of these two options, so we won't repeat them here. However, we are also interested in how the two perform. There aren't many articles out there discussing their performance, so we decided to find out ourselves.
Here are a few topics we are investigating:
Which one is generally faster?
How can we fine-tune them? (e.g. How can we find a balance between batch size and latency? Will the throughput increase linearly as batch size increases?)
How many resources will each consume for a certain throughput?
Before starting the test, let us briefly introduce our use case.
Our use case
At Timeplus, the two major frontend components requiring a push model are the streaming table and the real-time chart.
Streaming table
A streaming table needs real-time raw events from the backend server. Imagine a user running a query to search for all the raw events inside a stream. The server will need to stream all the events back to the browser, and then the browser will need to append the events into a table. The data volume here can be HUGE.
// Sample query
select * from car_live_data
// Results streamed to the frontend, one event per row
[1,16.64609333871044,54.12955095924333,39,"73.12",934.1228226766616,0,"c00777","2022-07-13T21:04:08.419Z","2022-07-13T21:04:08.419Z","1970-01-01T00:00:00Z"]
[1,135.15407205751052,67.16106649459257,44,"49.18",923.1921453934982,0,"c00452","2022-07-13T21:04:08.419Z","2022-07-13T21:04:08.419Z","1970-01-01T00:00:00Z"]
[1,-110.6063911072836,18.11142726885563,83,"79.94",1251.1098362542486,0,"c00709","2022-07-13T21:04:08.42Z","2022-07-13T21:04:08.42Z","1970-01-01T00:00:00Z"]
...
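As a rough illustration of the streaming table's client side, here is a minimal sketch of a push handler. It is not our production code. It assumes each pushed message is a text frame containing a JSON array of rows, and the names `PushSource`, `attachRowHandler`, and `RowBuffer` are hypothetical. Because both `EventSource` and `WebSocket` deliver messages through an `onmessage` callback whose event carries the payload in `data`, the same handler shape can serve either transport.

```typescript
// Minimal stand-in for the part of EventSource / WebSocket used here:
// an `onmessage` callback whose event exposes the payload as `data`.
// Assumption: messages are text, not binary.
interface PushSource {
  onmessage: ((event: { data: string }) => void) | null;
}

type Row = unknown[];

// Assumption: every pushed message is a JSON array of rows (one batch).
function attachRowHandler(
  source: PushSource,
  onRows: (rows: Row[]) => void
): void {
  source.onmessage = (event) => {
    const rows = JSON.parse(event.data) as Row[];
    onRows(rows);
  };
}

// Hypothetical bounded buffer: keeps only the newest `maxRows` rows so that
// a very large result stream cannot grow the table without limit.
class RowBuffer {
  private rows: Row[] = [];
  private readonly maxRows: number;

  constructor(maxRows: number) {
    this.maxRows = maxRows;
  }

  append(batch: Row[]): void {
    for (const row of batch) {
      this.rows.push(row);
    }
    const overflow = this.rows.length - this.maxRows;
    if (overflow > 0) {
      this.rows.splice(0, overflow); // drop the oldest rows
    }
  }

  snapshot(): Row[] {
    return this.rows.slice();
  }
}
```

In a browser, `source` would be an `EventSource` or `WebSocket` instance, and `onRows` would feed the buffer that backs the table.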
Real-time chart
A real-time chart normally requires aggregated data. Generally speaking, the data volume is smaller.
// Sample query
SELECT window_start, max(speed_kmh) as speed_kmh FROM tumble(car_live_data, INTERVAL 1 s) group by window_start
// Results
["2022-07-13T21:09:40Z",49.45]
["2022-07-13T21:09:41Z",45.64]
["2022-07-13T21:09:42Z",48.02]
...
Ideally we’d like to render the component as frequently as possible. However, in order to balance performance and user experience, we've decided to cap our render interval at 200ms. Based on our experience, 100-500ms render intervals are reasonably good.
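One way to enforce such a cap is to buffer incoming rows and hand them to the renderer at most once per interval. The sketch below illustrates that idea only. The class name `ThrottledRenderer`, its methods, and the injectable clock are assumptions made for this example, not our actual implementation.

```typescript
// Hypothetical helper: collects pushed rows and calls `render` at most once
// per `intervalMs`, no matter how often batches arrive.
class ThrottledRenderer<T> {
  private pending: T[] = [];
  private lastRenderAt = -Infinity;
  private readonly intervalMs: number;
  private readonly render: (rows: T[]) => void;
  private readonly now: () => number;

  constructor(
    intervalMs: number,
    render: (rows: T[]) => void,
    now: () => number = Date.now // injectable clock, handy for testing
  ) {
    this.intervalMs = intervalMs;
    this.render = render;
    this.now = now;
  }

  // Called for every pushed batch; only buffers unless a render is due.
  push(batch: T[]): void {
    for (const row of batch) {
      this.pending.push(row);
    }
    this.flushIfDue();
  }

  // Intended to also run from a timer, so rows that arrive just after a
  // render are not left waiting for the next pushed batch.
  flushIfDue(): void {
    const t = this.now();
    if (this.pending.length === 0 || t - this.lastRenderAt < this.intervalMs) {
      return;
    }
    const rows = this.pending;
    this.pending = [];
    this.lastRenderAt = t;
    this.render(rows);
  }
}
```

With `intervalMs` set to 200, a stream pushing 1,000 batches per second would still trigger at most five renders per second, each covering everything that arrived since the previous one.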
Test setup
To keep the comparison simple, we hold the following settings constant, as they are not part of the comparison:
Event payload
[0, "2022-03-29T20:35:56.581Z", 1, 116.9232297, 40.676318, 4745, "c00070", 78, "32.07", "1970-01-01T00:00:00Z", "column 1", "another field"]
Connection protocol: HTTP/2 (needed for Server-sent events to get past the browser's limit of 6 concurrent connections per domain under HTTP/1.1)
Interval per batch: 1ms (1,000 batches per second)
Compression: None
Chrome: Version 103.0.5060.53
Hardware: Apple M1 Pro with 10 cores (8 performance and 2 efficiency), 32 GB memory, macOS Monterey (12.4)
Test I: Batch size
The main purpose of this test is to compare throughput, measured in events per second (EPS), and CPU utilization across different batch sizes. All of the tests have their concurrency fixed at 10. Intuitively, we expect larger batch sizes to give better throughput. What we want to find out is whether WS and SSE perform similarly at the same batch size.
First, let's look at the results for SSE. The chart below shows that EPS and CPU utilization increase roughly linearly as the batch size grows from 1 to 400, reaching 3,000,000 EPS, or roughly 400 MiB/s. At a batch size of 400, the CPU utilization of the tab reaches 100%, which becomes the bottleneck of the test. Increasing the batch size further doesn't help beyond this point.
Next, let's look at the results for WS. As with SSE, the EPS of WS increases almost linearly. We observed that at the same batch size, the CPU utilization of WS is slightly lower than that of SSE. This means that WS uses the CPU more efficiently; all other conditions being equal, WS should therefore support higher throughput than SSE.
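The relationship between EPS and MiB/s above can be sanity-checked with a little arithmetic. The sketch below assumes that throughput is simply EPS multiplied by the size of the test payload from the setup section, ignoring any per-message framing that SSE or WS adds. The function name `payloadThroughputMiB` is made up for this example.

```typescript
// The event payload from the test setup. It is ASCII only, so its string
// length equals its UTF-8 byte length (roughly 140 bytes).
const samplePayload =
  '[0, "2022-03-29T20:35:56.581Z", 1, 116.9232297, 40.676318, 4745, "c00070", 78, "32.07", "1970-01-01T00:00:00Z", "column 1", "another field"]';

const payloadBytes = samplePayload.length;

// Assumption: raw payload bytes only, no protocol framing or headers.
function payloadThroughputMiB(eventsPerSecond: number, bytesPerEvent: number): number {
  return (eventsPerSecond * bytesPerEvent) / (1024 * 1024);
}

const sseThroughput = payloadThroughputMiB(3_000_000, payloadBytes);
console.log(`${payloadBytes} bytes/event, about ${sseThroughput.toFixed(0)} MiB/s`);
```

Under these assumptions the result lands close to the 400 MiB/s figure quoted above.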
Connection | Max EPS | Batch size | Client CPU % | Server CPU % |
WebSocket | 4,000,000 | 800 | 175 | 70 |
Server-sent events | 3,100,000 | 450 | 165 | 75 |
Test II: Concurrency
This test is similar to the previous test, but now we are focusing on how concurrency impacts performance. In this part, we will use 50 as the fixed batch size.
The first chart below shows SSE's results. Again, the EPS increases almost linearly up to a certain point. The maximum EPS is reached at a concurrency of 60, at which point the CPU utilization of the tab reaches 100%.
For WS, the results were similar to SSE's at the beginning. As we increased the concurrency from 1 to 50, the EPS was almost identical to SSE's. However, we noticed two things:
WS has a lower client CPU utilization compared to SSE
Beyond a concurrency of 50, we no longer see a linear performance increase. The maximum EPS we can get with WS is close to, but not as good as, SSE's (2.4 million vs 2.7 million)
Here is a comparison of client CPU utilization between WS and SSE when their EPS are almost identical. Notice that WS has the lower CPU utilization.
Test III: Real-world scenarios
In Tests I and II, we were just trying to push the limits of SSE and WS and see how far we can go. Practically speaking, it is just not possible to visualize such a large volume of data in the frontend. In this test, we are trying to evaluate more practical scenarios.
We assume a target latency of 50ms, which means sending 20 batches per second, and a total of 100,000 events per second to be streamed to the frontend.
Batches per second: 20
Events per second: 100,000
Scenario 1: Fewer connections with high throughput per connection
Connections: 20
Events per batch per connection: 250
Scenario 2: Many connections with lower throughput per connection
Connections: 100
Events per batch per connection: 50
Scenario | Connection | Client CPU % | Server CPU % |
1 | WS | 35 | 6 |
1 | SSE | 23 | 8 |
2 | WS | 30 | 12 |
2 | SSE | 30 | 11 |
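The two scenario configurations above are meant to carry the same total load. A quick check of the arithmetic, using only the numbers listed for this test (the type and function names are made up for the example):

```typescript
interface Scenario {
  name: string;
  connections: number;
  eventsPerBatch: number; // events per batch per connection
}

const targetLatencyMs = 50;
// One batch per latency window: 1000 ms / 50 ms = 20 batches per second.
const batchesPerSecond = 1000 / targetLatencyMs;

// Total events per second across all connections of a scenario.
function totalEps(s: Scenario): number {
  return s.connections * s.eventsPerBatch * batchesPerSecond;
}

const scenarios: Scenario[] = [
  { name: "Scenario 1", connections: 20, eventsPerBatch: 250 },
  { name: "Scenario 2", connections: 100, eventsPerBatch: 50 },
];

for (const s of scenarios) {
  console.log(`${s.name}: ${totalEps(s)} events per second`);
}
```

Both scenarios work out to 100,000 events per second, so the difference in CPU utilization comes from how the load is split across connections rather than from the load itself.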
Other Weird Learnings
We love sharing learnings that might be helpful to the community. So, while not directly relevant to the performance testing, I thought I'd share some other findings that might save you a bit of time.
There is only one Chrome helper process (Google Chrome Helper) handling the network for all the tabs. As a result, once this process uses 100% CPU, opening more tabs won’t help.
1 Tab 3,000,000 EPS
2 Tabs 4,100,000 EPS
3 Tabs 4,000,000 EPS
There is a great article that explains the multi-process architecture of Chrome. Check it out here if you want to learn more.
Chrome DevTools can significantly impact performance (CPU-wise)
When DevTools is open, CPU utilization increases. Here is the result of one of our tests (batch size 50, concurrency 30, WebSocket).
With DevTools closed, the EPS was 1,350,000. The CPU utilization of the tab was around 75%, while the network utilization was 55%.
With DevTools open, the CPU utilization of the tab jumped to 130% (from 75%) and the CPU utilization of the browser jumped to 100% (from 0%!). Additionally, the EPS dropped to 1,000,000. At that point, the tab's CPU utilization had already become the bottleneck for EPS.
Conclusion
Let's get back to the question we asked at the beginning: which one is generally faster?
Well, the answer may seem boring: they are really close. Our tests didn't show much of a performance difference between SSE and WS across these scenarios. In practice, the majority of CPU time should be spent parsing and rendering the data rather than transporting it. Given the relative parity in performance, we recommend focusing on functional requirements when choosing between them. For example, if your application needs bi-directional communication or needs to transmit binary data, WebSocket will be your best choice. On the other hand, if you just want a simple protocol to send UTF-8 data from server to client, Server-sent events should suit your needs.
We look forward to helping you get more out of your streaming data. Feel free to send any questions or comments to me at calvin.zhang@timeplus.io. You can also join our Timeplus community or sign up for our private beta to learn more.