Kevin Liu is an intern at Timeplus working with our Chief Technology Officer, Gang Tao. He is a fourth-year student at UCLA studying applied math and statistics. He plans to pursue graduate studies in applied math, focusing on machine learning and deep learning theory.
The ability to process and analyze real-time events is crucial for maintaining the security and efficiency of online systems. Cyberattacks such as Distributed Denial of Service (DDoS) are becoming more common as criminals and other actors choose to disrupt the functioning of services for financial or political gain. In this blog, I will explore the application of Timeplus in conjunction with OpenAI’s large language models (LLMs) for real-time DDoS attack detection.
Timeplus is a real-time streaming analytics platform that provides a simple SQL interface for processing streaming data and delivering insights instantly. The benefits of using Timeplus for this application lie in its simplicity and its end-to-end capabilities, which cover the whole data analysis life cycle from ingestion to presentation. The core engine of Timeplus is designed and implemented as a single binary, so it is easy to install and set up. Timeplus also supports the entire analysis pipeline for an online network, from ingesting network data to alerting on DDoS traffic, and all of this happens in real time.
Utilizing Timeplus streaming SQL, users can execute SQL queries to detect DDoS attacks by analyzing data flow sequences. This method efficiently identifies DDoS patterns that can be clearly defined by specific rules. However, some attack patterns may not be easily defined by simple rules, and this is where a Large Language Model (LLM) can be beneficial. The following diagram illustrates a typical workflow for DDoS detection scenarios. By integrating an LLM into the streaming SQL engine, users can discover new attack patterns by providing the LLM with sample attack patterns through in-context learning. This integration enhances existing rule-based analysis with greater flexibility, enabling the detection of previously elusive attacks.
In the above flow:
IP flow data is sent to a message bus like Apache Kafka
Timeplus reads Kafka data in real-time
A DDoS analysis query runs in real-time on new IP flow data
Inside the DDoS analysis SQL function, the OpenAI API is called so that an LLM can detect anomalous patterns in the stream
As soon as an attack is identified, an alert is triggered and sent to the downstream system, usually a rule engine
The rule engine will send attack IPs to the network equipment and block attackers
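The last three steps of this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Timeplus implementation: the `src_ip` field, the `run_pipeline` function, and the `is_attack` and `block` callbacks are hypothetical names, and the classifier is passed in as a plain function so that either a rule or an LLM call could stand behind it.

```python
from typing import Callable, Dict, Iterable, Set

Flow = Dict[str, object]  # hypothetical record shape: one IP flow with a "src_ip" field


def run_pipeline(
    flows: Iterable[Flow],
    is_attack: Callable[[Flow], bool],
    block: Callable[[str], None],
) -> Set[str]:
    """Classify each flow as it arrives and block each attacking source once."""
    blocked: Set[str] = set()
    for flow in flows:
        src = str(flow["src_ip"])
        if src in blocked:
            continue  # this source is already blocked, no need to classify again
        if is_attack(flow):
            blocked.add(src)
            block(src)  # stands in for the alert sent to the rule engine
    return blocked
```

Passing `block=print` and a simple threshold rule as `is_attack` is enough to try the loop on a handful of hand-written flows.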
I used OpenAI GPT-3.5 with few-shot prompting to “teach” the model what DDoS traffic looks like, then “ask” the model to label a piece of network traffic in real-time. Few-shot prompting means providing labeled examples in the prompt so the LLM has context for a more complex task. The choice of model and method is based on this paper about applications of LLMs for DDoS detection and other cybersecurity risks. In this paper, the researchers used a “few-shot” method that constructs prompts by first randomly selecting 10 samples of labeled network traffic data from the CIC-IDS2017 dataset (an intrusion detection evaluation dataset). The features of this data are: minimum backward packet length, standard deviation of backward packet length, average packet size, flow duration, and standard deviation of time between packets. The labels for the data are “Benign” or “DDoS,” indicating the type of network traffic. These 10 samples are used as context for the LLM. Then, a sample of unlabeled data is provided as a test case. The study found that LLMs can achieve 90% DDoS detection accuracy on the CIC-IDS2017 dataset.
One of the greatest benefits of using an LLM instead of traditional machine learning or deep learning methods for this binary classification task is the accuracy of LLMs. The researchers of the LLM DDoS detection paper found that in all tests, OpenAI GPT-3.5 outperformed a neural network trained for the same task. Clearly the accuracy benefits of LLMs warrant their application to the task of detecting DDoS network traffic. The benefits do not stop there: one even more exciting benefit of LLMs is that they can provide reasoning for their answers. Where traditional machine learning methods only allow predictions on whether network traffic is DDoS, an LLM can make a prediction as well as give the reason for why it made that prediction. LLMs combine accurate predictions with reasoning capabilities and are a powerful tool for modern DDoS detection systems. Combining these capabilities with the real-time processing power of Timeplus, we get a system that is highly effective at real-time DDoS detection while being expressive.
Let’s take a look at the design for this system:
We have a historical training dataset and a live data source of network traffic that are processed using Timeplus in real-time. This processing is handled by a Remote User-Defined Function (UDF) that creates a prompt using the inputs and calls an LLM that will detect DDoS traffic. The result is that the LLM can fairly accurately label the incoming traffic as DDoS/Benign.
All code and data can be found at https://github.com/kliuc/openai-ddos-udf
To create a working sample, we first need a live data source. To emulate real network traffic, we can use a Proton random stream as follows:
CREATE RANDOM STREAM network(
  bwd_packet_length_min float default rand()%7,
  bwd_packet_length_std float default rand()%2437,
  avg_packet_size float default rand()%1284 + 8,
  flow_duration float default rand()%1452333 + 71180,
  flow_iat_std float default rand()%564168 + 19104
) SETTINGS eps=0.1
This will create a random stream of data from within Timeplus itself similar to the CIC-IDS2017 dataset that we can test on. Of course, we could use an external data generator here as well. Here are some of the generated data:
Note: we use the setting eps=0.1 to limit the generation of data to one event per 10 seconds in order to account for the latency involved with making an OpenAI API request.
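To make the value ranges and the event rate concrete, here is a small Python sketch that mimics the random stream outside of Timeplus. It assumes that `rand()` returns a non-negative integer, so that `rand()%N + K` yields values from K to K + N - 1; the `random_flow` function and the `EPS` constant are names made up for this illustration.

```python
import random

EPS = 0.1                          # events per second, as in SETTINGS eps=0.1
SECONDS_BETWEEN_EVENTS = 1 / EPS   # one event roughly every 10 seconds


def random_flow(rng: random.Random) -> dict:
    """Draw one synthetic flow with the same ranges as the random stream."""
    return {
        "bwd_packet_length_min": float(rng.randrange(7)),        # 0 .. 6
        "bwd_packet_length_std": float(rng.randrange(2437)),     # 0 .. 2436
        "avg_packet_size": float(rng.randrange(1284) + 8),       # 8 .. 1291
        "flow_duration": float(rng.randrange(1452333) + 71180),  # 71180 .. 1523512
        "flow_iat_std": float(rng.randrange(564168) + 19104),    # 19104 .. 583271
    }
```

Calling `random_flow(random.Random(0))` in a loop gives a quick way to produce test inputs for the detection function without a running stream.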
We will use a Remote User-Defined Function (UDF) to handle the DDoS detection. Timeplus allows UDFs to be defined locally or as a remote service. For this exercise, I chose a Remote UDF so that I could use the Python environment and the libraries we need. The function, written in Python (see is_ddos.py in the GitHub repo), implements the DDoS detection algorithm as detailed in the research paper. First, 10 samples of labeled training data are selected from the CIC-IDS2017 dataset. The function then takes a single piece of testing data and constructs a prompt to send to the OpenAI API. The prompts look like the following:
{System}
You will be provided with a sample of network traffic data that is split between training data and a single testing data (separated by '###'). Each row of data is separated by a newline, and each row has features that are separated by a pipe symbol ('|'). Using information from the training data, predict the best label (BENIGN or DDoS) for the testing data. First explain your reasoning for the selected label. Then indicate the predicted label with '@@@' on each side.
{User}
Bwd Packet Length Min: A | Bwd Packet Length Std: B | Average Packet Size: C | Flow Duration: D | Time Between Packets Std: E | Label: BENIGN
Bwd Packet Length Min: A | Bwd Packet Length Std: B | Average Packet Size: C | Flow Duration: D | Time Between Packets Std: E | Label: DDoS
Bwd Packet Length Min: A | Bwd Packet Length Std: B | Average Packet Size: C | Flow Duration: D | Time Between Packets Std: E | Label: BENIGN
Bwd Packet Length Min: A | Bwd Packet Length Std: B | Average Packet Size: C | Flow Duration: D | Time Between Packets Std: E | Label: DDoS
.
.
.
###
Bwd Packet Length Min: 0 | Bwd Packet Length Std: 0 | Average Packet Size: 0 | Flow Duration: 0 | Time Between Packets Std: 0
Let’s take a closer look at the is_ddos function:
import re

import openai
import pandas as pd


def is_ddos(bwd_packet_length_min, bwd_packet_length_std, avg_packet_size, flow_duration, flow_iat_std):
    # read training data and select samples/features
    friday = pd.read_csv('Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')
    friday.columns = [column.strip() for column in friday.columns]
    train = friday[['Bwd Packet Length Min', 'Bwd Packet Length Std', 'Average Packet Size', 'Flow Duration', 'Flow IAT Std', 'Label']]
    # draw five benign and five DDoS rows, benign first, so promptify can alternate them
    benign = train[train['Label'] == 'BENIGN'].sample(5)
    ddos = train[train['Label'] == 'DDoS'].sample(5)
    train = pd.concat([benign, ddos])
    test = pd.DataFrame([[bwd_packet_length_min, bwd_packet_length_std, avg_packet_size, flow_duration, flow_iat_std]])

    # helper function to format data into string format usable as a prompt
    def promptify(df):
        column_names = ['Bwd Packet Length Min', 'Bwd Packet Length Std', 'Average Packet Size', 'Flow Duration', 'Time Between Packets Std', 'Label']
        formatted_rows = []
        for index, row in df.iterrows():
            formatted_row = ' | '.join([f'{column_names[i]}: {row.iloc[i]}' for i in range(len(row))])
            formatted_rows.append(formatted_row)
        # take rows alternately from the front and the back of the list
        interleaved_rows = []
        while len(formatted_rows) > 1:
            interleaved_rows.append(formatted_rows.pop(0))
            interleaved_rows.append(formatted_rows.pop(-1))
        if len(formatted_rows) == 1:
            interleaved_rows.append(formatted_rows[0])
        return '\n'.join(interleaved_rows)

    # assemble the prompt and call OpenAI API to get completion output
    system_prompt = '''You will be provided with a sample of network traffic data that is split between training data and a single testing data...'''
    user_prompt = promptify(train) + '\n###\n' + promptify(test)
    completion = openai.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt}
        ]
    )
    output = completion.choices[0].message.content
    print(output)

    # extract and return the predicted label from the completion output
    match = re.search(r'(?<=@{3}).+(?=@{3})', output)
    if match is None:
        raise ValueError('no @@@label@@@ marker found in the model output')
    label = match.group().strip().lower()
    return label == 'ddos'
The function is_ddos takes five arguments, each a feature of a single network flow, and returns a boolean value indicating whether or not the flow is a DDoS flow. First, is_ddos reads the training dataset as a Pandas dataframe and selects the columns corresponding to the training features and the label. Then, five samples of benign network flow and five samples of DDoS network flow are randomly sampled from the dataset. The five input arguments are also assembled into a Pandas dataframe. I use a helper function, promptify, to convert the training and testing dataframes into formatted strings usable as prompts. Each row of data is made to look like:
Bwd Packet Length Min: A | Bwd Packet Length Std: B | Average Packet Size: C | Flow Duration: D | Time Between Packets Std: E | Label: BENIGN
Additionally, the 10 training samples are formatted so they alternate between benign and DDoS samples (benign, DDoS, benign, DDoS, benign, etc.). We do this because the researchers found that evenly distributing relevant training data in the context yielded the best results. Now that all the necessary parts are collected, the entire prompt is assembled and sent to OpenAI GPT-3.5 using the OpenAI API. The output contains a paragraph of reasoning followed by the predicted label. is_ddos uses regex to extract the predicted label, and finally outputs the boolean value indicating whether the prediction is DDoS or benign traffic.
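The interleaving and label-extraction steps can be tried in isolation without pandas or an API key. The sketch below uses two standalone helpers, `interleave` and `extract_label`, which are illustrative names rather than functions from the repo. Note that the alternation only comes out as benign, DDoS, benign, DDoS when the input list holds all rows of one class first and an equal number of rows of the other class after them.

```python
import re
from typing import List, Optional


def interleave(rows: List[str]) -> List[str]:
    """Take rows alternately from the front and the back of the list."""
    remaining = list(rows)  # copy so the caller's list is left untouched
    result = []
    while len(remaining) > 1:
        result.append(remaining.pop(0))
        result.append(remaining.pop(-1))
    result.extend(remaining)  # at most one leftover row when the count is odd
    return result


def extract_label(output: str) -> Optional[str]:
    """Return the lowercased text between '@@@' markers, or None if absent."""
    match = re.search(r'@@@\s*(.+?)\s*@@@', output)
    return match.group(1).lower() if match else None
```

Returning None when the markers are missing lets the caller decide how to treat a malformed response, for example by retrying the request.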
Now, we need to deploy our system and create the UDF. First, make sure to create a .env file which contains your OpenAI API key:
OPENAI_API_KEY= key here
Then,
docker build -t ddos_detection .
docker compose up
To create the UDF, we register a new UDF in Timeplus Console, making sure to add all arguments, setting appropriate data types, and setting the URL via which the remote UDF will be called.
Or we can run DDL to create the remote UDF:
CREATE REMOTE FUNCTION is_ddos(
  bwd_packet_length_min float64,
  bwd_packet_length_std float64,
  avg_packet_size float64,
  flow_duration float64,
  flow_iat_std float64
) RETURNS bool
URL 'http://ddos-server:5001/is_ddos'
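On the server side, the endpoint behind this URL has to turn an incoming request into one call of the detection function per row. The sketch below shows one way that could look, under an assumption that should be checked against the Timeplus remote UDF documentation: the request body is taken to be a JSON object with one array per argument name, and the response a JSON object with a `result` array of the same length. The `handle_udf_request` function and `ARG_NAMES` list are hypothetical, and the classifier is injected so the handler can be tested without calling OpenAI.

```python
import json
from typing import Callable

# Argument names as declared in CREATE REMOTE FUNCTION
ARG_NAMES = [
    "bwd_packet_length_min",
    "bwd_packet_length_std",
    "avg_packet_size",
    "flow_duration",
    "flow_iat_std",
]


def handle_udf_request(body: str, classify: Callable[..., bool]) -> str:
    """Convert an assumed columnar JSON request into a JSON list of booleans."""
    payload = json.loads(body)
    columns = [payload[name] for name in ARG_NAMES]  # one list of values per argument
    # zip(*columns) regroups the columns into rows, one row per network flow
    results = [bool(classify(*row)) for row in zip(*columns)]
    return json.dumps({"result": results})
```

In a real deployment, `classify` would be the is_ddos function and this handler would sit behind whatever web framework serves the /is_ddos route.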
Finally, we can test our real-time DDoS system! Run the following query:
SELECT *,
is_ddos(
bwd_packet_length_min,
bwd_packet_length_std,
avg_packet_size,
flow_duration,
flow_iat_std
)
FROM
network
Here is the query output:
The last column “is_ddos” shows the output from the LLM, tagging each piece of new network data in real-time! As we can see, some examples of DDoS traffic flow have been detected!
We can look at the ddos-detection server logs to understand why some network traffic was labeled as DDoS. Let’s take a look at the response for the first row of the query output:
The reasoning capability of LLMs is in action here, providing precise reasons for the prediction of DDoS. The model identifies higher values for "Bwd Packet Length Min", "Bwd Packet Length Std", "Average Packet Size", "Flow Duration", and "Time Between Packets Std" as indicators of DDoS network flow.
In this blog I have shown how Timeplus and LLMs can be used together for real-time DDoS detection. These two tools work very well together and are very promising for event processing tasks:
Timeplus offers simple and easy-to-learn continuous monitoring capabilities. This whole project took me only about half a day’s time to develop and test. For new users, experimenting with this setup is as simple as running the project in Docker, then writing just two SQL queries (one to create a stream, then one to query real-time DDoS detection).
LLMs have proven to be very smart, flexible, and extremely capable of learning complex patterns with minimal training. In this example, all the sample data and inference data are included in the prompt sent to the LLM, so there is no need to train the model. Adding new samples to the prompt is easy to implement; the only thing to consider is the size of the context supported by the LLM.
LLMs have limitations as well. One issue is that they take a significant amount of time to generate responses, which makes them less than ideal for real-time detection. In my implementation for DDoS detection, the latency between incoming network traffic and detection is up to 10 seconds. A typical fix for this issue would be to add embedding functionality and a vector search that stores new network data as it is labeled in real time. Instead of making a new API call for each incoming piece of data, the system would embed and store network data in real-time as it is labeled DDoS/Benign. This approach allows the system to quickly reference previously labeled data for similar network flows, bypassing the need for redundant API calls.
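The caching idea can be sketched without any vector database. The example below is a simplified stand-in, not a tested design: it uses scaled raw feature vectors in place of learned embeddings and a linear scan in place of a vector index, and the names `LabelCache`, `classify_with_cache`, and `call_llm` are all hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


class LabelCache:
    """Stores labeled flows and reuses the label of a close enough neighbor."""

    def __init__(self, scales: Sequence[float], max_distance: float):
        self.scales = list(scales)        # per-feature divisors so features are comparable
        self.max_distance = max_distance  # reuse a label only within this distance
        self.entries: List[Tuple[List[float], bool]] = []

    def _normalize(self, features: Sequence[float]) -> List[float]:
        return [value / scale for value, scale in zip(features, self.scales)]

    def lookup(self, features: Sequence[float]) -> Optional[bool]:
        point = self._normalize(features)
        best_label, best_distance = None, math.inf
        for stored, label in self.entries:  # linear scan; a vector index would replace this
            distance = math.dist(point, stored)
            if distance < best_distance:
                best_label, best_distance = label, distance
        return best_label if best_distance <= self.max_distance else None

    def store(self, features: Sequence[float], label: bool) -> None:
        self.entries.append((self._normalize(features), label))


def classify_with_cache(
    features: Sequence[float],
    cache: LabelCache,
    call_llm: Callable[[Sequence[float]], bool],
) -> bool:
    """Answer from the cache when possible, otherwise ask the LLM and remember it."""
    cached = cache.lookup(features)
    if cached is not None:
        return cached
    label = call_llm(features)
    cache.store(features, label)
    return label
```

The distance threshold controls the trade-off: a larger value saves more API calls but risks reusing a label for a flow that is not really similar.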
Another limitation is data privacy. For enterprise users, sending private data to OpenAI or another cloud LLM is a real concern. A privately hosted LLM could be a solution.
This blog demonstrates one promising application for cybersecurity, but a similar setup can be applied to several different use cases (e.g. finance, location-based systems, etc.). Timeplus does an excellent job of handling any real-time event processing needs.