Build a Real-Time App for GitHub in 2 Minutes

Updated: Aug 25

We all love GitHub. But do you know what’s trending on GitHub right now? Do you know which repos have received the most pushes or PR reviews over the past 10 minutes? There are daily and weekly leaderboards at https://github.com/trending, but no real-time feeds.

At Timeplus, we love exploring new ways to help developers build their own real-time insights. In this spirit, I created what I hope will be a useful real-time GitHub analytics tool for all of our developer friends. Because it shows how I built a real-time dashboard for GitHub using Timeplus, it also serves as a first tutorial for our platform.

Everyone is invited to try the live demo at: https://timeplus-io-streamlit-apps-demo-wjt6x1.streamlitapp.com/

Here is a screenshot:


A real-time dashboard built with Timeplus API and GitHub API, with 120 lines of code

If you zoom in a bit, you can see what repos were just created a few minutes ago, and which repos got a lot of stars. Even the total event count increases every single second.

What’s more exciting is: besides the dashboard, you can run ad-hoc queries to explore the data and build real-time charts.


Get real-time insights for GitHub in Timeplus

That’s Timeplus! A new platform for everyone to build fast, powerful and intuitive real-time analytics, with just a few clicks or a few API calls. Want to learn how to build your own powerful real-time analytics? Read on.

 

To build real-time analytics applications with Timeplus, you only need to follow two steps:

  1. Load real-time data from GitHub and send it to Timeplus

  2. Write SQL to explore the data and find answers

Yes, you read that right. No need to set up Amazon Kinesis, Apache Kafka, or Apache Flink. Instead, just write a dozen lines of Python code and use SQL for streaming analytics.

Create a data stream in Timeplus

There is no “table” in Timeplus. Data lives in “Streams”. Create a stream for the GitHub data:




The schema is fairly straightforward. All GitHub events contain the id, created_at, actor, type, and repo attributes, plus a payload attribute: a JSON string whose schema differs by event type. We will demonstrate how Timeplus flexibly processes JSON data shortly.
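
As a rough sketch of how one event could map onto these columns, the helper below flattens a PyGithub-style event object into a dictionary. The name `event_to_row` and the choice to store actor and repo as plain strings are assumptions made for illustration; they are not part of the Timeplus or PyGithub APIs.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def event_to_row(event):
    """Flatten an event object into the six columns of the stream.

    Assumes `event` exposes id, created_at, actor.login, type, repo.name
    and a dict payload, as PyGithub's Event objects are expected to.
    """
    return {
        "id": event.id,
        "created_at": event.created_at.isoformat(),
        "actor": event.actor.login,
        "type": event.type,
        "repo": event.repo.name,
        # Keep the payload as a JSON string; its schema varies by event type.
        "payload": json.dumps(event.payload),
    }


# A stand-in event so the sketch runs without calling the GitHub API.
sample = SimpleNamespace(
    id="1",
    created_at=datetime(2022, 8, 25, 12, 0, tzinfo=timezone.utc),
    actor=SimpleNamespace(login="octocat"),
    type="CreateEvent",
    repo=SimpleNamespace(name="octocat/hello-world"),
    payload={"ref": "0.0.7", "ref_type": "tag", "master_branch": "main"},
)
row = event_to_row(sample)
```

Serializing the payload with json.dumps keeps a single string column that can hold every event type, which matches the stream layout described above.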

Load real-time data from GitHub

We will use the GitHub Events API. It’s recommended to create a dedicated personal access token (PAT) to call the REST API. You don’t need to select any checkbox while creating the token, since we will only list public events. Note that although Timeplus is a real-time processing system, the public events from the GitHub API have a 5-minute delay, as the GitHub API docs state:

We delay the public events feed by five minutes, which means the most recent event returned by the public events API actually occurred at least five minutes ago — GitHub API Docs

To load the data, I chose PyGithub, which makes it simple to fetch events with a few lines of code.

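import os
from github import Github
# Authenticate with a personal access token read from the environment
g = Github(os.environ.get("GITHUB_TOKEN"), per_page=100)
# get_events() returns a paginated list of recent public events
events = g.get_events()
for e in events:
    # Each `e` is a live GitHub event; printing a few fields is a placeholder
    print(e.id, e.created_at, e.type, e.repo.name)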

Send real-time data to Timeplus

At Timeplus, we provide both a REST API and an SDK to make it easy for you to push data to our cloud service.

The first half of the code sets up a secure connection to the Timeplus server, with credentials read from environment variables. Stream().name("github_events").get() returns a handle for the data stream, and s.insert(..) adds records in batches.
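
The batching step can be sketched independently of the SDK. In the minimal example below, `insert` is a hypothetical callable standing in for whatever call pushes one batch to the stream (for instance the s.insert(..) mentioned above, whose exact signature is not assumed here); the helper names and the batch size of 100 are also assumptions.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def send_in_batches(rows, insert, size=100):
    """Call `insert(batch)` once per batch and return the number of rows sent.

    `insert` is a hypothetical stand-in for the call that pushes one batch
    to the stream; it is injected so this sketch runs without a server.
    """
    sent = 0
    for batch in batches(rows, size):
        insert(batch)
        sent += len(batch)
    return sent


# Demo with an in-memory sink instead of a real stream.
received = []
total = send_in_batches(range(250), received.append, size=100)
```

Passing the sink as a parameter keeps the batching logic testable offline; in a real loader the rows would come from the GitHub events loop.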

That’s it! With just 15 lines of Python code, you can push GitHub live events to Timeplus. We know building an effective real-time data stack isn’t easy. We focus on solving these messy real-time analytics details, so you can focus on delivering the core applications/functionalities that will drive success for your organization.

What’s more, you can analyze this live data in the Timeplus Console with SQL or streaming charts. For example:

  • SELECT count(*) FROM github_events to show the latest count of streaming events, automatically updated every 2 seconds

  • SELECT window_start, repo, count(*) FROM tumble(github_events,30s) GROUP BY window_start, repo to count events by repo every 30 seconds

  • SELECT window_start, repo, count(*) FROM hop(github_events, 1s, 30s) GROUP BY window_start, repo to count events by repo every 30 seconds and update results every second
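
To make the difference between the two window types concrete, here is a minimal plain-Python sketch of window assignment. It illustrates the general tumbling and hopping window idea and is not Timeplus's implementation; timestamps are integer seconds, and aligning windows to multiples of the step is an assumption.

```python
from collections import Counter


def tumble_start(ts, size):
    """Start of the single tumbling window of length `size` containing `ts`."""
    return ts - ts % size


def hop_starts(ts, step, size):
    """Starts of every hopping window [start, start + size) containing `ts`.

    Windows begin at multiples of `step`, so consecutive windows overlap
    whenever step < size.
    """
    # Smallest multiple of `step` that is strictly greater than ts - size.
    first = ((ts - size) // step + 1) * step
    # Largest multiple of `step` that is at or before ts.
    last = ts - ts % step
    return list(range(first, last + 1, step))


# (timestamp in seconds, repo) pairs standing in for streamed events.
events = [(3, "a/x"), (12, "a/x"), (31, "b/y"), (45, "a/x")]

# Tumbling windows: each event is counted in exactly one 30-second bucket.
tumble_counts = Counter((tumble_start(ts, 30), repo) for ts, repo in events)

# Hopping windows: with a 1-second step, each event falls into 30 windows.
windows_for_event = hop_starts(65, 1, 30)
```

With a 30-second tumbling window every event lands in one bucket, while a 1-second hop over a 30-second window places each event in 30 overlapping windows, which is why the hop query can refresh its results every second.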

Timeplus provides out-of-the-box live tables and charts. You can also use your favorite business intelligence tools to visualize the data.


Sample dashboard in Redash

The first screenshot in this blog is from a Streamlit app. With our Python SDK, it’s very easy to build such an app with a few lines of code. For example, to show the top 10 most active repos in the past 4 hours:

You only need to write about 10 lines of code.
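
The aggregation behind such a chart is a filtered count. As a minimal sketch in plain Python (without the Timeplus SDK or Streamlit; the function name `top_repos` and the input shape are assumptions made for illustration):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def top_repos(events, now, hours=4, n=10):
    """Return the `n` repos with the most events in the last `hours` hours.

    `events` is an iterable of (created_at, repo) pairs with timezone-aware
    datetimes; `now` is passed in so the result is reproducible.
    """
    cutoff = now - timedelta(hours=hours)
    counts = Counter(repo for created_at, repo in events if created_at >= cutoff)
    return counts.most_common(n)


# Hand-made events: the last one is older than 4 hours and is ignored.
now = datetime(2022, 8, 25, 12, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=5), "a/x"),
    (now - timedelta(hours=1), "a/x"),
    (now - timedelta(hours=3), "b/y"),
    (now - timedelta(hours=5), "c/z"),
]
ranking = top_repos(events, now)
```

In the live app the same ranking would come from a streaming query rather than an in-memory list, so the chart keeps updating as new events arrive.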

Last but not least, I want to show you how to use Timeplus’s flexible JSON support. The payload attribute in the data stream is actually a JSON string. Different types of GitHub events generate different types of payload. For example, CreateEvent, which is triggered when a git branch or tag is created, generates the following sample JSON payload:


{"ref": "0.0.7", "ref_type": "tag", "master_branch": "main", "description": null, "pusher_type": "user"}

To find the most popular master_branch names, we can run the following query:



SELECT payload:master_branch AS master_branch, count(*) AS cnt
FROM github_events WHERE type='CreateEvent'
GROUP BY master_branch ORDER BY cnt DESC

After observing the live events for 10 minutes, I was happy to see that more branches were created with main than with master!


|master_branch|cnt |
|-------------|----|
|main         |4102|
|master       |2486|
|develop      |96  |
|dev          |47  |
|development  |15  |
|rc-22.3000   |8   |
|gh-pages     |7   |
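
For readers who want to see what the query computes, here is a minimal plain-Python sketch of the same idea: filter the CreateEvent rows, parse the payload JSON, and count by master_branch. The function name and the row shape (dicts with `type` and `payload` keys) are assumptions for illustration, and the sample rows are made up rather than real results.

```python
import json
from collections import Counter


def count_master_branches(rows):
    """Count master_branch values across CreateEvent rows, most common first.

    Each row is a dict with a `type` string and a `payload` JSON string,
    mirroring the stream's columns.
    """
    counts = Counter()
    for row in rows:
        if row["type"] != "CreateEvent":
            continue
        payload = json.loads(row["payload"])
        counts[payload.get("master_branch")] += 1
    return counts.most_common()


# Made-up rows: three CreateEvents and one event of another type.
rows = [
    {"type": "CreateEvent", "payload": '{"ref_type": "tag", "master_branch": "main"}'},
    {"type": "CreateEvent", "payload": '{"ref_type": "branch", "master_branch": "main"}'},
    {"type": "CreateEvent", "payload": '{"ref_type": "branch", "master_branch": "master"}'},
    {"type": "PushEvent", "payload": '{"size": 1}'},
]
result = count_master_branches(rows)
```

The SQL version does the same work continuously on the stream, so there is no need to parse the JSON in application code.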

Timeplus supports many JSON functions. You can also create different views for different types of events to make the query more straightforward.


CREATE VIEW github_create_events AS
SELECT id, created_at,actor,repo,
payload:ref AS ref,
payload:ref_type AS ref_type,
payload:master_branch AS master_branch
FROM github_events WHERE type='CreateEvent';

-- filter by JSON attributes
SELECT * FROM github_create_events WHERE master_branch='main'
 

In summary, we believe building fast and powerful real-time analytics should be much easier and more intuitive than it is now. For this first tutorial, we demonstrated

  1. how to load data from the GitHub Events API into Timeplus with fewer than 20 lines of code

  2. how to query fresh data right away with out-of-the-box streaming queries and live charts. A REST API and a Python SDK are available to integrate Timeplus with your infrastructure and toolchain.

You can try the Timeplus playground today to run streaming queries or check our website to join our private beta or read more on our product offerings.

We will open-source our Python SDK very soon. Be sure to check it out. 【Update: the SDK is open-sourced https://github.com/timeplus-io/gluon 】 We can’t wait to hear how you use it to build your own amazing real-time insights, and how we can further help you on your real-time journey. As always, we are grateful for your feedback and support.
