Architecture Deep Dive

Video Streaming at Scale with Kubernetes and RabbitMQ

Deep dive into the problems video streaming sites face and how they can architect their infrastructure to manage the load.

Alexandre Olive · Published in Skeepers · 7 min read · Oct 9, 2023


A TV showing the Disney+ home screen. Photo by Marques Kaspbrak on Unsplash.

Streaming. That’s a word we hear a lot nowadays. Most of us use Netflix or YouTube daily. So much so that it has become part of everyone’s life, maybe too much for our own good.

But people rarely stop and wonder: how does it even work? From my developer’s point of view, it’s pure madness. There’s an enormous amount of data to store and push through the network, people worldwide must be able to access it without lag or hiccups, and it needs to work on every device.

I will not pretend I know how those apps work internally. They probably use concepts I never dreamed of to optimize every inch possible.

But don’t leave just yet; there’s still a reason I’m writing this article. I want to use my direct experience working as a technical lead on a streaming solution at Skeepers to explain how we manage to produce high-quality videos and stream them directly onto our clients’ websites, just like you would watch a video on YouTube.

I will discuss technical subjects like Kubernetes, RabbitMQ, and load balancers. A basic knowledge of those topics is necessary to follow the article.

To clarify, I’m talking about streaming as in watching a video online that is not a live stream. Playing a regular video on YouTube still counts as video streaming.

The video’s life: from upload to playback

I will take you on a journey from when a user uploads a video on our site to when you play it on your device and the challenges that come with it.

An editing studio processing a video. Photo by Jakob Owens on Unsplash.

First step: Video upload

Alright, the first step is when the video is uploaded. At this point, we don’t yet know the video’s format, codec, or even its resolution.

First, it is normalized, meaning we convert every video into the same format (initially MP4), stabilize the picture, and harmonize the audio to mitigate shaky footage or loud sounds.

We then break the video into multiple small chunks; the resulting format is an adaptive bitrate streaming format. There are different standards for adaptive bitrate streaming, like HLS or MPEG-DASH. The first one, HLS, was developed by Apple and is native on iOS. The Moving Picture Experts Group created the second one, MPEG-DASH, as an alternative to HLS.

You can read more about adaptive bitrate streaming here.
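
To make this more concrete, here is a rough sketch of what the chunking step can look like when a Node worker drives FFmpeg. This is not our exact pipeline; the file paths, the single 720p rendition, and the FFmpeg flags are illustrative assumptions.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Illustrative only: turn a normalized MP4 into one HLS rendition (720p).
// A real pipeline would produce several resolutions plus a master playlist
// so the player can switch bitrates on the fly.
async function createHlsRendition(inputPath: string, outputDir: string): Promise<void> {
  await run("ffmpeg", [
    "-i", inputPath,                // the normalized MP4 from the previous step
    "-vf", "scale=-2:720",          // resize to 720p, keeping the aspect ratio
    "-c:v", "libx264",              // H.264 video
    "-c:a", "aac",                  // AAC audio
    "-hls_time", "6",               // chunks of roughly 6 seconds
    "-hls_playlist_type", "vod",    // a complete, non-live playlist
    "-hls_segment_filename", `${outputDir}/chunk_%05d.ts`,
    `${outputDir}/index.m3u8`,      // the playlist that references every chunk
  ]);
}
```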

This task is time-consuming, so the user cannot just wait synchronously for the API’s response. It needs to be asynchronous.

Custom-made schema of the simplified architecture: a Node API sends messages to RabbitMQ, and a Node worker polls them and stores the resulting files in a GCP bucket.

You can see the asynchronous implementation using RabbitMQ in this schema. When a user uploads a new video, it first goes to cloud storage. We use Google Cloud Storage, but it could be any storage. Once the upload finishes, we create a processing task in the database and send the task ID to the queue. The user’s screen updates with a message saying that processing is in progress and asking them to wait a few minutes.
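
As a sketch of the API side, publishing the task could look roughly like this with the amqplib client. The queue name, the connection URL, and the enqueueProcessingTask function are made up for the example; only the idea of sending the task ID to the queue comes from our actual flow.

```typescript
import amqp from "amqplib";

const QUEUE = "video-processing"; // hypothetical queue name

// Illustrative only: once the file is in cloud storage and the processing
// task exists in the database, push the task ID onto the queue.
export async function enqueueProcessingTask(taskId: string): Promise<void> {
  const connection = await amqp.connect("amqp://localhost"); // placeholder URL
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });

  // persistent: true asks RabbitMQ to write the message to disk so it
  // survives a broker restart.
  channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify({ taskId })), {
    persistent: true,
  });

  await channel.close();
  await connection.close();
}
```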

A NodeJS worker constantly polls the queue, waiting for new tasks like a good soldier. When a new one is available, it fetches the processing task’s information from the database. It then uses FFmpeg under the hood to do the required job (normalization or creating the adaptive bitrate streaming format, for example), stores the resulting files in storage, and updates the task’s status in the database.
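
The worker side could look something like the sketch below, again with amqplib. The processVideo helper is a placeholder for the FFmpeg work described above; nothing here is our production code.

```typescript
import amqp from "amqplib";

const QUEUE = "video-processing"; // same hypothetical queue as the API side

// Placeholder for the real work: read the task from the database, run FFmpeg,
// upload the resulting chunks to storage, and update the task's status.
async function processVideo(taskId: string): Promise<void> {
  console.log(`processing task ${taskId}`);
}

export async function startWorker(): Promise<void> {
  const connection = await amqp.connect("amqp://localhost"); // placeholder URL
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  channel.prefetch(1); // handle one video at a time per worker

  await channel.consume(QUEUE, async (msg) => {
    if (!msg) return;
    const { taskId } = JSON.parse(msg.content.toString());
    try {
      await processVideo(taskId);
      channel.ack(msg); // success: the message leaves the queue for good
    } catch {
      channel.nack(msg, false, true); // failure: requeue so another worker retries
    }
  });
}
```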

You’re probably thinking: “Hang on, there’s Kubernetes in your article title but not in the schema”, or “It’s not scalable as is. If there are a lot of videos, it will crash.”

I’m getting there! The schema I presented was just an introduction. I am gradually introducing each concept.

There are indeed a couple of issues here. With only one API instance and one worker, the system would quickly be overloaded, especially since FFmpeg requires a lot of resources. Let’s spice it up!

Custom-made schema representing the more complex architecture with Kubernetes.

Alright, we’ve introduced Kubernetes into the mix. The NodeJS API handling user calls receives many HTTP requests with user files, so instead of a single API instance, we now run a variable number of Kubernetes pods. We have set it up to auto-scale: if the pods’ RAM or CPU usage reaches a specific limit (70% of their capacity), Kubernetes launches a new pod for this same API.
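
For reference, a HorizontalPodAutoscaler that behaves this way could look roughly like the manifest below. The names, replica bounds, and the memory metric are assumptions; only the 70% threshold comes from what I described.

```yaml
# Hypothetical HPA for the upload API: keep between 2 and 20 pods,
# adding pods when average CPU or memory utilization passes 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: upload-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: upload-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```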

There is no “Kubernetes node” in the schema, to keep it simple, but scaling also happens across nodes: if the number of pods (actually the capacity the pods request) exceeds what the current node can provide, a new node is auto-scaled and new pods start launching inside that node.

The load balancer in front of the API distributes the HTTP calls randomly across all the existing pods to split the load.

The worker polling RabbitMQ also auto-scales, but not on resources; it scales on the number of messages waiting in the queue. The more messages are waiting, the faster we need to process them, so launching new workers is the way to go.
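
I won’t detail our exact implementation of that scaling, but one common way to scale a deployment on RabbitMQ queue length is KEDA’s rabbitmq trigger. The sketch below shows that generic approach with made-up names and thresholds; it is not necessarily what we run.

```yaml
# Hypothetical KEDA ScaledObject: add a worker pod for roughly every
# 5 messages waiting in the queue, up to 30 pods.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-worker
spec:
  scaleTargetRef:
    name: video-worker          # the worker Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: video-processing
        mode: QueueLength       # scale on the number of waiting messages
        value: "5"
        host: amqp://user:password@rabbitmq:5672/  # placeholder connection
```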

It’s much better, but we want to save costs when possible, so let me introduce preemptible nodes!

Preemptible VMs are Compute Engine VM instances that are priced lower than standard VMs and provide no guarantee of availability — Google Cloud documentation.

Custom-made schema representing the more complex architecture with Kubernetes and preemptible nodes.

We want our users to have the best experience, but it’s perfectly fine for us if they don’t have access to the best video quality in the adaptive bitrate streaming format right as they upload their content. It can take a few minutes for a preemptible node to become available, but it is cheaper than a normal node.

In this new schema, we transformed our NodeJS worker with FFmpeg into what we internally call a “spawner”. It still polls the RabbitMQ queue, but instead of processing the video itself, it launches a new Kubernetes pod on a preemptible node, and this new pod does the processing.
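
To give an idea of what “launching a pod” means in code, here is a hedged sketch using the official @kubernetes/client-node library to create a one-off Job pinned to GKE’s preemptible nodes. The namespace, image, resource values, and backoff policy are invented, and the exact client method signature can vary between library versions.

```typescript
import * as k8s from "@kubernetes/client-node";

// Illustrative spawner logic: launch one Kubernetes Job per video task on
// the preemptible node pool, then let the Job's pod do the heavy FFmpeg work.
export async function spawnProcessingJob(taskId: string): Promise<void> {
  const kubeConfig = new k8s.KubeConfig();
  kubeConfig.loadFromDefault(); // in-cluster or local kubeconfig
  const batchApi = kubeConfig.makeApiClient(k8s.BatchV1Api);

  await batchApi.createNamespacedJob("video-processing", {
    apiVersion: "batch/v1",
    kind: "Job",
    metadata: { name: `process-video-${taskId}` },
    spec: {
      backoffLimit: 3, // retry if the pod fails or the node gets preempted
      template: {
        spec: {
          restartPolicy: "Never",
          // GKE labels preemptible nodes, so this keeps the pod on the cheap pool.
          nodeSelector: { "cloud.google.com/gke-preemptible": "true" },
          containers: [
            {
              name: "ffmpeg-processor",
              image: "registry.example.com/ffmpeg-processor:latest", // hypothetical image
              env: [{ name: "TASK_ID", value: taskId }],
              resources: { requests: { cpu: "2", memory: "4Gi" } },
            },
          ],
        },
      },
    },
  });
}
```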

This way is cheaper. We moved all our resource-intensive tasks onto preemptible nodes, so we don’t need to scale our spawner as much on normal-priced nodes.

Google Cloud can kill preemptible nodes when it needs the resources for standard VMs. We must ensure that a failed task goes back to the queue and is picked up again by another node later.

Terminated when Compute Engine requires the resources to run standard VMs — Google Cloud documentation

This new setup with preemptible nodes adds implementation overhead but brings a significant cost improvement. We reached a point where we can scale horizontally almost indefinitely.

Second step: Video playback

Alright, at this point in the process, we have uploaded and transformed our user’s video so it can be streamed on our site. A video can be hundreds of megabytes, split into thousands of chunks.

The HTML video tag does not support adaptive bitrate streaming formats like HLS or MPEG-DASH by default, so we must use a custom player. The two most used players for streaming are HLS.js and Shaka Player. They both rely on Media Source Extensions (MSE) to handle these formats. I won’t go into more detail about players and MSE, as it’s not the goal of this article; you can read more by clicking on the link I provided.
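
As a minimal illustration of what “using a custom player” means, here is roughly how a page can attach an HLS playlist to a video element with hls.js; the playlist URL is a placeholder.

```typescript
import Hls from "hls.js";

const video = document.querySelector("video") as HTMLVideoElement;
const playlistUrl = "https://cdn.example.com/videos/123/index.m3u8"; // placeholder

if (Hls.isSupported()) {
  // Browsers with Media Source Extensions: hls.js fetches and feeds the chunks.
  const hls = new Hls();
  hls.loadSource(playlistUrl);
  hls.attachMedia(video);
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari and iOS play HLS natively, no extra library needed.
  video.src = playlistUrl;
}
```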

To avoid loading the video files from cloud storage every time someone plays the video, and therefore paying the cloud provider a lot of money, we use a Content Delivery Network (CDN).

Custom schema showing the process of using a CDN.

The user lands on the site to stream the video. All calls to retrieve the video go through our CDN provider, Cloudflare. A CDN caches content at the edge, which means that if the content you are requesting is not present in its cache, it will request it from the origin URL you provided.

The “edge” part means that, depending on where you are in the world, the CDN stores the content on a server close to you (regionally), so that if another user in your region asks for the same content, they get it blazingly fast. If a user on the other side of the world asks for the same content, the CDN does the same thing and stores it close to that user.

Each video chunk is a separate file, so the waiting time for the unlucky user who has to populate the cache is still relatively low; there is no need to download hundreds of megabytes at once.
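
How long Cloudflare keeps a chunk at the edge is mostly driven by the caching headers the origin returns. We serve the files from cloud storage, but as a generic illustration, here is what setting those headers could look like on an Express origin; the paths and durations are made up.

```typescript
import express from "express";

const app = express();

// Illustrative only: chunks never change once produced, so they can be cached
// for a very long time; playlists get a short lifetime so updates propagate.
app.use(
  "/videos",
  express.static("public/videos", {
    setHeaders: (res, filePath) => {
      if (filePath.endsWith(".ts")) {
        res.setHeader("Cache-Control", "public, max-age=31536000, immutable");
      } else if (filePath.endsWith(".m3u8")) {
        res.setHeader("Cache-Control", "public, max-age=60");
      }
    },
  })
);

app.listen(3000);
```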

I skipped some essential parts of our architecture, like micro-services, WebSockets, Redis pub/sub, and webhooks, to keep the focus on Kubernetes’ auto-scaling capabilities combined with RabbitMQ’s asynchronous queues. I’ll probably write another article about “communication” in our architecture soon.

YouTube probably has an implementation that differs significantly from ours, but this is already a good look at what a complex system built on Kubernetes can look like.

I would love to be a little mouse and peek at YouTube’s complete architecture to see how far we are from them. I might have to contact Ratatouille’s creators to do so; it’s a true story, right?

Senior lead developer, explaining complex concepts in simple words. Get a peek behind the curtain of production-ready architecture and its struggles.