Cloud Buckets Don’t Have to Be an Upload-and-Forget Mess

Why does no one think of deleting files from buckets? Today, we’ll discover how we cleaned up our cloud storage Roomba style!

Alexandre Olive
ITNEXT


Multiple buckets overflowing with files, as a Van Gogh drawing — DALL-E generated (AI)

Like every child, I hated cleaning my bedroom. It was always a fight with my mom to get it done, and it almost always ended with me stuffing everything into my closet. TADA, it’s clean! Please don’t open the closet. She probably knew… sorry, mom.

Today, it’s the same, except it’s not my mom; it’s my boss, and it’s not a bedroom; it’s cloud buckets. And the worst part is that it’s not even my mess. I inherited it when I joined the company. What kind of inheritance is that?

I have been a developer for more than ten years, and every company I have worked for made the same mistake.

Files get uploaded and linked to an entity in a database. But when a user removes the entity, the file stays in the bucket without being deleted. It gets lost in a sea of other unlinked files, never to be thought of again. At least they’re keeping each other company.

Then the application gets bigger, and you end up with storage cost issues, but also GDPR issues. If you get audited and are still storing users’ files, that’s a hefty fine coming your way.

100 TB. That was the size of our buckets at my current company when I joined. I couldn’t tell what half of it was, and all the developers who built the application were gone. I proposed burning it all down, but that was, weirdly, denied. Some people even said I’d get sued — always such an overreaction.

So, in today’s article, I want to talk about our adventure in setting up an archiving process for existing files and all new files going forward. We will discuss storage classes, lifecycle rules, and automated tasks.

The current situation

The current situation is dire, and it’s getting worse as days pass. We need to take care of it.

Google Cloud Storage bucket size details

We have three main buckets:

  • processing-outputs: This is where we store everything produced by our FFmpeg processing: normalized videos, thumbnails, cut or cropped videos, the final version, etc.
  • upload: As the name says, it’s where we store files uploaded from users.
  • streaming: where we store final streaming assets like adaptive bitrate video formats and other assets used in our player.

The icing on the cake is that we have a processing-tmp bucket, supposedly for temporary files, that is 2.63 TiB… And it’s only growing!

That’s not the definition of temporary in my book. And I’m not even talking about the subtitle-tmp bucket, which has the same problem.

What’s important to look at is how fast those buckets are growing. Over the first six months of 2023, on average:

  • the processing-outputs bucket gained one new TiB each month.
  • the upload bucket gained 0.8 new TiB each month.
  • the streaming bucket gained 0.2 new TiB each month.

Set up a battle plan

Pressing the “delete all” button is sadly not an option, so we have to clearly define the rules we want to orchestrate to get rid of this mountain of useless files.

We have different kinds of rules to set:

  • Direct delete. I shouldn’t have to say it in an article, but here we are: if you delete an entity linked to a file, you should also delete the file from storage (a small sketch follows this list).
  • Product-defined rules. Some files become useless after some time based on a set of rules. Hence, we need to have an asynchronous task archiving or deleting them.
  • Number of reads. We deal with videos for products; those products go out of sale after some time, so they will not be played anymore. We still can’t delete them, since they might occasionally be accessed through the back office. So we want to change their storage class to pay less.
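To give the first rule a concrete shape, here is a minimal sketch of a direct delete, assuming a Python service using the google-cloud-storage client; the db access layer and field names are hypothetical:

```python
from google.cloud import storage

storage_client = storage.Client()

def delete_entity(db, entity_id: str) -> None:
    """Delete the entity row and the file it points to, in one place."""
    entity = db.get_entity(entity_id)          # hypothetical data-access layer
    bucket = storage_client.bucket("upload")   # the user-upload bucket
    blob = bucket.blob(entity.file_path)       # object path stored alongside the entity

    # Delete the file first, then the database row, so we never silently
    # end up with an orphaned file nobody remembers.
    if blob.exists():
        blob.delete()
    db.delete_entity(entity_id)
```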
Photo by Stijn Swinnen on Unsplash

There are two things I need to explain before getting to our Roomba bucket cleaning implementation: storage classes and lifecycle rules.

What is a Storage class?

When I talk about “archiving”, I mean changing the storage class of the files so that they cost less money to store.

The storage class is a way of telling the cloud provider how frequently a file will be accessed. The less frequent the access, the cheaper the storage. But to balance it out, you will pay more for each retrieval.

There are three main storage classes (the details can differ between cloud providers, so I won’t get into them):

  • Standard: That’s the default one, where you pay the most for storage and the least for file retrieval.
  • Infrequent access: You pay a little less for storage and a little more for retrieval.
  • Archive (Glacier): The storage cost is almost nothing, but you pay much more for retrieval. On AWS, there is also a delay before you can retrieve the files (12h or 24h).
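To make “archiving” concrete, here is a minimal sketch of changing one object’s storage class with the Google Cloud Storage Python client (bucket and object names are made up for the example):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("processing-outputs")
blob = bucket.get_blob("videos/rush-1234.mp4")  # hypothetical object path

# Rewrites the object in place with the new storage class.
# Valid GCS classes: STANDARD, NEARLINE, COLDLINE, ARCHIVE.
blob.update_storage_class("ARCHIVE")
```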

What are Lifecycle Rules?

As the name says, cloud providers allow users to manage the life of their files through a set of rules. Those rules are configured by users directly on the buckets and are run automatically by the cloud provider.

You can, for example, change the storage class after a defined period, or automatically delete files that are in a specific storage class.

You can also implement rules based on custom metadata you set on the files for your use cases.
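As an illustration, here is a sketch of what such rules can look like with the GCS Python client; the ages and classes are made up for the example:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("processing-outputs")

# Move objects to a cheaper class 30 days after creation...
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
# ...and delete anything that has been sitting in ARCHIVE for a year.
bucket.add_lifecycle_delete_rule(age=365, matches_storage_class=["ARCHIVE"])

bucket.patch()  # persist the new rules on the bucket
```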

Let’s get rid of those files!

One example of files we want to archive: our application receives users’ video rushes, which are later edited into a final video. This process can take some back-and-forth between users and clients, but once the client validates the video, we won’t need the rushes anymore — but we still want to keep them for a few months, just in case.

Our goal is to archive the videos for three months, remove access from the back office for that period, and, if no user complains or asks for the rushes, delete them completely.

So, we created a CRON: a task that runs at a defined time (for example, every 30 minutes). We called it the archive cron — how unoriginal!

Sequence diagram of the archive-cron calling video API and cloud storage

This CRON will call our API to retrieve the list of videos that need to be archived. The API implements the product-defined rules; the CRON only knows the endpoint to call.

It will then set the storage class to “Archive” for all the files sent back by the API. It also sets a custom time metadata value that we will later target for deletion.
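A rough sketch of that step, assuming the API returns a list of GCS object paths (the bucket name and function are illustrative):

```python
import datetime
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("processing-outputs")

def archive_files(object_paths: list[str]) -> None:
    """Archive every file returned by the API and stamp it for later deletion."""
    now = datetime.datetime.now(datetime.timezone.utc)
    for path in object_paths:
        blob = bucket.get_blob(path)
        if blob is None:
            continue  # already gone, nothing to archive
        blob.update_storage_class("ARCHIVE")
        # customTime is the metadata field the deletion lifecycle rule will later match on.
        blob.custom_time = now
        blob.patch()
```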

Once this is done and validated, it will remove the entity from the database to prevent any access from the back office, since retrieving archived files would cost us a lot of money.

The important part is that we store a JSON file in another bucket containing all the information needed to roll back and restore the previous state between the file and the database if anything unexpected happens.
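A sketch of that rollback manifest, assuming a dedicated bucket and an arbitrary JSON layout; every name here is illustrative:

```python
import datetime
import json
from google.cloud import storage

client = storage.Client()
rollback_bucket = client.bucket("archive-rollback")  # hypothetical bucket name

def write_rollback_manifest(run_id: str, archived: list[dict]) -> None:
    """Store everything needed to restore the file/database link if the run goes wrong."""
    manifest = {
        "run_id": run_id,
        "archived_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        # one entry per file: entity id, table, object path, previous storage class
        "files": archived,
    }
    blob = rollback_bucket.blob(f"runs/{run_id}.json")
    blob.upload_from_string(json.dumps(manifest), content_type="application/json")
```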

Finally, we configured a lifecycle rule on our buckets to delete any file that has been in the archive storage class for more than three months. The custom time metadata allows us to delete only the archived files we tagged, and not other archived files we might want to keep.
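That combination can be expressed with the daysSinceCustomTime lifecycle condition; here is a sketch with the Python client, assuming a reasonably recent client version:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("processing-outputs")

# Delete objects that are in ARCHIVE *and* whose customTime is older than 90 days.
# Files without a customTime are left alone, so other archives stay safe.
bucket.add_lifecycle_delete_rule(
    days_since_custom_time=90,
    matches_storage_class=["ARCHIVE"],
)
bucket.patch()
```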

Photo by Thomas Bormans on Unsplash

The battle is never without an ambush.

The result we reached in the end is precisely what we’d hoped for, but we clearly faced issues along the way.

Wrong file, wrong file!

We archived the wrong files. OOPS!

That was bound to happen; our database used the same file URL in two different tables, and we were unaware of it. So when we archived a file, we only removed its reference in one table. Every call to the video from the other table then cost way more than usual — remember at the beginning when I said that all the developers who built the app were gone?

So we got a small surprise line in our billing report: “Archive Storage Europe Multi-region: 468.64€”. Nothing too crazy compared to what we were already paying, and we fixed it as soon as we saw the bill.

Data consistency

The second issue was data consistency in our database. The application has some years under its belt, and the data has evolved.

We worked from the latest data, but what we didn’t know was that older videos were missing some crucial information, like the validation event log we used to determine whether a video had been validated and when.

So our cron would get stuck on “no more videos to process” when almost nothing had actually been processed.

We had to do some technical archeology to understand why. This happened more than once.

Bucket size dropped significantly once the lifecycle rule hit

We nuked the useless files from our buckets. How good does this graph look when your goal is to reduce the size of your buckets? Pretty good, I must say.

Do you want to know something terrible? After all this time spent cleaning our main buckets, we still have not taken care of our fake temporary buckets with 2.63 TiB inside. *Starts creating the ticket to take care of that as soon as possible*

My mom used to say: “It’s easier to never let your room get dirty than to have to clean it all once a month when I’m mad.” My dumb child mind would never listen and would start the mess over again a month later, but she was right — as always!

Thank you for reading this article until the end. If you liked it, please don’t hesitate to follow me on X (Twitter) or add me on LinkedIn.


Senior lead developer, explaining complex concepts in simple words. Get a peek behind the curtains of production-ready architecture and struggles.