We recently started a small project to clean up how parts of our systems communicate behind the scenes at Buffer.
Quick context: we use something called SQS (Amazon Simple Queue Service). These queues act like waiting rooms for tasks. One part of our system emits a message and another part picks it up and processes it later. Think of it like leaving a note for a colleague: “Hey, when you get a chance, process this data.” The system sending the note does not have to wait for a response.
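The pattern can be sketched in a few lines. Here Python's in-process `queue.Queue` stands in for SQS (real code would go through an SQS client); the message shape and function names are illustrative, not from our codebase:

```python
# A minimal sketch of the fire-and-forget queue pattern.
# queue.Queue stands in for SQS; the payload is made up for illustration.
import queue

tasks = queue.Queue()

def emit(message: dict) -> None:
    """Producer: drop the message in the queue and move on -- no waiting."""
    tasks.put(message)

def worker_step(processed: list) -> None:
    """Consumer (normally a separate worker process): pick up one message later."""
    message = tasks.get()
    processed.append(message)  # stand-in for the real processing
    tasks.task_done()

emit({"task": "process-data", "user_id": 42})  # the sender returns immediately

records = []
worker_step(records)
print(records)  # [{'task': 'process-data', 'user_id': 42}]
```

The key property is the decoupling: the producer has no idea when, or even whether, a consumer will handle the message — which is exactly why a consumer can quietly stop mattering without anyone noticing.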
Our project was to perform routine maintenance: updating the tools we use to test queues locally and cleaning up their configuration.
But while figuring out which queues we actually use, we found something we didn’t expect: seven different background processes (or cron jobs, which are scheduled tasks that run automatically) and workers that had been running silently for up to five years. They all do absolutely nothing useful.
Here’s why it’s important, how we found it, and what we did about it.
Why this is more important than you think
Yes, running unnecessary infrastructure costs money. I did a quick calculation: for one of these workers alone, we would have paid about $360 to $600 over a five-year period. That’s a modest amount in the grand scheme of our finances, but definitely a waste for a process that accomplishes nothing.
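For a sense of where a number like that comes from, here is a back-of-the-envelope estimate. The hourly rates are illustrative guesses for a small always-on container, not our actual infrastructure pricing:

```python
# Rough cost estimate for one idle, always-on worker.
# The $/hour rates are assumptions for illustration, not real billing data.
hours_per_month = 730  # average hours in a month
low_rate, high_rate = 0.008, 0.014  # assumed $/hour for a small container

monthly_low = low_rate * hours_per_month    # roughly $6/month
monthly_high = high_rate * hours_per_month  # roughly $10/month
five_years_low = monthly_low * 12 * 5
five_years_high = monthly_high * 12 * 5
print(f"${five_years_low:.0f} to ${five_years_high:.0f} over five years")
```

A few dollars a month rounds to nothing on any single invoice, which is part of why these processes survive so long unnoticed.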
However, after doing this cleanup, I would argue that the financial cost is actually the smallest part of the problem.
Every time a new engineer joins the team and explores our systems, they come across these mysterious processes. “What is this worker doing?” becomes a question that eats up onboarding time and creates uncertainty. We’ve all been there: staring at a piece of code and being afraid to touch it because maybe it does something important.
Even “forgotten” infrastructure needs occasional attention: security updates, dependency upgrades, compatibility fixes when something else changes. That meant our team was spending maintenance cycles on code paths that served no purpose.
And over time, institutional knowledge diminishes. Was that worker critical? Was it a temporary solution that became permanent? The person who created it left the company years ago, and the context left with them.
How can this even happen?
It’s easy to point fingers, but the truth is that this happens naturally in any long-lived system.
A feature becomes deprecated, but the background job that supported it continues to run. Someone spins up a worker “temporarily” to handle a migration, and it never gets torn down. A scheduled task becomes redundant after an architecture change, but no one thinks to check on it.
We used to send birthday emails at Buffer. To do this, we ran a scheduled task that scanned the entire database for birthdays matching the current date and sent a personalized email to those customers. During a redesign in 2020, we switched transactional email tools but forgot to remove this worker – it continued to run for five years.
This is not a matter of individual failures, but of process failures. Without intentional cleanup built into the way we work, entropy wins.
How our architecture helped us find it
Like many companies, Buffer joined the microservices movement (a popular approach in which companies split their code into many small, independent services) years ago.
We split our monolith into separate services, each with its own repository, deployment pipeline, and infrastructure. It made sense then: each service could be deployed individually, with clear boundaries between teams.
But over the years, we realized that the hassle of managing dozens of repositories outweighed the benefits for a team our size. That’s why we consolidated it into a single multi-service repository. The services still exist as logical boundaries but live together in one place.
This turned out to be what made the discovery possible.
In the world of microservices, each repository is its own island. A forgotten worker in one repo may never be noticed by engineers working in another repo. There is no single place to look for queue names and no unified view of what is running where.
With everything in one repository we could finally see the full picture. We were able to trace each queue back to its consumers and producers. We could see queues with producers but no consumers. We were able to find workers pointing to queues that no longer existed.
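This kind of audit becomes an almost mechanical exercise in a monorepo: scan every file for queue-name references and bucket them into producers and consumers. The sketch below is a simplification with assumed call patterns (`send_message` / `receive_message` with a `queue="..."` argument), not what our codebase actually looks like:

```python
# Sketch of a monorepo queue audit: find every queue name referenced in code
# and flag queues that have producers but no consumers (or vice versa).
# The regexes and call names are assumptions for illustration.
import re
from collections import defaultdict
from pathlib import Path

PRODUCE = re.compile(r'send_message\(.*?queue="([\w-]+)"', re.S)
CONSUME = re.compile(r'receive_message\(.*?queue="([\w-]+)"', re.S)

def audit(repo_root: str) -> dict:
    """Map each queue name to the files that produce to / consume from it."""
    usage = defaultdict(lambda: {"producers": [], "consumers": []})
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name in PRODUCE.findall(text):
            usage[name]["producers"].append(str(path))
        for name in CONSUME.findall(text):
            usage[name]["consumers"].append(str(path))
    return usage

def orphans(usage: dict) -> list:
    """Queues missing either side of the conversation are cleanup candidates."""
    return [q for q, u in usage.items() if not u["producers"] or not u["consumers"]]
```

The same script is essentially impossible to write across dozens of separate repositories, because no single checkout contains all the producers and consumers at once.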
Consolidation wasn’t supposed to help us find zombie infrastructure – but it made that discovery almost inevitable.
What we actually did
After identifying the orphaned processes, we had to decide what we wanted to do with them. This is how we proceeded.
First, we traced each one back to its origin. We dug through Git history and legacy documentation to understand why each worker was created in the first place. In most cases, the original purpose was clear: a one-time data migration, a feature that was discontinued, a temporary workaround that outlived its usefulness.
Then we confirmed that they were indeed unused. Before we removed anything, we added logging to make sure these processes weren’t silently doing something important that we had missed. We monitored for a few days to confirm they weren’t being invoked at all, and then removed them gradually. We didn’t delete everything at once. We removed the processes one at a time and watched for any unexpected side effects. (Fortunately, there were none.)
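The “instrument before deleting” step can be as simple as wrapping each suspect worker’s entry point so that every invocation leaves a log line, then watching the logs for a few days. This is a generic sketch; the logger name and worker function are made up for illustration:

```python
# Sketch: wrap a suspect worker so every invocation is logged.
# If the log line never appears, the worker is truly idle and safe to remove.
import functools
import logging

logger = logging.getLogger("zombie-audit")

def trace_invocations(worker_name: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Loud on purpose: WARNING level so it stands out in log aggregation.
            logger.warning("suspect worker %s invoked", worker_name)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@trace_invocations("birthday-email-worker")  # hypothetical worker name
def handle_message(message):
    ...  # original worker logic, untouched
```

Silence in the logs is evidence, not proof — which is why we still removed things one at a time afterwards.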
Finally, we documented what we learned. We added notes to our internal documents about what each process originally did and why it was removed, so future engineers won’t have to wonder whether something important was lost.
What changed after the cleanup
We’re still in the early stages of fully measuring the impact, but here’s what we’ve seen so far.
Our infrastructure inventory is now accurate. When someone asks, “Which workers are we running?”, we can actually answer with confidence.
Onboarding conversations have also become easier. New engineers don’t stumble upon mysterious processes and wonder if they’re missing context. The codebase reflects what we actually do, not what we did five years ago.
Treat refactors as archaeology and prevention
My biggest takeaway from this project: Every significant refactor is an opportunity for archaeology.
When you’re deep into a system and truly understand how the parts work together, you’re in the perfect position to question what else is needed. This queue from an old project? The worker someone created for a one-time data migration? The scheduled task that references a feature you’ve never heard of? They may still be running.
We will incorporate the following into our process in the future:
- With every refactor, ask: what else does this system touch that we haven’t looked at in a while?
- When a feature is deprecated, trace it down to its background processes, not just the user-facing code.
- When someone leaves the team, document what they were responsible for, especially the things that run in the background.
We still have older parts of our codebase that have not yet been migrated to the single repository. As we continue to consolidate, we are confident we will find more of these hidden relics. But now we are prepared to catch them and to prevent new ones from forming.
When all your code is in one place, the orphaned infrastructure has nowhere to hide.