Refactoring a System While It's Running
Production systems can't be taken offline for a rewrite. Here's how I approach making significant structural changes without breaking things users depend on.
Production systems don't pause so you can fix them. Users are in the middle of something. Data is being written. Background jobs are running. And yet the code needs to change — sometimes significantly.
I've done a lot of this work over the past few years. Not because I enjoy complexity, but because the systems I build actually get used, which means I can't treat refactoring the way a tutorial does. There's no "stop the world, rewrite everything, redeploy." You have to thread the needle while it's spinning.
Here are the patterns I reach for.
Make the schema change additive first
The safest database migration is one that doesn't remove anything. If I need to change how a table works, I add the new column or table first, populate it, and only remove the old structure after the new one is fully trusted.
This sounds obvious, but it has real teeth. When I added soft delete to a task management system, I didn't rip out the existing delete behavior. I added a nullable deletedAt timestamp, wrote a Prisma extension that filters deleted records automatically at the query layer, and let the application code stay almost entirely unchanged. Reads got the new behavior for free. The old hard-delete path still worked during the transition period. Only after I was confident everything was filtered correctly did I go back and wire up the actual delete UI to set the timestamp instead of destroying the row.
The pattern: expand the schema, migrate the behavior, contract the old path. Never contract first.
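A minimal sketch of that sequence, written in Python with sqlite3 rather than the Prisma extension from the real system. The tasks table, the deleted_at column name, and every helper function here are hypothetical; the point is only the ordering: add the nullable column, filter reads in one place, and leave the old delete path working.

```python
import sqlite3
from datetime import datetime, timezone


def expand_schema(conn: sqlite3.Connection) -> None:
    """Expand: add a nullable column. Existing rows and queries keep working."""
    conn.execute("ALTER TABLE tasks ADD COLUMN deleted_at TEXT")


def live_tasks(conn: sqlite3.Connection) -> list[tuple]:
    """Read path: soft-deleted rows are filtered in exactly one place."""
    return conn.execute(
        "SELECT id, title FROM tasks WHERE deleted_at IS NULL ORDER BY id"
    ).fetchall()


def soft_delete(conn: sqlite3.Connection, task_id: int) -> None:
    """New path: mark the row instead of destroying it."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("UPDATE tasks SET deleted_at = ? WHERE id = ?", (now, task_id))


def hard_delete(conn: sqlite3.Connection, task_id: int) -> None:
    """Old path: still functional during the transition, removed last."""
    conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


def demo() -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.executemany("INSERT INTO tasks (title) VALUES (?)", [("a",), ("b",), ("c",)])
    expand_schema(conn)
    soft_delete(conn, 1)   # row stays in the table, hidden from reads
    hard_delete(conn, 2)   # row is gone entirely
    return live_tasks(conn)
```

Both delete paths coexist here, and the read helper gives the same answer whichever one was used. The contract step, dropping hard_delete, is deliberately absent from the sketch because it happens last.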
Run old and new in parallel, with fallback
When the change is more than schema-deep — when you're replacing a whole architectural pattern — you often need both systems alive at once.
I recently replaced how a music catalog app handles playlist editing. The old approach was RSuite modals: open a modal, edit inside it, close. The new approach is an always-mounted sheet provider: the editing surface lives in the component tree at all times, toggled visible by state, never mounted or unmounted mid-session. These are fundamentally different architectures. You can't migrate one component at a time without the app being in an incoherent state.
The way I handled it: build the new provider architecture completely, get it working in isolation, then flip the callsites over in a single branch. During that transition branch, both patterns existed in the codebase. The old modal code was still there, still functional, but no longer being invoked. Once the new sheet-provider pattern was confirmed stable, I removed the old modal code entirely.
The key discipline here is keeping the fallback actually working. It's tempting to let the old code rot during the transition. Don't. If something goes wrong with the new approach, you need to be able to revert without also having to fix broken old code.
Use a flag to switch modes
Some migrations are best done with an explicit on/off switch at the infrastructure level.
I have a living document system that generates briefing files from a mix of sources: Google Drive API calls, synthesized context, state files. The delivery mechanism was originally Drive-only. When I built an alternative delivery path — reading from a local state file instead of hitting the API — I didn't swap it in-place. I added a mode.conf file that the service reads at startup. Set it to drive, it uses the old path. Set it to state_file, it uses the new one.
This isn't feature flags in the product-management sense. There's no dashboard, no user segmentation. It's just an explicit, human-readable switch that makes the active behavior obvious and the transition reversible. When the state-file path was fully validated, I flipped the conf and left the Drive path in place for another few weeks before removing it.
The advantage over a code-level boolean is that you can flip it without a code change or a deploy: edit the file and restart the service, since the mode is read at startup. The advantage over a full A/B system is that it's simple enough to reason about without tooling.
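A sketch of what such a switch can look like in Python. The file format is an assumption (a single mode name on its own line, with blank lines and # comments ignored), as are the function names and the choice to default to the old path when the file is missing. The two delivery functions are placeholders, not the real Drive or state-file code.

```python
from pathlib import Path
from typing import Callable

VALID_MODES = ("drive", "state_file")


def read_mode(conf_path: Path, default: str = "drive") -> str:
    """Return the mode named in the conf file.

    A missing or empty file falls back to the old path. An unknown value
    raises instead of silently picking a mode.
    """
    if not conf_path.exists():
        return default
    for raw in conf_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line not in VALID_MODES:
            raise ValueError(f"unknown mode {line!r} in {conf_path}")
        return line
    return default


def deliver_from_drive() -> str:
    # Placeholder for the original Drive API path.
    return "briefing via drive"


def deliver_from_state_file() -> str:
    # Placeholder for the newer local state-file path.
    return "briefing via state_file"


# Both paths stay registered until the old one is deliberately removed.
DELIVERY: dict[str, Callable[[], str]] = {
    "drive": deliver_from_drive,
    "state_file": deliver_from_state_file,
}


def build_briefing(conf_path: Path) -> str:
    """Call once at startup; the mode is fixed for the life of the process."""
    return DELIVERY[read_mode(conf_path)]()
```

Failing loudly on an unrecognized value is a design choice worth keeping: a typo in the conf file should stop the service at startup rather than quietly run the wrong path.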
Extract blocking work into a separate process
Sometimes refactoring isn't about the data model or the UI — it's about execution model.
A synthesis job I run was originally inline: a FastAPI endpoint received a request, ran the synthesis, and returned. Fine for low volume. But as the job got heavier, it started blocking the event loop. Other requests were stalling.
The refactor moved synthesis into a dedicated systemd worker service. The API endpoint now enqueues a job and returns immediately. The worker picks it up, runs it, and writes the result. The API later serves the result from disk.
This is a significant architectural change — request/response becomes enqueue/poll — but the migration path was clean. I built the worker first, confirmed it worked independently, then changed the API endpoint to use it while keeping the response contract the same. From the outside, the API looked identical. Internally, the work had moved.
The lesson: if you can keep the external contract stable, you have a lot of freedom to change what's happening internally. Clients don't care how you synthesize the result, only that it comes back in the shape they expect.
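To make the enqueue/worker split concrete, here is a stripped-down sketch using only the Python standard library. It leaves out FastAPI and systemd entirely and uses directories on disk as the queue, which is an assumption for illustration and not a description of the real service. All names (enqueue, work_once, get_result, synthesize) are hypothetical.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def enqueue(queue_dir: Path, payload: dict) -> str:
    """API side: write the job to disk and return its id immediately."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    job_id = uuid.uuid4().hex
    tmp = queue_dir / f"{job_id}.tmp"
    tmp.write_text(json.dumps(payload))
    # Rename after writing so the worker never sees a half-written job file.
    tmp.rename(queue_dir / f"{job_id}.json")
    return job_id


def synthesize(payload: dict) -> dict:
    """Stand-in for the heavy synthesis step."""
    return {"summary": payload["text"].upper()}


def work_once(queue_dir: Path, results_dir: Path) -> Optional[str]:
    """Worker side: process one pending job, or return None if there is none.

    Jobs are picked in filename order, which is arbitrary here, not FIFO.
    """
    queue_dir.mkdir(parents=True, exist_ok=True)
    results_dir.mkdir(parents=True, exist_ok=True)
    pending = sorted(queue_dir.glob("*.json"))
    if not pending:
        return None
    job_file = pending[0]
    result = synthesize(json.loads(job_file.read_text()))
    tmp = results_dir / f"{job_file.stem}.tmp"
    tmp.write_text(json.dumps(result))
    tmp.rename(results_dir / job_file.name)
    job_file.unlink()  # remove the job only after its result is safely on disk
    return job_file.stem


def get_result(results_dir: Path, job_id: str) -> Optional[dict]:
    """API side: serve the result from disk, or None while it is still pending."""
    path = results_dir / f"{job_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

In a real deployment the worker would call something like work_once in a loop under a process supervisor, and the API handlers would wrap enqueue and get_result. The heavy call to synthesize never runs inside the request path.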
Accept the transition tax
Every one of these patterns has a cost. During a parallel-run migration, the codebase has more surface area than it should. You're maintaining two things. You have to document what's transitional so future-you doesn't get confused. Cleanup debt accumulates until you close the loop.
I've started treating that debt explicitly. When I open a migration branch, I create a note — sometimes just a comment, sometimes a task — that records what needs to be removed when the transition is complete. The cleanup is a first-class deliverable, not an afterthought.
The alternative — waiting until you can do a clean rewrite with no users depending on the system — is mostly a fantasy. Systems that get used get messy. The question is whether the mess is intentional and managed, or accidental and growing.
Intentional mess you can clean up. The other kind just compounds.