Three years of a production SaaS accumulates a lot. Code written under deadline pressure that nobody went back to clean. Features layered on top of features. Tests that were meant to be written later. Business logic that lives in the head of the engineer who originally wrote it, not in the codebase.
Songplace and its companion platform Curator had all of this. The product worked. The business was growing. But the engineering team was spending a disproportionate amount of time on staging bugs, QA cycles that caught the same class of error repeatedly, and feature releases that required careful surgery rather than confident shipping.
One of our engineers took this on as a focused one-month project. This is what was done and what changed.
The starting point
Songplace is a music distribution and playlist management platform serving artists, labels, and curators. Curator is its companion tool for playlist curators to manage submissions, review tracks, and handle their inboxes at scale. Both platforms were live, had real users, and were generating revenue.
The codebase was approximately three years old. It had grown organically, which is another way of saying that the architecture decisions made in month one were still load-bearing in ways nobody fully understood. The specific symptoms that were slowing the team down:
- Feature releases required manual QA cycles that took 3-5 days and still let bugs through to staging
- Staging environment frequently diverged from production in ways that only surfaced after deployment
- No consistent test coverage: some modules had good tests, others had none, and no policy enforced which
- Code patterns were inconsistent across the codebase because different engineers had written different sections at different times under different pressures
- PRDs existed but were informal; features got built from Slack conversations and Notion notes rather than structured specs
The result was a team that could ship, but could not ship fast. Every release felt like a careful negotiation with the codebase rather than a confident deployment.
Month one: what actually happened
| Week | Focus | Output |
|---|---|---|
| Week 1 | Audit and mapping | Module dependency map, 12 high-risk areas identified, refactoring priority list |
| Week 2 | Test-first coverage | Unit, integration, and edge case tests for all 12 high-risk modules |
| Week 3 | Systematic refactoring | Redundant queries fixed, error handling standardized, business logic relocated |
| Week 4 | Process and handover | Bi-weekly PRD rhythm, test-first release policy, CI enforcement |
Week 1: audit and map before touching anything
The first week was entirely diagnostic. No code was changed.
The engineer used Claude to systematically review every module and build a dependency map of the codebase. This was not a superficial audit. Every file was read and categorised: what it does, what depends on it, what it depends on, and where the undocumented business logic lives.
This produced three outputs by the end of week one:
- A module map showing which parts of the codebase were stable and well-understood versus which were fragile and undocumented
- A list of the 12 highest-risk areas, places where a change was most likely to cause an unexpected regression
- A refactoring priority list ordered by impact vs risk
The AI learning loop started here. Instead of a single engineer making judgment calls about what to prioritise, every prioritisation decision was structured as: here is the module, here is its current state, here are the dependencies, here is the usage pattern, what should we address first and why. The output was documented, not just decided.
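To make that concrete, the prioritisation can be pictured as a scoring pass over the module map. This is an illustrative sketch only: the module names, fields, and weights below are hypothetical, not the team's actual rubric.

```javascript
// Hypothetical impact-vs-risk scoring over module-map entries.
// Weighting risk slightly heavier surfaces the fragile modules first.
function scorePriority(module) {
  return module.impact * 1.0 + module.risk * 1.5;
}

// Illustrative module-map entries (not the real Songplace modules):
const modules = [
  { name: "billing", impact: 5, risk: 4 },
  { name: "playlist-sync", impact: 3, risk: 5 },
  { name: "email-templates", impact: 2, risk: 1 },
];

// The refactoring priority list, highest score first:
const ordered = [...modules].sort((a, b) => scorePriority(b) - scorePriority(a));
console.log(ordered.map((m) => m.name)); // [ 'billing', 'playlist-sync', 'email-templates' ]
```

The exact weights matter less than the fact that every ranking decision has inputs you can point at and argue about, rather than living in one engineer's head.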
Week 2: test policy first, then refactoring
The most important decision of the whole project was made in week two: no code would be refactored until it had test coverage.
This sounds obvious. It almost never happens in practice because it feels like it slows you down. It is actually the thing that made the rest of the month possible.
The engineer wrote tests for the 12 high-risk modules identified in week one before touching a single line of production logic. For each module:
- Unit tests covering the core business logic
- Integration tests covering the interactions with adjacent modules
- Edge case tests specifically for the inputs that had historically caused bugs in staging
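The edge-case tier is the easiest to picture. A minimal sketch, assuming a hypothetical submission validator (the real modules and bug reports are not shown in this article):

```javascript
// Hypothetical validator for track-submission payloads; the empty-string
// and whitespace-only cases are the kind that historically slipped
// through to staging. All names here are illustrative.
function validateSubmission(payload) {
  const title = (payload.title ?? "").trim();
  if (title.length === 0) return { ok: false, error: "title required" };
  if (title.length > 200) return { ok: false, error: "title too long" };
  return { ok: true };
}

// Edge cases drawn from a staging-style bug history:
console.log(validateSubmission({ title: "  " }).ok);       // false (whitespace only)
console.log(validateSubmission({}).ok);                    // false (missing field)
console.log(validateSubmission({ title: "My Track" }).ok); // true
```

The point of the tier is that each test encodes a specific historical failure, so a regression of that failure can never reach staging silently again.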
This was where AI-assisted development made a meaningful time difference. Writing tests is important but tedious work. Using Claude to generate test scaffolding from the existing code, then reviewing and refining those tests rather than writing them from scratch, compressed what would have been two weeks of test writing into four days. The tests were not generated blindly: every test was reviewed, and every edge case was validated against real bug reports from the staging history.
By the end of week two, the 12 highest-risk modules had test coverage. Every subsequent refactoring change had a safety net.
Week 3: systematic refactoring with the AI loop
With test coverage in place, the refactoring sprint began. The process for each module:
- Run the existing tests and confirm they pass; this is the baseline
- Feed the module to Claude with the question: "given this code, this test suite, and this dependency map, what are the specific refactoring opportunities and what is the risk level of each?"
- Address the highest-value, lowest-risk improvements first
- Run tests after each change; if anything breaks, understand why before continuing
- Document the change and the reasoning in the commit message
The three biggest categories of improvement from week three:
Redundant database queries. Several modules were making 3-5 database calls for data that could be fetched in one query or cached. This was not visible in development but showed up as latency under real load. The refactoring brought these down to single queries with appropriate caching.
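The before/after shape of that fix can be sketched with an in-memory stand-in for the database that counts round trips. Everything here is illustrative; the real queries and schema are not shown.

```javascript
// Fake "database" that counts round trips so the difference is observable.
const rows = new Map([[1, "Artist A"], [2, "Artist B"], [3, "Artist C"]]);
let queryCount = 0;

const db = {
  findArtist(id) { queryCount += 1; return rows.get(id); },                     // one round trip per call
  findArtists(ids) { queryCount += 1; return ids.map((id) => rows.get(id)); },  // one round trip total
};

// Before: one query per id -- 3-5 round trips for what is one result set.
function artistNamesBefore(ids) {
  return ids.map((id) => db.findArtist(id));
}

// After: a single batched query (a cache in front would cut repeats further).
function artistNamesAfter(ids) {
  return db.findArtists(ids);
}

queryCount = 0;
artistNamesBefore([1, 2, 3]);
console.log(queryCount); // 3
queryCount = 0;
artistNamesAfter([1, 2, 3]);
console.log(queryCount); // 1
```

In development the per-call version looks identical to the batched one; only under real load does the extra latency show up, which is exactly why it survived three years.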
Inconsistent error handling. Different parts of the codebase had different conventions for how errors were caught, logged, and returned. Some used try/catch consistently, some swallowed errors silently, some had duplicated error handling that was slightly different in each location. Standardising this removed an entire class of hard-to-diagnose staging issues.
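One way such a convention can look, sketched with hypothetical names (the article does not show the actual helper): every handler funnels failures through a single wrapper, so errors are logged once and returned in one shape.

```javascript
// Map any thrown error to one consistent response shape.
function toErrorResponse(err) {
  const expected = Number.isInteger(err.statusCode); // hypothetical marker for known errors
  return {
    status: expected ? err.statusCode : 500,
    body: { error: expected ? err.message : "Internal error" }, // never leak internals
  };
}

// Wrap every handler once, instead of ad-hoc try/catch in each route.
function withErrorHandling(handler) {
  return (req) => {
    try {
      return { status: 200, body: handler(req) };
    } catch (err) {
      console.error(err.message); // logged once, in one place
      return toErrorResponse(err);
    }
  };
}

// A handler that throws a known error now gets a consistent response:
const getTrack = withErrorHandling(() => {
  const err = new Error("track not found");
  err.statusCode = 404;
  throw err;
});
console.log(getTrack({})); // { status: 404, body: { error: 'track not found' } }
```

Silent swallowing disappears because there is no longer any path where an error is caught without being logged and shaped.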
Business logic in the wrong layer. Validation logic that should have lived in the service layer was scattered across controllers and components. This meant the same validation was sometimes duplicated, sometimes missed. Moving it to the correct layer and writing tests for it specifically eliminated a category of bug that had appeared in staging repeatedly.
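A sketch of that relocation, with a hypothetical submission shape: validation that was copy-pasted across controllers moves into one service-layer function that every entry point calls.

```javascript
// Single source of truth for submission validation (illustrative fields).
const submissionService = {
  validate(submission) {
    const errors = [];
    if (!submission.trackUrl) errors.push("trackUrl is required");
    if (!submission.curatorId) errors.push("curatorId is required");
    return errors;
  },
  create(submission) {
    const errors = this.validate(submission);
    if (errors.length > 0) throw new Error(errors.join("; "));
    return { ...submission, status: "pending" };
  },
};

// The web controller and any background importer now share one path,
// so the validation is tested once and cannot be missed:
function apiCreateSubmission(body) {
  return submissionService.create(body);
}
```

Because the service is the only way to create a submission, "sometimes duplicated, sometimes missed" stops being a possible state.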
Week 4: PRD process, release cadence, and handover
The last week was not about code. It was about putting a process around the clean codebase so it stayed clean.
Two things were put in place:
Bi-weekly PRD rhythm. Every new feature now starts with a structured PRD written before any code is touched. The template covers: what problem this solves for which user, acceptance criteria expressed as testable conditions, edge cases that need to be handled, and a definition of done that includes test coverage. This is not bureaucracy; it is the thing that prevents the codebase from accumulating the same type of debt that the revamp just cleaned up.
Test-first release policy. No feature goes to staging without test coverage for its acceptance criteria. This is enforced as a CI check, not a code review suggestion. If the tests do not exist, the pipeline does not pass.
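The article does not show the actual pipeline, but a gate of this kind might look like the following GitHub Actions-style sketch; the job names and the acceptance-test location are hypothetical.

```yaml
# Hypothetical CI gate: nothing reaches staging unless tests exist and pass.
name: release-gate
on: [pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage
      # Fail the pipeline if the acceptance-criteria tests are missing.
      # (Illustrative check; real enforcement would map PRD criteria to tests.)
      - run: test -d tests/acceptance || exit 1
```

The important property is that the check lives in the pipeline, not in a reviewer's memory, so it cannot be waived under deadline pressure.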
What changed
| Metric | Before | After |
|---|---|---|
| Release cadence | Monthly | Weekly |
| Staging bugs per release | 5-8 | Near zero |
| QA cycle time | 3-5 days manual | Automated, same day |
| Feature planning lead time | Ad-hoc | 2-week PRD cycle |
| Test coverage (high-risk modules) | Inconsistent | 100% |
Feature release cadence went from monthly to weekly. The team was previously releasing once a month because each release required an extensive manual QA cycle. With the test coverage in place and the inconsistent patterns removed, the automated test suite catches the things that manual QA was spending days looking for.
Staging bugs dropped to near zero. The class of errors that used to surface in staging, mostly from the error handling inconsistencies and the business logic in the wrong layer, stopped appearing. Staging now behaves like production because the code is structured consistently rather than organically.
Feature planning became predictable. The bi-weekly PRD process means the engineering team knows what is coming two weeks out and has clear acceptance criteria for every feature.
The roadmap is in control. When you cannot ship confidently, the roadmap is aspirational. When you can ship weekly with predictable quality, the roadmap becomes a plan you execute against.
What the AI-assisted approach actually contributed
It is worth being specific about this because "AI-assisted development" can mean anything from "I used GitHub Copilot for autocomplete" to claims that are significantly overstated.
| AI-Assisted Task | Benefit | What Still Required a Human |
|---|---|---|
| Test scaffolding from existing code | ~60% faster | Reviewing edge cases, validating against real bugs |
| Module-by-module code review | Surfaced issues a manual review would miss at scale | Prioritisation, product context, risk assessment |
| Documentation generation | ~70% faster | Accuracy review, adding context not in code |
| Refactoring suggestions | Specific, justified, and testable | Engineering judgment on what to prioritise |
Test scaffolding. Generating the initial structure of test files from existing code saved significant time. The engineer still reviewed and refined every test. But starting from a scaffold rather than a blank file changed the speed of the test-writing phase meaningfully.
Code review at scale. No single engineer can hold an entire three-year codebase in their head. Using Claude to review individual modules with specific questions surfaced issues that a manual code review at that scale would have missed.
Documentation generation. Every refactored module now has inline documentation that describes what it does, what its dependencies are, and what edge cases it handles.
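The shape of that inline documentation, sketched as a JSDoc-style comment on a hypothetical module (the function name, dependencies, and edge cases below are illustrative, not from the real codebase):

```javascript
/**
 * Resolves a curator's submission inbox, newest first.
 *
 * Depends on: submissionService (validation), cacheLayer (read-through cache).
 * Edge cases handled: empty inbox, submissions whose track was deleted,
 * pagination cursors that point past the end of the list.
 *
 * @param {string} curatorId
 * @param {{cursor?: string, limit?: number}} opts
 * @returns {{items: object[], nextCursor: string|null}}
 */
function getInbox(curatorId, opts = {}) {
  // ...implementation elided for the sketch; an empty inbox returns:
  return { items: [], nextCursor: null };
}
```

The dependency and edge-case lines are the part a human has to supply; generating the scaffold is where the time saving comes from.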
Refactoring suggestions with justification. The value was not "rewrite this." The value was "here are three specific ways this could be improved, here is why each one matters, here is the risk of each change." That framing kept every change justified and traceable.
What AI assistance did not replace: engineering judgment about what to prioritise, understanding of the product context that determines which edge cases actually matter, and the code review process that ensures the output is correct and appropriate.
The broader point
Technical debt in a SaaS product is not a moral failing. It is the natural result of building something real under real constraints. The question is whether you address it proactively or reactively.
The reactive version looks like: bugs reach production, staging becomes unreliable, engineers spend more time firefighting than building, feature velocity drops, the team gets frustrated, and eventually a rewrite is proposed that takes six months and half-ships.
The proactive version looks like: one month, one engineer, a structured approach, and a codebase that comes out the other end with test coverage, consistent patterns, and a process that keeps it that way.
The AI-assisted approach did not make the work effortless. It made it feasible for one engineer to complete in a month a project that would otherwise have taken a team of three significantly longer.
If your SaaS has accumulated the kind of debt that is starting to slow down your team, book a free 30-minute call. We can usually tell you within that call whether a focused revamp sprint makes sense for your codebase.