The Midnight Hotfix
It's late in the evening. The new release requires application downtime, and the clock on the maintenance window is ticking. The team has worked hard for this release — weeks of development, testing, coordination. The deployment goes through. The application comes back online.
Then a bug surfaces. Customers are affected. Management is watching. Rolling back would mean retracting the release — admitting the version wasn't ready, wasting the maintenance window, and scrambling to schedule another one. That's not happening.
A developer writes the hotfix. A pull request is created. Another developer — tired, under pressure, wanting this to be over — approves it within minutes. It ships. The release stays on track.
The reviewer glanced at the diff, saw that it looked reasonable, trusted the author, and approved. In a complex, unrefactored codebase, even a well-intentioned reviewer cannot meaningfully evaluate a hotfix against tangled, sprawling code — not at this hour, not under this kind of pressure. There are too many variables, too many side effects hiding in oversized files, too many implicit dependencies that no diff can surface.
This is rubber-stamping — approval as a throughput gesture: a quick scan, no real challenge, no attempt to trace impact. It is not unique to post-release hotfixes. It happens in everyday development — large pull requests approved in minutes, complex changes waved through because the team trusts the author or simply has other things to do. But a hotfix pushed at midnight to save a release is where nobody can pretend it didn't happen.
This is the moment the review process exists for — catching errors before they compound — and it is the moment it most consistently fails.
Software errors have consequences — a data leak, a billing mistake that compounds for months, a miscalculation in a safety-critical system. When those consequences are severe enough, companies end up in courtrooms. And when they do, someone will pull up the git history, the PR timestamps, and the approval records.
What happens when the pattern of rubber-stamping code reviews is exposed?
Most incidents never see a trial, but they do trigger investigations — internal, insurer-led, regulatory, or contractual.
Beyond Code Quality
In a previous article, I argued that refactoring aligns code with how the human brain works — chunking, working memory limits, recognition over recall. Refactoring makes code easier to read, reason about, and change because it respects the cognitive constraints of the people who maintain it.
But those same cognitive constraints have another dimension. When code is too complex for a reviewer to genuinely understand, the review process breaks down. And when the review process breaks down, organizations lose something they often don't realize they need until it's too late: a defensible record that someone actually checked the work.
Code reviews and audits exist for a reason. Software errors cost money — financial losses, breach remediation, regulatory fines, litigation. Software errors cost lives — medical devices, aviation, industrial control systems. Reviews are the checkpoint where someone other than the author is supposed to catch mistakes before they reach the people who will pay for them. A different person brings a different perspective — they don't share the author's assumptions, they aren't blind to the same edge cases, and they will read the code as it is written, not as it was intended.
Most organizations believe they are protected because they have a review process in place and pass their audits. But having a process and having an effective process are not the same thing. When something goes wrong and a company faces a lawsuit or regulatory action, it is the depth of that review — not its mere existence — that gets scrutinized. Code that is too complex to meaningfully review is code that creates liability.
Code Review as a Control Expectation
No single law says "you must do code reviews." But across major compliance regimes and assurance frameworks, a common pattern emerges: changes to software must be authorized, evidenced, and independently evaluated to a level commensurate with risk. Code review is one of the most common ways organizations implement that expectation — often alongside automated controls.
The specifics vary by framework. PCI DSS explicitly calls for review of custom code changes by someone other than the author before release to production. FDA software validation guidance and IEC 62304 emphasize independence of review for medical device software — the kind of software where a defect can injure or kill. Other frameworks — SOC 2, NIST SP 800-53 — don't prescribe code review specifically but expect change management controls that most organizations satisfy through peer review.
Then there is SOX. The Sarbanes-Oxley Act does not mandate code review directly, but the internal controls it requires — change management, approvals, segregation of duties, evidence retention — create IT general control programs where peer review becomes the standard implementation.
Across these frameworks, the control objective is the same: independent evaluation of changes commensurate with risk, with evidence to prove it happened.
The Audit Gap
Audits are sampling-based and evidence-driven. They can validate that a control exists and that it operates consistently, but they rarely evaluate whether each reviewed change was meaningfully understood. A SOC 2 Type II audit, for example, tests operating effectiveness through sampling — not just existence — and that is a meaningful level of scrutiny.
But even effectiveness sampling typically checks for artifacts. Was a different person listed as approver? Was there a linked change ticket? Did CI pass? What it does not check is whether the reviewer's cognitive engagement matched the complexity of the change. A 2,000-line diff approved in four minutes produces the same audit artifact as a 50-line diff reviewed over an hour with comments and questions.
The gap is between what sampling can reasonably detect and what meaningful review actually requires. Organizations can pass audits while routinely conducting reviews that satisfy the evidence requirements but provide limited genuine oversight. "We passed the audit" becomes confused with "our review process is effective." These are not the same thing.
The Rubber Stamp as Industry Norm
The conditions that produce rubber-stamping are everywhere. A widely cited industrial study — 2,500 reviews across 3.2 million lines of code at Cisco — found that review quality drops sharply beyond 200–400 lines of code, and that reviewers scanning faster than 500 lines per hour miss a significant percentage of defects. After 60 minutes, diminishing returns become severe.
Most teams do not work in 200-line pull requests. The midnight hotfix from the opening is the extreme case, but the same dynamic plays out on a Tuesday afternoon with a routine feature branch. The reviewer has their own work to do. The PR is big. The code is unfamiliar. Approving it takes two clicks. Understanding it takes an hour. So they scan the file names, maybe read a few lines, and approve.
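Those thresholds make the Tuesday-afternoon dynamic easy to quantify. The sketch below applies the study's numbers (a roughly 400-line effective diff size, a 500 LOC/hour scan ceiling) to a single review. The `ReviewRecord` shape and the `review_flags` helper are invented for illustration, not taken from any real review tool.

```python
from dataclasses import dataclass

# Thresholds from the Cisco/SmartBear study cited above.
MAX_EFFECTIVE_DIFF_LOC = 400      # review quality drops sharply past this size
MAX_EFFECTIVE_LOC_PER_HOUR = 500  # faster scanning misses defects

@dataclass
class ReviewRecord:
    """Illustrative record of one pull-request review (hypothetical shape)."""
    diff_loc: int          # lines changed in the diff
    review_minutes: float  # time between review start and approval

def review_flags(r: ReviewRecord) -> list[str]:
    """Return the ways this review exceeded the measured cognitive limits."""
    flags = []
    if r.diff_loc > MAX_EFFECTIVE_DIFF_LOC:
        flags.append("diff larger than effective review size")
    loc_per_hour = r.diff_loc / (r.review_minutes / 60)
    if loc_per_hour > MAX_EFFECTIVE_LOC_PER_HOUR:
        flags.append(f"scan rate ~{loc_per_hour:.0f} LOC/hour exceeds 500")
    return flags

# The midnight hotfix: 2,000 lines approved in four minutes trips both limits.
print(review_flags(ReviewRecord(diff_loc=2000, review_minutes=4)))
```

A 120-line diff reviewed over half an hour, by contrast, produces no flags at all — which is the whole point: the thresholds are generous for changes sized to human attention.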
Not every team operates this way. Teams with deep domain familiarity, small change sizes, and a culture of thorough review do exist — Google, for instance, has a median of 24 lines modified per change and completes reviews in under 4 hours. But these teams have engineered their process around human cognitive limits, whether they frame it that way or not. Where the conditions for meaningful review are absent — large diffs, time pressure, unfamiliar code — rubber-stamping is the predictable outcome. And audits won't flag it.
Three Types of Exposure
When things go wrong, the consequences fall into three categories — distinct but overlapping.
Compliance exposure is the most familiar. Failure to satisfy framework requirements — PCI DSS, SOC 2, SOX ITGC — results in audit findings, failed attestations, and loss of certifications. This is the world of auditors and remediation plans.
Regulatory exposure carries the force of law. Violations of FDA requirements, SOX statutory obligations, or sector-specific regulations result in enforcement actions, fines, product restrictions, and in some cases personal liability for executives. This is not a failed checklist — it is a government agency with enforcement power.
Liability exposure is the one most organizations think about least until they face it. Often this shows up as contract claims — SLA breaches, indemnity disputes, limitation of liability arguments. But negligence theories also appear in high-harm events, where a plaintiff does not need to prove you violated a specific regulation. They need to prove you didn't take reasonable care, and that your failure caused harm.
These three overlap but are not identical. Passing an audit is persuasive evidence of reasonable care — but it is not dispositive. An organization can check every compliance box and still face liability exposure if a plaintiff demonstrates that compliance was superficial — that the controls existed on paper but didn't function in practice. In regulated industries and in high-damages incidents like data breaches or financial system failures, how software was developed and reviewed can become central to the "reasonable care" narrative.
The Cognitive Case Against Rubber Stamping
In Refactoring is for Humans, I described how chunking, working memory limits, and cognitive load shape the way programmers write and read code. These same constraints apply to reviewers — arguably more so, because a reviewer is working with someone else's mental model, not their own.
The research puts hard numbers on it. The Cisco study found that review effectiveness peaks at 200–400 lines and drops sharply beyond that. Sessions longer than 60–90 minutes yield diminishing returns. Reviewers processing more than 500 lines per hour miss a significant percentage of defects. These are not guidelines — they are measured limits of human cognition applied to code.
The problem is not file size alone — it is code structure. Long methods, oversized classes, tangled dependencies. A change that touches a poorly structured codebase forces the reviewer to trace through layers of implicit coupling to understand what's actually affected. That work is doable, but it takes time and effort that multiply with complexity. Under time pressure — the midnight hotfix, the end-of-sprint merge — stress and urgency further degrade cognitive performance. The reviewer is not being lazy. They are hitting a wall that no amount of diligence can overcome.
Rubber-stamping is not a character flaw. It is a predictable, systematically incentivized outcome when code complexity exceeds human cognitive capacity to review.
Teams can mitigate this with compensating controls — automated tests, CI gates, static analysis, staged rollouts, feature flags. These help, and responsible teams use them. But compensating controls supplement review — they don't replace the regulatory and framework expectation of independent human evaluation. Their presence doesn't eliminate the argument that the review itself was inadequate.
Pair programming is the exception. It is the strongest form of independent code review available — two people reasoning through the code together in real time, catching issues as they emerge rather than after the fact. The independence requirement is satisfied at its core: a second person evaluates every line as it is written, with full context, not scanning a diff hours or days later. Major compliance frameworks accept pairing as independent review, and for good reason — it produces deeper scrutiny than any asynchronous review ever could. Where pair programming is practiced, the compliance case is the strongest it can be.
How a Plaintiff's Lawyer Could Build the Case
This is not a guaranteed playbook — outcomes depend on the industry, the claim, the nature of harm, and what evidence is discoverable. But the evidentiary strategy is straightforward, and the tools are already there.
Version control systems record everything. A plaintiff's lawyer may seek PR approval timestamps, diff sizes, and reviewer activity in discovery — subject to retention policies, relevance, and protective orders. They can demonstrate a pattern: thousands of lines approved in minutes, repeatedly, across months or years.
Expert witnesses make the pattern legible to a jury. A software engineering expert explains what a meaningful review looks like and how long it should take. A cognitive psychologist explains working memory limits and what happens to attention under time pressure. The Cisco study provides the benchmark. The git history provides the evidence.
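The aggregate pattern an expert would chart is a one-pass computation over those records. In this sketch the per-review data (lines changed, minutes to approval) is invented for illustration; in a real matter it would come from the version-control and code-review platform's own exports.

```python
# Each tuple: (lines changed, minutes between review request and approval).
# Invented data standing in for a discovery export.
reviews = [
    (1800, 3), (240, 45), (950, 6), (3100, 5),
    (120, 20), (2200, 4), (60, 10), (1400, 7),
]

def implied_rate_loc_per_hour(loc: int, minutes: float) -> float:
    """Lines per hour the reviewer would have had to process."""
    return loc / (minutes / 60)

# The Cisco study's ceiling: above 500 LOC/hour, defect detection collapses.
flagged = [r for r in reviews if implied_rate_loc_per_hour(*r) > 500]

share = len(flagged) / len(reviews)
print(f"{share:.0%} of sampled reviews exceeded 500 LOC/hour")
```

One chart of that percentage, month over month, is the kind of exhibit a jury can grasp in seconds.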
The narrative writes itself, and it is simple enough for any jury to follow: "They were supposed to check the work. The evidence shows they couldn't have. And now my client was harmed."
How Refactoring Closes the Liability Gap
Smaller, cohesive classes and methods produce smaller, focused pull requests. A change that maps to a single concern fits within what a reviewer can genuinely hold in working memory. Well-named classes and methods communicate intent — the reviewer can leverage recognition over recall, reading the code as a narrative rather than deciphering a puzzle.
In a well-structured codebase, even an urgent fix can be reviewed meaningfully because the scope is contained and comprehensible. The fix touches one method in one class with a clear responsibility. The reviewer can see what changed, understand why, and evaluate whether it's correct — in minutes, not because they're cutting corners, but because comprehension is fast when the code is clear.
Review time goes down. Review quality goes up. The pull request becomes a genuine checkpoint rather than a formality.
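As a concrete illustration, consider a standard Extract Method refactoring. The code below is a generic sketch, not drawn from any real system: in the tangled version an express-fee rule sits midway through a long handler, so a hotfix to the rule drags the whole method into review; in the refactored version the rule has its own small, well-named function, and an urgent fix to it is a one-function diff.

```python
# Before: the fee rule is buried in a long, multi-purpose handler.
# A hotfix to the rule forces the reviewer to re-read everything around it.
def process_order_tangled(order: dict) -> float:
    total = sum(item["price"] * item["qty"] for item in order["items"])
    # ... imagine dozens of lines of shipping, tax, and logging here ...
    if order.get("express"):
        total += 9.99 if total < 50 else 0.0  # fee rule hidden mid-method
    return round(total, 2)

# After: the rule lives in one small, well-named function.
# An urgent fix touches only this function, and the diff says exactly that.
def express_fee(subtotal: float) -> float:
    """Express shipping is free at or above the $50 threshold."""
    return 9.99 if subtotal < 50 else 0.0

def process_order(order: dict) -> float:
    subtotal = sum(item["price"] * item["qty"] for item in order["items"])
    if order.get("express"):
        subtotal += express_fee(subtotal)
    return round(subtotal, 2)
```

Both versions compute the same totals; only the reviewability of a future change differs.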
What does defensible review evidence look like? Reviewer comments or questions — even brief ones. A completed review checklist. CI results linked to the change. Security scan results. A change ticket tying the code to a requirement. For emergency changes, retrospective review documentation. None of this is onerous when the change is small enough to actually understand. One caveat: if reviewers raise risks, track them to closure — a ticket, a mitigation, a follow-up. Review artifacts help most when they show risks being resolved; they hurt when they show risks being identified and ignored.
Your audit trail shifts from "someone approved it" to "someone could have and likely did understand it." That is a fundamentally different position to defend.
Refactoring Under Change Control
An important caveat: refactoring itself is a change. In regulated environments — especially under FDA/IEC 62304 — any change requires impact analysis and regression testing. Refactoring does not get a free pass.
But properly done refactoring is not a separate initiative. It is continuous — part of every commit. When you fix a bug, you leave the code around it a little cleaner. When you add a feature, you improve the structure you're working in. When you change existing behavior, you clarify what was unclear. The Boy Scout Rule — leave the campground cleaner than you found it — applied to every commit. Standalone refactoring initiatives — dedicated projects or sprints to overhaul structure — should be the exception, not the routine. The individual refactoring modifications themselves are best kept in separate, granular commits: one simplification at a time. Small commits are easy to review, easy to revert, and produce a clear audit trail of what changed and why.
This is where automated testing matters most. You can refactor with manual testing, but it is slower, more expensive, and harder to sustain as a continuous practice. With automated tests in place, the equation changes. The test suite serves double duty: it makes refactoring safe by continuously verifying that behavior is preserved, and it generates compliance evidence as a natural byproduct. CI results, test coverage reports, pass/fail records — these become audit artifacts produced automatically, not as additional compliance overhead.
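A minimal sketch of what that double duty looks like in practice: a characterization test pins down a function's observed behavior before restructuring begins, and the same unchanged assertions must pass after every refactoring commit. The `normalize_sku` function and its rules here are hypothetical.

```python
import unittest

# Production code about to be restructured. The name and rules are
# illustrative; the point is that the tests below never change while
# the implementation does.
def normalize_sku(raw: str) -> str:
    """Original implementation, prior to refactoring."""
    return raw.strip().upper().replace(" ", "-")

class SkuCharacterizationTests(unittest.TestCase):
    """Pin down observed behavior BEFORE refactoring. After each small,
    behavior-preserving commit this suite must still pass unchanged,
    and its CI run doubles as audit evidence that behavior was preserved."""

    def test_strips_surrounding_whitespace(self):
        self.assertEqual(normalize_sku("  ab12 "), "AB12")

    def test_uppercases(self):
        self.assertEqual(normalize_sku("ab12"), "AB12")

    def test_inner_spaces_become_dashes(self):
        self.assertEqual(normalize_sku("ab 12"), "AB-12")
```

The green run attached to each refactoring commit is exactly the kind of automatically generated artifact the change-control frameworks above ask for.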
When a reviewer sees a pull request with incremental structural improvements alongside a bug fix or feature — and all tests passing — they can evaluate the whole change with confidence. The refactoring makes the functional change easier to understand, and the tests prove nothing broke. This is a reviewable change.
Refactoring is a compliance investment, not a free action. But with proper test automation, it pays compounding dividends: each commit leaves the codebase a little more reviewable, each review a little more meaningful, and the compliance evidence a little more robust.
The Git History Dimension
A well-structured codebase produces a readable git history. When each file has a clear responsibility, its commit log tells a coherent story — why it changed, when, and in response to what requirement. Tracing from a user story or regulatory requirement to the code that implements it becomes natural.
A monolithic codebase produces noise. When unrelated concerns live in the same file, its history is a tangle of interleaved changes that obscure intent. Finding out why a particular line changed — and whether that change was reviewed and tested — means sifting through commits that touched dozens of things at once.
A clean history strengthens both the compliance audit trail and a legal defense. It is the difference between being able to show exactly what changed and why, and hoping no one looks too closely.
Refactoring as Risk Mitigation
Refactoring is usually framed as a code quality practice. It is that, but it is also evidence of reasonable care.
Well-structured code doesn't just reduce bugs — it makes the entire review and oversight process defensible. The chain is straightforward: refactoring produces reviewable code. Reviewable code gets genuinely reviewed. Genuine reviews produce a defensible audit trail. A defensible audit trail reduces exposure across compliance, regulatory, and liability dimensions.
A cleaner codebase is not immunity. It doesn't guarantee outcomes. But it demonstrates diligence — that the organization invested in making its software comprehensible, its reviews meaningful, and its processes more than ceremonial. That is a fundamentally stronger position than hoping the audit artifacts hold up under scrutiny.
Refactoring is not a substitute for safety engineering; it's a force multiplier for the controls that depend on human understanding.
Compliance is Not the Point
Think back to the evening of the release. The maintenance window. The tired developer approving the hotfix. The pressure to keep moving forward.
Compliance and review processes are not ends in themselves. They exist because software errors cost money — financial losses, breach remediation, regulatory fines, litigation. They exist because software errors cost lives — medical devices, aviation, safety-critical systems. The regulations, the reviews, the audit trails — all of it is machinery built to prevent real harm to real people.
When that machinery is reduced to rubber stamps and checkbox processes — especially in the moments that matter most — it fails its purpose.
Refactoring makes code reviewable. Reviewable code gets genuinely reviewed, even under pressure. Genuine review catches errors. Fewer errors mean fewer people harmed.
Compliance is the framework. Liability is the risk. Prevention of harm is the purpose. Refactoring serves all three.
Disclaimer
The author is a software engineering practitioner, not a lawyer or psychologist. Legal and cognitive science concepts are discussed from a practitioner's perspective to provoke thought and discussion. This is educational information, not legal advice — a disclaimer does not eliminate regulatory obligations or legal exposure. Consult appropriate professionals for specific compliance or liability questions.
References
Regulatory & Legal
1. PCI-DSS v4.0 — Payment Card Industry Data Security Standard. Requirement 6.2.3 addresses code review for custom software prior to release.
https://www.pcisecuritystandards.org/document_library/
2. FDA General Principles of Software Validation — Final guidance on software validation for medical devices, including independence of review and change control.
https://www.fda.gov/regulatory-information/search-fda-guidance-documents/general-principles-software-validation
3. IEC 62304 — Medical device software lifecycle processes. FDA-recognized consensus standard for software verification, including code review.
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfStandards/results.cfm
4. NIST SP 800-53, Rev. 5 — Security and Privacy Controls for Information Systems and Organizations. CM-3: Configuration change control.
https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final
5. NIST SP 800-218 (SSDF) — Secure Software Development Framework. Practice PW.6 explicitly addresses code review as a secure development practice.
https://csrc.nist.gov/pubs/sp/800/218/final
6. PCAOB Auditing Standard No. 5 — An Audit of Internal Control Over Financial Reporting That Is Integrated with An Audit of Financial Statements.
https://pcaobus.org/oversight/standards/archived-standards/pre-reorganized-auditing-standards-interpretations/details/Auditing_Standard_5
7. Sarbanes-Oxley Act, Section 404 — Internal controls over financial reporting, segregation of duties.
https://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act#Background
8. Negligence Per Se — Definition and legal framework from the Legal Information Institute at Cornell Law School.
https://www.law.cornell.edu/wex/negligence_per_se
Code Review Research
9. SmartBear / Cisco Systems Code Review Study — Widely cited industrial study: 2,500 reviews across 3.2 million lines of code. Key findings: 200–400 LOC optimal review size, effectiveness drops above 500 LOC/hour, sessions should not exceed 60–90 minutes.
https://static0.smartbear.co/support/media/resources/cc/book/code-review-cisco-case-study.pdf
10. Sadowski, C. et al. (2018) — "Modern Code Review: A Case Study at Google." ICSE-SEIP 2018. Analysis of 9 million reviewed changes. Median lines modified per change: 24. Median review latency: under 4 hours.
https://www.sback.it/publications/icse2018seip.pdf
11. Bacchelli, A. & Bird, C. (2013) — "Expectations, Outcomes, and Challenges of Modern Code Review." ICSE 2013. Survey of 873 programmers at Microsoft. Less than 15% of review comments address bugs; primary value is knowledge transfer.
https://sback.it/publications/icse2013.pdf
12. Dr. Michaela Greiler — "How Code Reviews Work at Microsoft" — Developer time spent on reviews, tooling, and process.
https://www.michaelagreiler.com/code-reviews-at-microsoft-how-to-code-review-at-a-large-software-company/
13. Dr. Michaela Greiler — "Code Reviews at Google are Lightweight and Fast" — Summary of Google's review practices and metrics.
https://www.michaelagreiler.com/code-reviews-at-google/
14. Meta Engineering (2022) — "Move Faster, Wait Less: Improving Code Review Time at Meta" — P75 review time as north-star metric.
https://engineering.fb.com/2022/11/16/culture/meta-code-review-time-improving/
Cognitive Science
15. Miller, G.A. (1956) — "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information." Psychological Review, 63(2), 81–97.
16. Cowan, N. (2001) — "The Magical Number 4 in Short-Term Memory: A Reconsideration of Mental Storage Capacity." Behavioral and Brain Sciences, 24(1), 87–114.
https://pubmed.ncbi.nlm.nih.gov/11515286/
17. Cowan, N. (2010) — "The Magical Mystery Four: How is Working Memory Capacity Limited, and Why?" Current Directions in Psychological Science, 19(1), 51–57.
https://pmc.ncbi.nlm.nih.gov/articles/PMC2864034/