AI-Assisted Code Review Saved Me Hours — Until It Confidently Broke My App

The Promise Was Real. So Was the Damage.

About three months ago I started routing every non-trivial piece of PHP I write through an AI review step before committing. The efficiency gains were immediate and genuinely impressive. It spotted a missing htmlspecialchars() call, caught an N+1 query I'd glazed over, and suggested a more readable way to chain array operations. Claude handled all of it faster than any colleague could, and without the social overhead of asking for a review on a Saturday afternoon.

Then it broke a file upload feature I'd spent two days hardening. Not dramatically. Quietly. It refactored a MIME-type check in a way that looked cleaner on the surface but silently bypassed a validation step for files without an extension. The change passed my eyes because the AI had explained it so convincingly. That explanation was the trap.

This post isn't a takedown of AI-assisted code review. I still use it every day. But I've developed a much more calibrated sense of when to trust the output and when to treat it as a confident-sounding first draft from a junior developer who has read a lot but deployed very little.

Where AI Code Review Is Genuinely Excellent

Let me be fair first, because the hype isn't entirely wrong. There are categories of feedback where AI consistently outperforms my own self-review:

Security surface scanning: It reliably catches unescaped output, direct user input passed to SQL or shell commands, missing CSRF token checks, and loose file permission assumptions. These are pattern-recognition tasks, and LLMs are pattern machines.
Readability and naming: It has no ego about your variable names. It will tell you $d is a bad name for a date object every single time.
Documentation gaps: Paste in a function with no docblock and ask what a new developer would misunderstand. The answers are usually accurate and humbling.
Catching obvious logical errors: Off-by-one errors, inverted conditionals, unreachable branches — the kind of thing you stop seeing after staring at the same function for an hour.

A study published in 2023 by researchers at Google and DeepMind found that LLM-assisted code review reduced the time developers spent on review tasks by roughly 20% while maintaining comparable defect detection rates for surface-level bugs. That tracks with my experience. The wins are real and consistent in low-ambiguity territory.

Where It Goes Wrong — and Why It's Hard to Catch

The problem isn't that AI gives bad advice. The problem is that bad advice and good advice arrive in the same confident, well-formatted tone. There's no hesitation in the output that signals you should slow down.

Here's a simplified version of what happened to my upload validator. My original check looked like this:

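// Detect the MIME type from the file contents, not from the client-supplied name.
$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime = $finfo->file($_FILES['upload']['tmp_name']);
$allowed = ['image/jpeg', 'image/png', 'image/webp'];

// Strict comparison, so a failed detection (false) can never match.
if (!in_array($mime, $allowed, true)) {
    throw new RuntimeException('Invalid file type.');
}

// Second, independent check on the extension. A name with no dot yields ''.
$ext = strtolower(pathinfo($_FILES['upload']['name'], PATHINFO_EXTENSION));
$allowed_ext = ['jpg', 'jpeg', 'png', 'webp'];

if (!in_array($ext, $allowed_ext, true)) {
    throw new RuntimeException('Invalid file extension.');
}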

The AI suggested consolidating this into a single validation function to "reduce duplication." Its refactored version checked the MIME type correctly. The extension check was the problem. For a filename with no dot, pathinfo() with PATHINFO_EXTENSION returns an empty string. An empty string is not in $allowed_ext, so in my original code such an upload was still rejected. In the refactored version, though, the extension check had been silently moved inside an if block that only ran when a specific condition was met, and files without an extension did not meet it. They slipped through.
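The shape of that gap is easier to see in code. What follows is a simplified, hypothetical reconstruction, not the actual AI output: the names validateUploadFlawed and isRejectedByFlawed are made up, and the `$ext !== ''` guard is only an assumed stand-in for the "specific condition." The MIME type is passed in as a parameter so the sketch runs without a real upload.

```php
<?php
// Hypothetical reconstruction for illustration only.
function validateUploadFlawed(string $mime, string $filename): void
{
    $allowedMime = ['image/jpeg', 'image/png', 'image/webp'];
    $allowedExt  = ['jpg', 'jpeg', 'png', 'webp'];

    if (!in_array($mime, $allowedMime, true)) {
        throw new RuntimeException('Invalid file type.');
    }

    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // The gap: the allowlist is consulted only when an extension exists,
    // so an empty extension is never compared against $allowedExt.
    // (Assumed guard; the real condition may have been different.)
    if ($ext !== '') {
        if (!in_array($ext, $allowedExt, true)) {
            throw new RuntimeException('Invalid file extension.');
        }
    }
}

// Demonstration helper: true when the validator rejects the input.
function isRejectedByFlawed(string $mime, string $filename): bool
{
    try {
        validateUploadFlawed($mime, $filename);
        return false;
    } catch (RuntimeException $e) {
        return true;
    }
}

var_dump(isRejectedByFlawed('image/png', 'avatar.php')); // bool(true)
var_dump(isRejectedByFlawed('image/png', 'avatar'));     // bool(false): slips through
```

Read top to bottom, nothing here looks wrong. The nested if even reads like a sensible "only check what exists" optimization, which is roughly why this kind of change survives a skim.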

Was the AI wrong? Technically, it introduced a logical gap. But it framed the change as a simplification. And because I was in review mode — evaluating rather than writing — I read it the way you read a finished argument, looking for obvious flaws rather than tracing every execution path.

This is the core danger: AI output shifts you from a generative mindset to an evaluative one, and evaluative reading is shallower than generative writing. You're approving, not building.

The Categories of AI Review Mistakes I've Catalogued

After three months of deliberate use and post-mortems on the cases where I caught errors before they shipped, I've grouped AI code review failures into a few reliable buckets:

1. Context Blindness

The AI sees the function you paste. It doesn't see the calling code, the database schema, the cPanel environment quirks, or the business rule that explains why something that looks redundant is actually load-bearing. It will confidently remove "redundant" checks that exist because of a specific edge case you discovered six months ago and never commented.

Rule: Anything the AI flags as redundant gets a manual trace through the full call chain before removal.

2. Framework and Version Assumptions

I write vanilla PHP, often targeting environments where I don't control the PHP version. The AI frequently suggests functions or patterns that are clean and correct in PHP 8.2. On a shared cPanel host running 7.4, they fail: newer syntax is a parse error, and newer functions are fatal "undefined function" errors at runtime. It knows PHP 8.x idioms well. It's less reliable about what degrades gracefully.

Rule: Any suggested function gets a manual check against the PHP version in phpinfo() before I accept it.
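Part of that check can be done in code rather than by eye. This is a minimal sketch, assuming the suggestion in question was PHP 8.0's str_contains(); the helper name runtimeAtLeast is hypothetical, and the fallback is a simplified stand-in rather than a vetted polyfill.

```php
<?php
// Hypothetical helper: does the running PHP meet a minimum version?
function runtimeAtLeast(string $minimum): bool
{
    return version_compare(PHP_VERSION, $minimum, '>=');
}

// str_contains() is built in from PHP 8.0 onward. Define a fallback only
// when the runtime lacks it, so the same file loads on 7.4 and 8.x.
if (!function_exists('str_contains')) {
    function str_contains(string $haystack, string $needle): bool
    {
        // Matches the built-in behaviour of treating '' as always found.
        return $needle === '' || strpos($haystack, $needle) !== false;
    }
}

var_dump(str_contains('avatar.png', '.')); // bool(true)
var_dump(str_contains('avatar', '.'));     // bool(false)
```

A guard like this only helps with missing functions. New syntax such as match fails when the file is parsed, before any runtime check can run, so for that I lint with php -l under the target PHP version.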

3. Security Theater

This one is subtle. AI review often adds the appearance of security without the substance. It suggested I sanitize a value that was about to be passed to password_hash(). That looks harmless, but it isn't quite: a sanitizer that rewrites or strips characters changes the password that actually gets hashed, and it creates the false impression that sanitization is the right model for password handling, rather than hashing raw input. The code became harder to reason about, not safer.

Rule: Security suggestions get evaluated for whether they address the actual threat model, not just whether they pattern-match to "safe-looking code."
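One caveat on the password example is worth checking directly: sanitizing before hashing can change the bytes that get hashed. The sketch below assumes htmlspecialchars() as the sanitizer, which is a stand-in, since the exact call that was suggested isn't the point. The helper names and the sample password are made up.

```php
<?php
// Hypothetical helpers; the names are illustrative.
function hashRawPassword(string $rawPassword): string
{
    // Hash exactly what the user typed: no trimming, escaping, or tag stripping.
    return password_hash($rawPassword, PASSWORD_DEFAULT);
}

function passwordMatches(string $rawPassword, string $storedHash): bool
{
    return password_verify($rawPassword, $storedHash);
}

$typed = 'p@ss<word>&1';          // made-up example password
$hash  = hashRawPassword($typed);

// Assumed sanitizer: htmlspecialchars() rewrites <, > and & as entities.
$sanitized = htmlspecialchars($typed);

var_dump(passwordMatches($typed, $hash));     // bool(true)
var_dump(passwordMatches($sanitized, $hash)); // bool(false)
```

If the sanitizing step is applied at registration but not at login, or the other way round, users with those characters in their passwords are locked out. Escaping belongs at the output boundary, not in front of the hash.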

4. Confident Hallucination on Obscure APIs

Ask it to review code that interacts with a niche cPanel API endpoint or a specific payment gateway webhook format, and it may review it against an API version or behavior it has confidently reconstructed from incomplete training data. It won't tell you it's guessing.

Rule: Any review touching third-party APIs gets cross-referenced with the actual current documentation.

The Workflow That Actually Works for Me

I haven't abandoned AI code review. I've structured it so I get the benefits without the failure modes.

Use it first for the low-ambiguity pass: Security patterns, naming, documentation, obvious logic errors. Accept this feedback with low friction.
Treat structural refactoring suggestions as proposals, not answers: When it suggests reorganizing logic, I rewrite it myself from scratch using its suggestion as an intent statement — not a patch to apply. This forces me back into generative mode.
Always provide explicit context in the prompt: PHP version, hosting constraints, the business rule the function enforces. The more context, the less room for bad assumptions.
End every AI review session with a manual execution trace: Pick the two or three most significant changes and walk every input through every branch. Not a skim. An actual trace.
Keep a failure log: When the AI gets something wrong, I write down the category. This has made me much faster at recognizing the pattern in future sessions.

The Honest Assessment

AI-assisted code review is a productivity tool, not a safety net. Treating it as the latter is where developers get into trouble. The research that shows efficiency gains doesn't show equivalent gains in bug prevention for the subtle, contextual errors that actually cause incidents in production.

What it does is compress the time you spend on mechanical review tasks so you can spend more cognitive energy on the things only you can evaluate: the business logic, the deployment context, the edge cases you discovered the hard way.

The AI is a fast, well-read collaborator with no memory of your specific system and no skin in the game when it ships. Keep that picture in mind every time you read its output.

That framing has made me a better user of these tools. Not more skeptical across the board — that wastes the genuine value. More precise about where to slow down and verify, and more confident about where I can move fast.

The feature that got broken is now fixed, better commented, and has a test case specifically for the no-extension edge case. The AI indirectly made that code more robust. Just not in the way it intended.
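For reference, a regression check for that edge case can be very small. This is a sketch under assumptions, not my production code: uploadError is a hypothetical consolidated validator that takes the detected MIME type and the client filename and returns an error message or null, and the cases are run in a plain loop rather than a test framework.

```php
<?php
// Hypothetical consolidated validator: error message, or null when valid.
function uploadError(string $mime, string $filename): ?string
{
    $allowedMime = ['image/jpeg', 'image/png', 'image/webp'];
    $allowedExt  = ['jpg', 'jpeg', 'png', 'webp'];

    if (!in_array($mime, $allowedMime, true)) {
        return 'Invalid file type.';
    }

    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // Unconditional on purpose: an empty extension must fail like any
    // other value that is not on the allowlist.
    if (!in_array($ext, $allowedExt, true)) {
        return 'Invalid file extension.';
    }

    return null;
}

// Each row: detected MIME type, client filename, expected result.
$cases = [
    ['image/png',  'avatar.png', null],
    ['image/jpeg', 'PHOTO.JPG',  null],
    ['image/png',  'avatar',     'Invalid file extension.'], // the no-extension case
    ['image/png',  'avatar.php', 'Invalid file extension.'],
    ['text/plain', 'notes.png',  'Invalid file type.'],
];

foreach ($cases as [$mime, $name, $expected]) {
    if (uploadError($mime, $name) !== $expected) {
        throw new LogicException("Unexpected result for {$name}");
    }
}

echo "all cases passed\n";
```

The table doubles as the manual execution trace: every row is one input walked through every branch, written down so it runs again on the next refactor.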