LLM Judge Eval Architecture

Two numbers. One question.

When human-human agreement and human-LLM agreement diverge, the gap tells you exactly what to fix.

Take this scenario. A code review assistant for backend engineers. It reviews pull requests and flags bugs, security vulnerabilities, style problems, and performance concerns. Four dimensions: Bug Detection, False Positive Rate, Explanation Quality, Severity Calibration. 600 pull requests scored by two senior engineers and one LLM judge.

The numbers come back:

Human-human Kappa:  0.74
Human-LLM Kappa:    0.42

That pair is the whole story. Everything else is the explanation.

Why the pair matters.

A human-human Kappa of 0.74 is substantial. Two senior engineers looking at the same pull requests, reaching similar conclusions after correcting for chance. The rubric is doing its job at the human level.

A human-LLM Kappa of 0.42 is moderate at best. Below what you'd want for a system informing senior engineering decisions. The gap, 0.32 points, means the LLM judge is interpreting the rubric differently from the humans.

This is not a rubric problem. The rubric works for humans. It's a judge prompt problem.
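For reference, a minimal sketch of how each of those Kappas could be computed for one pair of raters. It assumes plain unweighted Cohen's Kappa. The function name and the toy scores are invented for illustration, and a weighted variant may suit an ordinal 1-4 scale better.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two equal-length lists of scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same score
    # independently, given each rater's own score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (p_o - p_e) / (1 - p_e)

# Toy scores, invented for illustration only (not the article's data).
human_1 = [1, 2, 3, 4, 2, 3, 1, 4]
human_2 = [1, 2, 3, 4, 2, 2, 1, 4]
llm     = [1, 3, 3, 4, 1, 2, 2, 4]

human_human = cohen_kappa(human_1, human_2)
human_llm = cohen_kappa(human_1, llm)
print(f"Human-human Kappa: {human_human:.2f}")
print(f"Human-LLM Kappa:   {human_llm:.2f}")
```

With two human raters, the human-LLM figure can be computed against each human separately or against a consensus score. The article does not say which was used, so treat that choice as open.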

Check the distribution first.

Score distribution on the 1-4 scale: 22% ones, 28% twos, 27% threes, 23% fours.

Well-spread. No clustering. That rules out the class imbalance trap from the previous article. The rubric is discriminating properly. The problem isn't the scale. The problem is translation between the rubric and the judge.
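One way to put a number on "well-spread" is the chance agreement the distribution implies. A rough sketch, assuming both raters' scores follow the same distribution. The skewed distribution is a made-up contrast case, not data from this evaluation.

```python
def chance_agreement(distribution):
    """Probability that two independent raters with this score
    distribution land on the same score by chance."""
    return sum(share * share for share in distribution.values())

# Distribution reported above.
observed = {1: 0.22, 2: 0.28, 3: 0.27, 4: 0.23}
# Hypothetical clustered distribution, for contrast only.
skewed = {1: 0.05, 2: 0.05, 3: 0.10, 4: 0.80}

p_e_observed = chance_agreement(observed)
p_e_skewed = chance_agreement(skewed)
print(f"observed: {p_e_observed:.4f}")  # close to 0.25, the four-level minimum
print(f"skewed:   {p_e_skewed:.4f}")    # most raw agreement would be chance
```

The closer chance agreement sits to 0.25 on a four-level scale, the less room there is for clustering to distort the Kappa.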

The per-dimension breakdown.

                       Human-human   Human-LLM
Bug Detection:           0.81         0.71
False Positive Rate:     0.79         0.38
Explanation Quality:     0.73         0.51
Severity Calibration:    0.62         0.29

False Positive Rate and Severity Calibration have the worst drops: 0.41 and 0.33. Bug Detection is relatively stable, down only 0.10.

Here's why that pattern makes sense. Bug Detection is relatively objective. A bug is either present or it isn't. The LLM can pattern-match for that with reasonable accuracy. Severity Calibration requires contextual judgment: how bad is this bug given the specific system, the deployment environment, the user impact? False Positive Rate requires assessing not just whether a flag was raised, but whether it was warranted.

The LLM is not reading use-case context the way the humans are. It's applying general principles. The humans are applying domain judgment.
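The disaggregation step is simple enough to script. A small sketch using the per-dimension figures from the table. The variable and function names are illustrative.

```python
# (human-human Kappa, human-LLM Kappa) per dimension, from the table above.
kappas = {
    "Bug Detection": (0.81, 0.71),
    "False Positive Rate": (0.79, 0.38),
    "Explanation Quality": (0.73, 0.51),
    "Severity Calibration": (0.62, 0.29),
}

def rank_by_gap(kappas):
    """Return (dimension, gap) pairs, largest human-vs-LLM drop first."""
    gaps = {dim: round(hh - hl, 2) for dim, (hh, hl) in kappas.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

ranked = rank_by_gap(kappas)
for dimension, gap in ranked:
    print(f"{dimension:<22} drop {gap:.2f}")
```

Sorting by the gap rather than by the raw human-LLM Kappa keeps a dimension where the humans themselves disagree from being blamed entirely on the judge.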

The fix direction.

The judge prompt is the lever. Three things to address.

First, the criteria. If the rubric anchor says "flag only issues that would cause functional failures," the judge needs to know what functional failure means in the context of a backend code review system specifically. That's not self-evident. Write it in.

Second, few-shot examples. For Severity Calibration, show the judge two or three pull request examples with scores and reasoning. One that's clearly a 4 (critical bug, production impact). One that's a 2 (style issue, no functional risk). One that's the hard case, a security vulnerability in an unused function. Show the reasoning, not just the score.

Third, the use-case grounding paragraph. Before the rubric, add two or three sentences about what this system is for and who uses the output. "These reviews inform senior engineers' decisions about whether to merge. A false positive wastes a senior engineer's time. A missed severity-1 bug reaches production." That changes how the judge weights its uncertainty.
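Put together, the three pieces might be assembled like this. A sketch only: the class, the function, and the prompt layout are hypothetical, and the angle-bracket strings are placeholders for text the team would write itself.

```python
from dataclasses import dataclass

@dataclass
class FewShotExample:
    summary: str    # short description of the pull request
    score: int      # score the human reviewers agreed on
    reasoning: str  # why that score, in the reviewers' words

# Grounding text quoted from the paragraph above.
GROUNDING = (
    "These reviews inform senior engineers' decisions about whether to merge. "
    "A false positive wastes a senior engineer's time. "
    "A missed severity-1 bug reaches production."
)

def build_judge_prompt(grounding, criteria, examples, pull_request):
    """Assemble the prompt: grounding first, then criteria, examples, target."""
    lines = [grounding, "", "Criteria:"]
    for term, definition in criteria.items():
        lines.append(f"- {term}: {definition}")
    lines += ["", "Scored examples:"]
    for ex in examples:
        lines += [f"Pull request: {ex.summary}",
                  f"Score: {ex.score}",
                  f"Reasoning: {ex.reasoning}", ""]
    lines += ["Pull request to score:", pull_request, "",
              "Give your reasoning first, then a score from 1 to 4."]
    return "\n".join(lines)

criteria = {"functional failure": "<definition written for this backend system>"}
examples = [
    FewShotExample("Critical bug, production impact", 4, "<reviewer reasoning>"),
    FewShotExample("Style issue, no functional risk", 2, "<reviewer reasoning>"),
    # Add the hard case (security vulnerability in an unused function) with
    # whatever score and reasoning the human reviewers agree on.
]
prompt = build_judge_prompt(GROUNDING, criteria, examples, "<diff under review>")
print(prompt)
```

Putting the grounding first is a design choice, not a requirement. The point is that the judge reads what the output is for before it reads the scale.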

The deployment question.

0.42 is below any reasonable production floor for a business-critical system. For a code review tool used by senior engineers, you'd want at least 0.75 before deploying the LLM judge at scale. Ideally 0.85, given the stakes.

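On the dimensions that matter most, Severity Calibration at 0.29 and False Positive Rate at 0.38, a four-level scale used this evenly means the judge disagrees with the humans roughly as often as it agrees. That's not evaluation. That's noise with extra steps.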
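To see what a Kappa means in raw agreement terms, invert the formula: observed agreement = Kappa x (1 - chance agreement) + chance agreement. A back-of-envelope sketch, assuming both raters' scores follow the overall distribution reported earlier. The per-dimension distributions are not given, so these are approximations.

```python
def implied_agreement(kappa, p_e):
    """Observed agreement rate implied by a Kappa and a chance-agreement rate."""
    return kappa * (1 - p_e) + p_e

# Chance agreement from the reported distribution (22/28/27/23 percent),
# assuming both raters share it.
p_e = sum(share ** 2 for share in (0.22, 0.28, 0.27, 0.23))

overall = implied_agreement(0.42, p_e)
false_positive = implied_agreement(0.38, p_e)
severity = implied_agreement(0.29, p_e)

print(f"Overall (0.42):              {overall:.0%}")
print(f"False Positive Rate (0.38):  {false_positive:.0%}")
print(f"Severity Calibration (0.29): {severity:.0%}")
```

Under these assumptions the implied agreement on the two weakest dimensions lands close to one in two.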

Fix the judge prompt. Retest on a fresh sample. If Severity Calibration and False Positive Rate don't clear 0.70, those specific dimensions may need to go to human review rather than the LLM judge.
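The routing rule can be written down so it gets applied the same way after every retest. A sketch with assumptions: the 0.70 floor is applied to every dimension, "clear" is read as greater than or equal to, and the function name and route labels are hypothetical. The figures are the pre-fix numbers from the table, used only to show the mechanics.

```python
def route_dimensions(human_llm_kappa, floor=0.70):
    """Send each dimension to the LLM judge or to human review,
    depending on whether its human-LLM Kappa reaches the floor."""
    routing = {}
    for dimension, kappa in human_llm_kappa.items():
        routing[dimension] = "llm_judge" if kappa >= floor else "human_review"
    return routing

# Pre-fix human-LLM Kappas from the per-dimension table.
before_fix = {
    "Bug Detection": 0.71,
    "False Positive Rate": 0.38,
    "Explanation Quality": 0.51,
    "Severity Calibration": 0.29,
}

routing = route_dimensions(before_fix)
for dimension, route in routing.items():
    print(f"{dimension:<22} -> {route}")
```

Rerun it with the Kappas from the fresh sample once the judge prompt has been revised.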

The one rule that holds across both articles.

When you get a Kappa, ask whether it's measuring agreement or just clustering. Disaggregate. Find where the problem concentrates. Ask what's causing it. Because "the rubric is broken" and "the judge prompt is wrong" are different diagnoses with different fixes.

Two numbers. One question. Everything else follows.