Rubric Diagnosis | LLM Evaluation

Kappa 0.78. The rubric was broken.

The headline number looked fine. Here's how to find what it's hiding.

Say you're evaluating a customer service chatbot for a B2B SaaS company. Four dimensions: Correctness, Completeness, Tone, Helpfulness. 500 test cases. Two human raters.

The overall Kappa comes back 0.78. Substantial agreement on the Landis-Koch scale. The instinct is to call it done.

That instinct is wrong.

The first thing to check after Kappa is the distribution.

4% ones. 12% twos. 71% threes. 13% fours.

Seventy-one percent of all scores are 3s. That number tells you something the Kappa doesn't. When both raters default to 3 whenever they're uncertain, they agree constantly. That drives raw agreement up. Kappa's formula corrects for chance agreement, but when the imbalance is this severe a single coefficient can't tell you whether the raters can separate a 2 from a 4 or simply share a default.

So the question becomes: is 0.78 telling you the rubric works, or is it telling you that two raters found the same default?

You can't tell from Kappa alone. You need the distribution.
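To see how much of the agreement the distribution could explain on its own, here is a minimal sketch of the arithmetic. It assumes the headline figure is an unweighted Cohen's Kappa and that both raters share the 4/12/71/13 distribution; neither is stated above, so treat the output as illustrative.

```python
# Score distribution from the example (assumed identical for both raters).
distribution = {1: 0.04, 2: 0.12, 3: 0.71, 4: 0.13}
kappa = 0.78  # headline figure, assumed to be unweighted Cohen's kappa

# Chance agreement: probability that two independent raters drawing from
# this distribution land on the same score.
p_e = sum(p * p for p in distribution.values())

# Invert kappa = (p_o - p_e) / (1 - p_e) to recover the observed agreement.
p_o = kappa * (1 - p_e) + p_e

# Share of chance agreement that comes from the 3s alone.
share_from_threes = distribution[3] ** 2 / p_e

print(f"chance agreement p_e        = {p_e:.3f}")
print(f"implied observed agreement  = {p_o:.3f}")
print(f"share of p_e from the 3s    = {share_from_threes:.1%}")
```

Under those assumptions, more than half of all pairs would match by chance, and almost all of that chance agreement sits on the 3s. That is why the distribution has to be read alongside the coefficient.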

The second thing to check is the per-dimension breakdown.

The overall Kappa pools all four dimensions into one number. When you pull that apart:

Correctness:    0.82
Completeness:   0.79
Helpfulness:    0.83
Tone:           0.51

Three dimensions cluster between 0.79 and 0.83. Tone is at 0.51. That's moderate agreement at best.

Here's what that tells you. The overall Kappa of 0.78 was being held up by three well-performing dimensions. Tone was in trouble. The aggregate was hiding it.

If you ship this rubric on the basis of 0.78, you deploy a broken Tone dimension at scale.
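The disaggregation step can be sketched in a few lines. This is an illustration, not the pipeline behind the numbers above: the `ratings` data is a hypothetical toy set, `cohen_kappa` is a hand-rolled unweighted Cohen's Kappa, and the 0.70 flag is borrowed from the shipping threshold used later in this piece.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of scores."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)  # undefined if both raters use one score only

# Hypothetical toy ratings: dimension -> (rater A scores, rater B scores).
ratings = {
    "Correctness": ([3, 3, 4, 2, 3, 1, 3, 4], [3, 3, 4, 2, 3, 1, 3, 3]),
    "Tone":        ([3, 4, 3, 2, 3, 3, 4, 2], [3, 3, 2, 2, 4, 3, 4, 3]),
}

per_dimension = {dim: cohen_kappa(a, b) for dim, (a, b) in ratings.items()}

# Print weakest dimension first so the problem is the first thing you see.
for dim, k in sorted(per_dimension.items(), key=lambda kv: kv[1]):
    flag = "  <- investigate" if k < 0.70 else ""
    print(f"{dim:<12} {k:.2f}{flag}")
```

Sorting by Kappa rather than by dimension name is a small design choice: it puts the dimension that needs attention at the top of the report.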

Why Tone specifically?

A Kappa of 0.51 on a single dimension means two trained raters are disagreeing often: roughly a quarter to a third of scores, depending on how skewed that dimension's distribution is. The most common cause is adjectival anchors. The rubric probably says something like: Score 4 is warm and professional. Score 3 is mostly professional. Score 2 is inconsistent.

What does "warm" mean to one rater versus another? One marks a direct, efficient response as professional. The other marks it as cold. Neither is wrong given the wording. The anchor is supposed to settle the question, and it isn't specific enough to do it.

The fix is behavioural anchors. Replace the adjective with an observable. Not "warm" but "acknowledges the user's frustration before moving to the solution." Not "inconsistent" but "switches register mid-response, formal in the opening, casual in the closing."

The test for whether an anchor is behavioural: can a new rater apply it without asking a clarifying question? If they have to interpret it, it's still adjectival.

The third question: what's causing the 71% clustering at 3?

Three hypotheses. They're not mutually exclusive.

Anchor compression. The Score 3 anchor is too wide. It absorbs responses that should be 2 and responses that should be 4. The middle is doing too much work.

Unreachable Score 4. The top anchor sets a standard nothing meets. Raters settle at 3 because it's the highest they can honestly give.

Homogeneous test set. The test cases are genuinely uniform in quality and 3 is the correct answer for most of them. This is a test set problem, not a rubric problem.

You can't distinguish these three from the numbers alone. You'd need to read the anchor wording or sample the test cases. But you can name which additional information you'd want before drawing a conclusion. In an interview, in a client review, that's the right move. Name what you know. Name what you don't. Name what you'd look for next.

What to do.

Rewrite the Tone anchors first, behavioural all the way through. Run a retest on 50 cases with the same two raters. If Tone Kappa comes up above 0.70, the anchors were the problem. If it doesn't move, investigate anchor compression or test set quality from there.

Don't ship until Tone clears 0.70. For a customer-facing system, the way the bot sounds is not a secondary concern.

The diagnostic stack.

When you get a Kappa, move through this in order:

1. What's the headline number? Note it, but don't commit to a verdict yet.
2. What's the score distribution? Is the agreement real or just clustering?
3. Which dimensions are pulling the headline number down? Disaggregate before drawing any conclusion.
4. What does disagreement look like when it happens? Adjacent scores (3 vs 2) suggest anchor compression. Distant scores (4 vs 1) suggest the dimension is conceptually broken.
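The last check, what disagreement looks like, can be sketched as a count of disagreements by score distance. The scores below are hypothetical, and treating any gap larger than 1 as "distant" is an assumption made for this example on a 1 to 4 scale.

```python
from collections import Counter

def disagreement_profile(a, b):
    """Count disagreements between two raters by absolute score distance."""
    gaps = Counter(abs(x - y) for x, y in zip(a, b) if x != y)
    total = sum(gaps.values())
    adjacent = gaps.get(1, 0)      # one point apart, e.g. 3 vs 2
    distant = total - adjacent     # two or more points apart, e.g. 4 vs 1
    return {"total": total, "adjacent": adjacent,
            "distant": distant, "by_gap": dict(gaps)}

# Hypothetical Tone scores from two raters on a 1-4 scale.
rater_a = [3, 4, 3, 2, 3, 3, 4, 1, 3, 2]
rater_b = [3, 3, 2, 2, 4, 3, 1, 4, 3, 3]

profile = disagreement_profile(rater_a, rater_b)
print(profile)
```

A profile dominated by adjacent gaps points toward anchor wording. A meaningful share of distant gaps is the signal to question the dimension itself.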

Kappa is the entry point. Not the answer.
