Task Efficacy Evaluation Foundations

You can't evaluate what you haven't defined.

Most teams start building the eval before they've written a north star sentence. Here's the sequence that actually works.

Most teams build the eval before they've written a north star sentence. They pick dimensions. They write anchors. They run the judge. The scores come back. They ship.

Six months later, the users are struggling and nobody knows why. The scores were fine. The task wasn't working.

The problem is almost always the same. Task completion was never defined precisely enough to measure.

The question that almost never gets asked.

What would success actually look like, from the user's perspective, at the end of this interaction?

Not what the system outputs. What the user is now able to do that they couldn't do before.

A well-written response that answers the wrong question scores well on quality dimensions and fails the task. A thorough answer that requires three follow-up questions to act on is not a complete answer. A technically correct recommendation the user can't apply to their specific situation is not an actionable answer.

The question is not: is this response good? The question is: did the user accomplish what they came to do?

Those are different questions. They have different answers.

The six things you need before you write a single test case.

What does the user arrive with? Not what you assume. What information, context, or problem are they actually bringing?

What do they need to accomplish? Not what they ask for literally. What they actually need. These often differ.

What would a successful output contain? Not what high quality looks like in the abstract. What specific elements need to be present for this user to take their next step.

What would constitute failure despite surface correctness? The ways a technically correct response still fails the task. Correct diagnosis, wrong fix. Right answer, wrong level of detail. Accurate information, wrong context.

What observable signal tells you completion happened? What can you actually track? No follow-up question on the same topic. User applies the output without requesting revision. The downstream action was taken.

What are the task's failure modes? The specific ways this task breaks in production. Not failure modes in general. For this task type, with this user population, in this system.

Until you have answers to all six, you don't have a task definition. You have an intent. You can't evaluate an intent.
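One way to make the six questions hard to skip is to treat them as required fields. The sketch below is an illustration only: the class name, the field names, and the example values are all hypothetical, chosen to mirror the six questions above. It assumes an empty answer means the question hasn't been answered yet.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TaskDefinition:
    """The six answers needed before writing a test case (names are illustrative)."""
    user_arrives_with: str = ""
    user_needs_to_accomplish: str = ""
    successful_output_contains: list[str] = field(default_factory=list)
    failures_despite_correctness: list[str] = field(default_factory=list)
    completion_signals: list[str] = field(default_factory=list)
    task_failure_modes: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_defined(self) -> bool:
        """All six answered: a task definition. Anything less is an intent."""
        return not self.missing()


# Hypothetical draft with only three of the six questions answered.
draft = TaskDefinition(
    user_arrives_with="An error message and a partial stack trace",
    user_needs_to_accomplish="Get the deployment running again",
    completion_signals=["No follow-up question on the same topic"],
)
print(draft.is_defined())  # False: three questions are still open
print(draft.missing())
```

A check like this could gate test-case writing: if `missing()` returns anything, the work goes back to definition, not forward to the eval.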

Task type determines everything downstream.

There are six task types. Each has different completion criteria, different primary failure modes, and a different evaluation architecture.

Information retrieval: did the user get the right, current, contextually relevant information? The primary failure is technically accurate but outdated or decontextualised information.

Analytical and reasoning: did the system reach the right conclusion through valid reasoning? The primary failure is plausible-sounding conclusions built on flawed logic.

Generative and creative: did the output do what it was created to do? The primary failure is well-formed output that doesn't serve its purpose.

Instructional and procedural: can the user follow the instructions and succeed? The primary failure is a missing step.

Decision support: does the user have what they need to make an informed decision? The primary failure is biased framing or hidden trade-offs.

Agentic: was the intended real-world outcome achieved, without unintended side effects? The primary failure is irreversible actions taken with insufficient confidence.

Each of these needs different dimensions, different weights, different anchors, different signals for verifying completion. If you don't identify the task type before writing dimensions, you're borrowing from the wrong architecture.
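The six task types can be written down as a lookup, so the completion question and the primary failure are fixed before anyone drafts a dimension. This is a minimal sketch: the enum, the `TaskProfile` tuple, and `profile_for` are hypothetical names, and the strings are taken from the descriptions above.

```python
from enum import Enum
from typing import NamedTuple


class TaskType(Enum):
    INFORMATION_RETRIEVAL = "information retrieval"
    ANALYTICAL = "analytical and reasoning"
    GENERATIVE = "generative and creative"
    INSTRUCTIONAL = "instructional and procedural"
    DECISION_SUPPORT = "decision support"
    AGENTIC = "agentic"


class TaskProfile(NamedTuple):
    completion_question: str
    primary_failure: str


PROFILES = {
    TaskType.INFORMATION_RETRIEVAL: TaskProfile(
        "Did the user get the right, current, contextually relevant information?",
        "technically accurate but outdated or decontextualised information",
    ),
    TaskType.ANALYTICAL: TaskProfile(
        "Did the system reach the right conclusion through valid reasoning?",
        "plausible-sounding conclusions built on flawed logic",
    ),
    TaskType.GENERATIVE: TaskProfile(
        "Did the output do what it was created to do?",
        "well-formed output that doesn't serve its purpose",
    ),
    TaskType.INSTRUCTIONAL: TaskProfile(
        "Can the user follow the instructions and succeed?",
        "a missing step",
    ),
    TaskType.DECISION_SUPPORT: TaskProfile(
        "Does the user have what they need to make an informed decision?",
        "biased framing or hidden trade-offs",
    ),
    TaskType.AGENTIC: TaskProfile(
        "Was the intended real-world outcome achieved, without unintended side effects?",
        "irreversible actions taken with insufficient confidence",
    ),
}


def profile_for(task_type: TaskType) -> TaskProfile:
    """Look up the completion question and primary failure for a task type."""
    return PROFILES[task_type]


print(profile_for(TaskType.AGENTIC).primary_failure)
```

Dimensions, weights, and anchors would hang off the same key; they are left out here because they depend on the specific system.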

The most dangerous quadrant.

Task completion and task quality are independent. A system can complete the task every time and do it badly. A system can complete the task only some of the time and do it well whenever it does.

The most dangerous quadrant is high completion, low quality. The system appears to work. Standard metrics look acceptable. Users gradually lose trust without being able to articulate why. The dashboard shows no problem. The users are already leaving.

This is harder to detect than low completion. When tasks don't complete, people tell you. When tasks complete poorly, people just stop coming back.

The way to detect it: run completion and quality as separate measurements. Completion is binary. Did this task get done? Quality is continuous. How well was it done? Keep them separate. Don't let one mask the other.
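A rough sketch of what keeping the two measurements separate might look like. Everything here is an assumption for illustration: the `Interaction` record, the 0.0 to 1.0 quality scale, and the two threshold values are made up, and real thresholds would come from your own data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    completed: bool  # binary: did this task get done?
    quality: float   # continuous: how well, on an assumed 0.0-1.0 scale


def summarize(interactions, completion_bar=0.8, quality_bar=0.7):
    """Report completion and quality side by side, never as one blended score."""
    completion_rate = mean(1.0 if i.completed else 0.0 for i in interactions)
    mean_quality = mean(i.quality for i in interactions)
    completion_label = "high" if completion_rate >= completion_bar else "low"
    quality_label = "high" if mean_quality >= quality_bar else "low"
    return {
        "completion_rate": completion_rate,
        "mean_quality": mean_quality,
        "quadrant": f"{completion_label} completion, {quality_label} quality",
    }


# Hypothetical sample: every task completes, none of them well.
sample = [Interaction(True, q) for q in (0.5, 0.4, 0.6, 0.5, 0.5)]
report = summarize(sample)
print(report["quadrant"])  # high completion, low quality
```

The point of returning both numbers plus the quadrant is that a dashboard built on `completion_rate` alone would show this sample as healthy.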

Goodhart's Law, applied here.

When task completion rate becomes the optimisation target, the system finds ways to appear to complete tasks.

A customer service system that closes every ticket by issuing a refund has a 100% resolution rate. It has poor task efficacy. The underlying problem was never solved. The user will be back.

The fix is pairing rubric scores with real-world signals that can't be gamed by the system. Follow-up rates. Re-contact rates. Downstream action rates. User-reported outcomes. When your rubric scores and your real-world signals diverge, the rubric is drifting from reality. Fix the rubric.
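The divergence check can be as simple as comparing the rubric score with one signal the system can't game. The sketch below uses re-contact rate; the function name, the 0.0 to 1.0 scales, and the `max_gap` threshold are hypothetical, and the numbers in the example are invented to match the refund scenario above.

```python
def rubric_is_drifting(rubric_score, recontact_rate, max_gap=0.25):
    """Flag a rubric that scores well while users keep coming back.

    rubric_score   -- mean judge score, assumed to be on a 0.0-1.0 scale
    recontact_rate -- share of users who returned about the same issue
    max_gap        -- illustrative tolerance before the rubric is suspect
    """
    real_world_success = 1.0 - recontact_rate
    gap = rubric_score - real_world_success
    return gap > max_gap


# Hypothetical refund-everything bot: near-perfect rubric score,
# but 40% of users contact support again about the same problem.
print(rubric_is_drifting(rubric_score=0.95, recontact_rate=0.40))  # True

# Hypothetical healthy system: rubric and real-world signal agree.
print(rubric_is_drifting(rubric_score=0.85, recontact_rate=0.10))  # False
```

When the function returns `True`, the response is to revise the rubric, not to tune the system until the score goes up.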

The sequence.

Define the task. Write the north star sentence. List the failure modes. Select dimensions from that list. Write behavioural anchors. Build the golden dataset. Run the eval. Validate the judge. Diagnose the failures. Fix and re-run.

In that order.
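If it helps to make the ordering explicit, the sequence can be encoded as a list with a guard that refuses to skip ahead. This is a sketch only; `SEQUENCE` and `next_step` are hypothetical names, and the step labels are paraphrased from the list above.

```python
SEQUENCE = [
    "define the task",
    "write the north star sentence",
    "list the failure modes",
    "select dimensions",
    "write behavioural anchors",
    "build the golden dataset",
    "run the eval",
    "validate the judge",
    "diagnose the failures",
    "fix and re-run",
]


def next_step(completed):
    """Return the next step, or None when the sequence is finished.

    Raises ValueError if the completed steps are not a prefix of SEQUENCE,
    for example when someone ran the eval before defining the task.
    """
    completed = list(completed)
    if completed != SEQUENCE[: len(completed)]:
        raise ValueError(f"steps done out of order: {completed}")
    if len(completed) == len(SEQUENCE):
        return None
    return SEQUENCE[len(completed)]


print(next_step([]))  # define the task
print(next_step(["define the task", "write the north star sentence"]))
```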

You can't evaluate what you haven't defined.
