Pipeline ArchitectureMulti-Signal AI

Never split the patient.

A common architecture decision that looks efficient and quietly makes your health AI blind to the findings that matter most.

There's a design decision that shows up constantly in health AI, and it looks reasonable until you understand what it's costing you.

Blood reports have many markers. Rather than feed the entire report into one AI call, the instinct is to chunk it. Marker by marker. Keep each call small, focused, interpretable. It seems efficient. It maps cleanly to how most data pipelines are built.

The system answers correctly on most markers it's asked about. The accuracy numbers look fine.

The system cannot detect a single cross-marker pattern. It has never seen one. Each call only sees one marker.

Why this is a structural failure, not a prompt problem.

There are conditions where every individual value sits within reference range, but the combination is the signal. A B12 level that looks acceptable. A methylation marker at the low edge of normal. A cognitive complaint in the intake notes. Individually, none of these flag. Together, in a senior patient with a specific pharmacogenomic profile, they indicate a real clinical concern.

If the AI only ever sees one marker per call, it can never find this pattern. You can improve the prompt all you want. The architecture prevents the finding.

This is the rule: split by task, never by data. Every AI call that's meant to assess a patient gets the full patient context. The chunking you do is functional, what question is this call answering, not data-based, which slice of the report does this call see.

What the correct architecture looks like.

One call assesses glycometabolic risk. It gets the full report.

One call assesses cardiovascular markers. It gets the full report.

One call assesses the methylation and cognitive domain. It gets the full report.

One call synthesises the findings. It gets all outputs from above, plus the full report.

The full report is present in every call. The task each call is answering is what's different, not the data it can see.

This is more expensive per call. It's the correct trade-off. A health system that misses cross-marker patterns because the architecture was optimised for token efficiency is not a health system. It's a more expensive version of looking at each number in isolation.

The Indian market note.

A full-body panel in the Indian diagnostic market runs 80 to 100 parameters or more, depending on the lab. Reference implementations built for Western markets often assume 30 to 40 markers. The context window requirements, the call cost, the synthesis complexity: all of it is larger here.

Design for the actual population you're serving, not the reference implementation you learned from.

Two distinct eval types on the same pipeline.

Once the architecture is right, there are two separate evaluation problems on a health AI pipeline. They're regularly conflated, and conflating them produces misleading results.

The first is detection evals. Is the pipeline catching the conditions it's supposed to catch? The right metrics here are MCC and recall-first. MCC because the classes are imbalanced (most panels don't have a critical finding, so accuracy alone is not enough). Recall-first because in a clinical context, a missed positive is more dangerous than a false positive.

The second is judge-validation evals. Is the LLM judge that scores the pipeline's outputs reliable? The right metrics here are Kappa and MCC comparing two raters. This is not the same question as detection accuracy. A pipeline can catch conditions well and still be measured by an unreliable judge. The two questions need separate answers.

Most teams run one eval and assume it covers both. It doesn't. Build them separately. Run them on separate cadences.

The one question to ask first.

Before building any AI pipeline that processes multi-signal data, ask: is there any finding I need to surface that requires seeing more than one signal at once?

If yes, and in health it almost always is, the architecture must ensure every call that contributes to that finding has the full context. Not the relevant slice. The full context.

Never split the patient.

You can't evaluate what you haven't defined. →