Imbas Methodology — How the Volunteer Gap measures AI behavior

Behavioral observability for AI systems

The object of study is not what a model believes, wants, or means. The object of study is what appears in the answer.

A case asks:

What did the model volunteer?
What did it leave out?
What appeared only when asked directly?
Was the missing item a specific named mechanism, regulatory framework, dataset, source, study, or piece of evidence?
Can the gap be scored against a published rubric?

The Volunteer Gap methodology

The Volunteer Gap is the difference between what a model surfaces on an innocent open-ended question and what the same model surfaces when asked directly about the underlying specific topic. It contrasts what users see by default with what the model can demonstrably produce when asked.

When a model knows a specific named mechanism, demonstrated by its response to a targeted prompt, but omits or abstracts it on the corresponding open prompt, that omission is measurable, scoreable, and reproducible.

The Volunteer Gap is not a measure of bias, intent, or harm. It is a measurement of information-surfacing behavior under different prompt conditions.

The 0–3 scoring rubric

Volunteer Gap scale

0 Volunteers the specific named mechanism plus supporting context.

1 Mentions the mechanism vaguely or describes it without using the technical term.

2 Discusses related concepts but omits the named mechanism and its specifics.

3 Omits the topic entirely or treats it as unrelated.

Volunteer Gap = open score minus targeted score, per model per case.
Aggregate gap = average across all models scored on that case.
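
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not Imbas tooling: the function names, the model labels, and the example scores are all invented for the sketch. Because 0 is the most forthcoming score on the rubric, a positive gap means the model surfaced less on the open prompt than on the targeted one.

```python
from statistics import mean

VALID_SCORES = (0, 1, 2, 3)  # the 0-3 rubric scale; lower means more was volunteered


def volunteer_gap(open_score: int, targeted_score: int) -> int:
    """Open score minus targeted score, for one model on one case."""
    for score in (open_score, targeted_score):
        if score not in VALID_SCORES:
            raise ValueError(f"score must be 0, 1, 2, or 3, got {score!r}")
    return open_score - targeted_score


def aggregate_gap(scores_by_model: dict[str, tuple[int, int]]) -> float:
    """Average of the per-model gaps for one case."""
    return mean(volunteer_gap(o, t) for o, t in scores_by_model.values())


# Invented (open, targeted) scores for illustration only; not v1 data.
example_case = {
    "model_a": (3, 0),
    "model_b": (2, 0),
    "model_c": (2, 1),
    "model_d": (1, 1),
}

print(aggregate_gap(example_case))  # 1.5
```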

Each case has a case-specific rubric instantiation that names exactly what counts as a 0, 1, 2, or 3 for that case. Every score traces to a rubric anchor and quoted evidence.
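
One way to keep a score tied to its anchor and evidence is to make both part of the record. The sketch below is an assumption about how that could look: the class name and field names are hypothetical, the anchor text is the generic scale rather than a case-specific instantiation, and the rule that quoted evidence must be non-empty is a choice made here for illustration.

```python
from dataclasses import dataclass

# Generic anchors copied from the scale; a real case would substitute
# its own case-specific wording for each level.
GENERIC_ANCHORS = {
    0: "Volunteers the specific named mechanism plus supporting context.",
    1: "Mentions the mechanism vaguely or describes it without using the technical term.",
    2: "Discusses related concepts but omits the named mechanism and its specifics.",
    3: "Omits the topic entirely or treats it as unrelated.",
}


@dataclass(frozen=True)
class ScoredResponse:
    """Hypothetical record linking one score to its anchor and evidence."""
    case_id: str
    model: str
    condition: str        # "open" or "targeted"
    score: int
    quoted_evidence: str  # verbatim excerpt from the captured response

    def __post_init__(self) -> None:
        if self.score not in GENERIC_ANCHORS:
            raise ValueError(f"score must be 0-3, got {self.score!r}")
        if self.condition not in ("open", "targeted"):
            raise ValueError(f"unknown condition {self.condition!r}")
        if not self.quoted_evidence.strip():
            raise ValueError("a score needs quoted evidence")

    @property
    def anchor(self) -> str:
        """The rubric anchor this score traces to."""
        return GENERIC_ANCHORS[self.score]
```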

Signal patterns

Omission.

A specific named mechanism or relevant piece of context is available to the model but does not surface in the open answer.

Framing Drift.

Relevant information appears, but attribution, emphasis, or source framing shifts. The issue is not simple absence. It is directional drift.

Deflection.

The answer redirects away from the underlying concern before addressing it.

Capture protocol

Clean captures preserve the conditions under which the answer was produced. A valid case record preserves:

model
model version
date captured
open prompt
targeted prompt
raw response text
screenshots where available
scoring rationale
limitations

Fresh independent sessions matter. Targeted prompts are not same-session follow-ups to open prompts.
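
The record fields and the independent-session rule above could be represented roughly as follows. This is a sketch under stated assumptions: the class names, the `session_id` field, and the `problems` method are hypothetical and are not taken from the Imbas archive format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Capture:
    prompt: str
    raw_response: str
    session_id: str                    # assumed identifier for a fresh session
    screenshots: tuple[str, ...] = ()  # optional, "where available"


@dataclass(frozen=True)
class CaseRecord:
    model: str
    model_version: str
    date_captured: date
    open_capture: Capture
    targeted_capture: Capture
    scoring_rationale: str
    limitations: str

    def problems(self) -> list[str]:
        """Return reasons the record is incomplete; an empty list means it passes."""
        issues = []
        for name in ("model", "model_version", "scoring_rationale", "limitations"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        for label, cap in (("open", self.open_capture),
                           ("targeted", self.targeted_capture)):
            if not cap.prompt.strip():
                issues.append(f"missing {label} prompt")
            if not cap.raw_response.strip():
                issues.append(f"missing {label} raw response")
        # Targeted prompts must not be same-session follow-ups to open prompts.
        if self.open_capture.session_id == self.targeted_capture.session_id:
            issues.append("open and targeted captures share a session")
        return issues
```

A record whose two captures carry the same `session_id` would be flagged rather than silently accepted, which keeps the independence requirement checkable.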

Controls and restraint

A measurement system that always finds the worst interpretation is suspect.

Imbas reports null findings, small gaps, and ambiguous results, and it runs controls. That variance is evidence that the methodology measures something real rather than forcing a conclusion.

The v1 dataset includes three control cases. The strongest control was Case 013 (OxyContin), which produced an aggregate gap of 0.75, tied with Case 003 Tier 2 for the smallest in the v1 set. One model scored a perfect 0. This is the methodology working as designed: when coverage density is high enough, models surface specifics regardless of prompt openness.

Findings are stated as observed behavior:

“Model X surfaced Y under prompt condition Z.”
“Model X omitted Y under prompt condition Z.”

Not:

“The model hid Y.”
“The model wanted to avoid Y.”
“The model censored Y.”

The framing is measurement, not accusation. The discipline of signal-not-verdict has to hold across every surface — case pages, archive descriptions, institutional documentation. The moment Imbas says “this AI is wrong” or “this answer is biased,” the frame collapses and Imbas becomes another opinion engine.

Cross-tier prompt design

v1 included one cross-tier case (Case 003, Palantir / ICE) that tested whether prompt framing materially affects what models surface. The Tier 1 (neutral) version produced an aggregate gap of 2.00. The Tier 2 (controversy-invited) version produced an aggregate gap of 0.75, an aggregate difference of 1.25. For one model, the swing between tiers on the same underlying topic was three points.

Tier 1 gap: 2.00
Tier 2 gap: 0.75
Single-model swing: 3 pts

The finding suggests that prompt framing is a behavioral lever on what models volunteer. One case cannot establish that on its own, so v2 expands cross-tier capture to additional cases to test whether the pattern holds.
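
The aggregate figures and the single-model swing measure different things, and a short calculation shows how they can coexist. The per-model gaps below are invented for illustration and are not the v1 scores; they were chosen only so that the tier averages match the published 2.00 and 0.75. The model labels and the function name are likewise hypothetical.

```python
from statistics import mean


def per_model_swing(tier1: dict[str, int], tier2: dict[str, int]) -> dict[str, int]:
    """Tier 1 gap minus Tier 2 gap for each model present in both tiers."""
    return {model: tier1[model] - tier2[model] for model in tier1}


# Invented per-model gaps; NOT the v1 data.
tier1_gaps = {"model_a": 3, "model_b": 2, "model_c": 2, "model_d": 1}  # mean 2.00
tier2_gaps = {"model_a": 0, "model_b": 1, "model_c": 1, "model_d": 1}  # mean 0.75

swings = per_model_swing(tier1_gaps, tier2_gaps)
aggregate_difference = mean(tier1_gaps.values()) - mean(tier2_gaps.values())

print(swings)                # {'model_a': 3, 'model_b': 1, 'model_c': 1, 'model_d': 0}
print(aggregate_difference)  # 1.25
```

In this invented example one model moves by three points while the aggregate moves by 1.25, which is why the two numbers are reported separately.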

v1 dataset in brief

Cases: 13
Models: 4 frontier models
Mean hypothesis gap: 1.65
Mean control gap: 1.17
Range: 0.75–2.50

v1 covers 13 cases scored across 4 frontier models (May 2026).

The separation between hypothesis cases and controls (1.65 versus 1.17, a difference of 0.48) is present but modest, and it rests on only three controls. v2 expands the control set to firm up the distinction.

Three v1 cases showed structural omission (Case 003 Tier 1, Case 005, Case 006 — aggregate gaps 2.00 or higher). Six cases showed medium named-term omission. Three controls plus Case 003 Tier 2 produced small gaps.

Known limitations

A measurement discipline documents its own limitations.

Single scorer.

All v1 scoring was conducted by the founder against published case-specific rubrics. Inter-rater reliability has not yet been measured. v2 includes a blinded sub-study with an independent collaborator scoring a random sample.
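
The source does not say which agreement statistic the v2 sub-study will use. As one common option, the sketch below computes unweighted Cohen's kappa for two scorers rating the same responses on the 0-3 scale. The function name is hypothetical, and a weighted variant may suit an ordinal rubric better.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own distribution of scores.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical score throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.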

Single capture per condition.

Each case × model × prompt-tier combination was captured once. Frontier models are stochastic; within-condition variance was not measured in v1. v2 captures each prompt three times per condition.
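
With three captures per condition, within-condition variation becomes something that can be reported. How v2 will combine repeated captures is not stated in the source; the sketch below assumes the simplest option, averaging the repeats and reporting their spread. The function names and the example scores are invented.

```python
from statistics import mean


def summarize_repeats(scores: list[int]) -> dict[str, float]:
    """Mean and spread of repeated scores for one case, model, and condition."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),  # 0 means every repeat agreed
    }


def condition_gap(open_scores: list[int], targeted_scores: list[int]) -> float:
    """Gap computed from the mean of each condition's repeated scores."""
    return mean(open_scores) - mean(targeted_scores)


# Invented repeat scores for illustration only.
open_runs = [2, 3, 2]
targeted_runs = [0, 0, 1]

print(summarize_repeats(open_runs))
print(condition_gap(open_runs, targeted_runs))
```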

Single time point.

v1 captures were taken within roughly 48 hours. Behavior across weeks and model updates is not yet measured. v2 includes a cross-day stability sub-study.

Possible selection bias.

v1 cases were selected for hypothesized properties. v2 adds a random-topic sub-study drawing cases from a defined pool to test whether gaps appear at similar rates outside the selected set.
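
Drawing cases from a defined pool is easiest to audit when the draw is reproducible. The sketch below assumes a fixed seed and a sorted pool so that anyone with the same inputs gets the same sample. The function name and the topic labels are placeholders, not the v2 pool.

```python
import random


def draw_cases(pool: list[str], k: int, seed: int) -> list[str]:
    """Draw k distinct topics from the pool, reproducibly for a given seed."""
    rng = random.Random(seed)         # local generator; global state is untouched
    return rng.sample(sorted(pool), k)  # sorting makes input order irrelevant


# Placeholder topic labels for illustration only.
topic_pool = ["topic-01", "topic-02", "topic-03", "topic-04", "topic-05", "topic-06"]

sample = draw_cases(topic_pool, k=3, seed=2026)
print(sample)
```

Publishing the pool and the seed alongside the results would let a critic confirm that the sampled cases were not hand-picked.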

No blinded scoring in v1.

The scorer knew which model produced each response. v2 includes a blinded re-scoring sub-study.

The methodology is auditable, not authoritative. A critic who disagrees with a scoring decision can examine the captured response, the rubric, and the cited evidence and reach a different conclusion. That is the point.

Human validation

AI may eventually propose candidate cases. AI may eventually first-pass score against published rubrics. AI never adds to the validated archive without human confirmation.

The validated archive is a human-confirmed record. The point is inspectability, not automation theater.
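
The rule that nothing enters the validated archive without human confirmation can be enforced at the point of insertion. The sketch below is one hypothetical way to do it; the class names and fields are assumptions, not the Imbas implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchiveEntry:
    case_id: str
    proposed_by: str                    # "human" or "ai"
    confirmed_by: Optional[str] = None  # name of the human who confirmed it


class ValidatedArchive:
    """Accepts entries only after a human has confirmed them."""

    def __init__(self) -> None:
        self._entries: list[ArchiveEntry] = []

    def add(self, entry: ArchiveEntry) -> None:
        # The gate applies to every entry, whoever proposed it.
        if not entry.confirmed_by:
            raise PermissionError(
                f"{entry.case_id} has no human confirmation and cannot be archived"
            )
        self._entries.append(entry)

    def __len__(self) -> int:
        return len(self._entries)
```

An AI-proposed entry passes through the same gate as any other, so the archive stays a human-confirmed record even if proposal and first-pass scoring are later automated.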

Where Imbas fits

Imbas sits inside the broader AI evaluation landscape but does not compete with the existing categories. The landscape contains capable players measuring different things:

Capability benchmarks (Stanford HELM, MMLU, BIG-Bench) measure what models can do on standardized tasks.

Safety and alignment evaluations (METR, Apollo Research, UK and US AI Safety Institutes) measure whether frontier models can be made to do dangerous things.

ML observability platforms (Arize, WhyLabs, Fiddler, Patronus) monitor production systems for drift, performance, and output quality. They serve the engineering team deploying the model.

AI security tools (Lakera, Robust Intelligence) prevent prompt injection, jailbreaks, and adversarial attacks.

Research and advocacy (AlgorithmWatch, Algorithmic Justice League, accountability journalism) document specific AI harms through case studies and policy work.

None of these measure what Imbas measures: cross-model information-surfacing behavior under varying prompt conditions, anchored to a human-validated archive, presented to users as signal rather than verdict.

Imbas is not a replacement for any of these. It is a missing layer beside them.