T.Erbe
Opus 4.8 · Pass 1 High · Pass 2 Max

Project 10 · Quality Validation / Experiments 3 & 3.5 / May 2026

Does parse-OK actually measure quality?

Two rounds of scoring on real Illinois bill extractions — first across models, then within each model across chunk sizes. The short answer: yes across models, no within models. Here is every chunk, every model, and the rationale behind every score.

120 Chunks Scored: 20 chunks × 6 profiles (Phase 1) + 5 pairs (Phase 2)
Opus 4.8, Both Passes: Pass 1 high effort · Pass 2 max effort, independent
60 / 60 Cross-Model Confirmed: Phase 1 · parse-OK ranks models correctly
0 / 4 Within-Model Correct: Phase 2 · parse-OK fails to rank chunk sizes

01 / Question

Does parse-OK rank models correctly?

The cross-model question: when parse-OK says configuration A scores higher than B, does substantive quality agree? Answer: yes — confirmed on all 60 cells.

Three profiles at their quality-best

llama31_1200 — Llama 3.1 8B AWQ · 1200-token chunks. Parse-OK peak: 92.5%.

qwen_500 — Qwen 2.5 7B AWQ · 500-token chunks. Best Qwen parse-OK at time of review: 62.5%.

mistral_3000 — Mistral 7B AWQ · 3000-token chunks. Parse-OK peak: 72.0%.

Each profile pairs a model with a chunk size — context-window dimension is part of what's being evaluated.

Two independent passes

Pass 1 — Claude Opus 4.8 at high effort. Scored each profile's extractions against source text, blind to parse_status. 60 cells.

Pass 2 — Claude Opus 4.8 at max effort, fresh context, no access to pass-1 scores. Independent re-evaluation of all 60 cells.

Result: 60/60 agreement. Zero cells revised. Pass 2 found the pass-1 scoring internally consistent and rubric-aligned.
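The agreement count can be reproduced mechanically once both passes are stored per cell. The following is a minimal sketch, assuming each pass is a dict keyed by a hypothetical (profile, chunk_index) pair with an (Accuracy, Specificity, Completeness) tuple as the value. The storage format, the function name, and the sample scores are illustrative and are not taken from the review.

```python
def pass_agreement(pass1, pass2):
    """Compare two scoring passes cell by cell.

    Returns (agreeing_cells, total_cells, disagreeing_keys).
    """
    if pass1.keys() != pass2.keys():
        raise ValueError("both passes must score the same set of cells")
    # A cell agrees only if all three rubric dimensions match exactly.
    disagreements = sorted(key for key in pass1 if pass1[key] != pass2[key])
    return len(pass1) - len(disagreements), len(pass1), disagreements


# Made-up scores for illustration only; not data from the review.
p1 = {("profile_a", 1): (2, 2, 1), ("profile_a", 2): (1, 2, 0)}
p2 = {("profile_a", 1): (2, 2, 1), ("profile_a", 2): (1, 2, 1)}

agree, total, revised = pass_agreement(p1, p2)
print(f"{agree}/{total} cells agree; revised: {revised}")
```

Exact-match comparison is the strictest reading of "agreement"; a looser definition (for example, matching per-cell averages) would be a different design choice.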

◆ Critical methodological choice

Scoring used chunk-fidelity (did the model extract this chunk correctly?), not whole-bill understanding. At Chunk 8, mistral_3000's large context window produces a highly specific answer about a different section of the same bill. Under chunk-fidelity that is Accuracy: 1; under whole-bill summarization it would flip to Accuracy: 2. Both passes flagged this as the single call the result turns on.

02 / Rubric

Three dimensions, scored 0–2

Per-cell average = (Accuracy + Specificity + Completeness) / 3.

Accuracy

0 · Wrong: answers about the wrong thing.
1 · Partially right: right topic, off-focus or off-section.
2 · Accurate: faithful to this chunk.

Specificity

0 · Generic: boilerplate, no detail.
1 · Somewhat specific.
2 · Names the actual mechanism.

Completeness

0 · Surface phrase only.
1 · Partial point captured.
2 · Substantive point captured.
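The rubric arithmetic above can be written down directly. This is a minimal sketch: the class and function names are hypothetical, and the profile-level mean assumes every cell is weighted equally, which the report implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellScore:
    """One scored cell: each rubric dimension is an integer 0, 1, or 2."""
    accuracy: int
    specificity: int
    completeness: int

    def __post_init__(self):
        for name in ("accuracy", "specificity", "completeness"):
            value = getattr(self, name)
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")

    @property
    def average(self) -> float:
        # Per-cell average = (Accuracy + Specificity + Completeness) / 3
        return (self.accuracy + self.specificity + self.completeness) / 3


def profile_mean(cells):
    """Unweighted mean of per-cell averages (an assumed aggregation)."""
    cells = list(cells)
    if not cells:
        raise ValueError("no cells to aggregate")
    return sum(cell.average for cell in cells) / len(cells)


# Illustrative cell: specific but off-section and incomplete.
drifted = CellScore(accuracy=1, specificity=2, completeness=0)
print(drifted.average)
```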

03 / Results

Final standing

Aggregate scores across all 20 chunks. Pass 1 and pass 2 agree on all 60 cells.

Quality scores — A / S / C breakdown per profile

◆ The finding

Context-window size interacts with extraction fidelity. The 3000-token config imports content from other sections of the same bill and loses the chunk's headline; the 1200-token config stays on-chunk most reliably; the 500-token config is faithful but too compressed to be complete. On six self-contained single-amendment bills all three converge near 2.0 — on multi-section bills, mistral collapses while llama holds. This is a property of the configurations (model + chunk size together), not the models alone.

04 / The Review

All 20 chunks, every extraction

Each chunk entry pairs the source text with each model's full extraction across all 11 fields and the pass-1 score, with the rationale confirmed by pass 2.

01 / Question

Does parse-OK rank chunk sizes within a model?

Phase 1 confirmed cross-model reliability. Phase 2 asks a harder question: when parse-OK says chunk size X is better than Y for the same model, does substantive quality agree?

◆ Answer: No

In every informative pair, the smaller chunk size scored ≥ the larger. The parse-OK peak is not the quality peak for any of the three models. The mechanism is consistent: larger chunks cause off-section drift, producing parse-OK-invisible failures — the model populates all 11 fields with confident, specific answers about a different section of the same bill.

This does not change the production recommendation. The cross-model ranking (Phase 1, 60/60) is unaffected. Llama leads by 20+ percentage points. What this finding adds: the proxy's within-model reliability is now characterized — directional for Llama, reversed for Qwen and Mistral.
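The within-model check reduces to asking, for each pair of chunk sizes, whether the proxy and the quality score move in the same direction. The sketch below is one way to express that test. The function name is hypothetical, and treating a pair as uninformative when either metric is tied is an assumption, since the report does not define "informative".

```python
def proxy_agrees(parse_ok_small, parse_ok_large, quality_small, quality_large):
    """Does parse-OK rank a (smaller, larger) chunk-size pair the same way
    substantive quality does?

    Returns True if both metrics favour the same chunk size, False if they
    favour opposite sizes, and None if either metric is tied (assumed here
    to make the pair uninformative).
    """
    proxy_delta = parse_ok_large - parse_ok_small
    quality_delta = quality_large - quality_small
    if proxy_delta == 0 or quality_delta == 0:
        return None
    return (proxy_delta > 0) == (quality_delta > 0)


# Made-up numbers for illustration only; not results from the review.
# parse-OK prefers the larger chunk, quality prefers the smaller one.
print(proxy_agrees(parse_ok_small=60.0, parse_ok_large=70.0,
                   quality_small=1.8, quality_large=1.2))
```

Counting the True results over all non-None pairs gives a tally in the same form as the "0 / 4" figure reported for Phase 2.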

Three models, three chunk sizes each

Each model was scored at the chunk sizes that bracket its parse-OK peak. Smaller size first.

Llama: 900 · 1200 · 2000 — scored 900 vs 2000 (1200 is the peak, not directly scored)

Qwen: 2000 · 3000 · 4000 — scored 2000 vs 3000 and 3000 vs 4000

Mistral: 2000 · 3000 · 5000 — scored 2000 vs 3000 and 3000 vs 5000

Dominant failure mode: off-section drift

A larger chunk gives the model more bill context. Instead of staying on the displayed chunk, it anchors on an adjacent section — and produces specific, fully-populated JSON about the wrong part of the bill. Parse-OK validates the shape; it cannot see the mis-reference.

Additionally, Qwen emits structurally valid JSON with None values across fields, which passes parse-OK while containing no information.
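Because parse-OK only validates shape, catching drift needs some comparison against the source text. The sketch below is a rough lexical heuristic and is not the method used in this review, where drift was judged by the scoring passes: it flags an extraction whose vocabulary overlaps a neighboring chunk noticeably more than the displayed chunk. All names and the 0.1 margin are assumptions chosen for illustration.

```python
import re


def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(extraction, chunk):
    """Fraction of the extraction's distinct tokens that appear in the chunk."""
    extracted = _tokens(extraction)
    if not extracted:
        return 0.0
    return len(extracted & _tokens(chunk)) / len(extracted)


def likely_drifted(extraction, displayed_chunk, neighbor_chunks, margin=0.1):
    """True if some neighboring chunk matches the extraction better than the
    displayed chunk does, by more than `margin` (an arbitrary threshold)."""
    own = overlap(extraction, displayed_chunk)
    best_neighbor = max(
        (overlap(extraction, chunk) for chunk in neighbor_chunks),
        default=0.0,
    )
    return best_neighbor - own > margin
```

A token-overlap check like this would miss paraphrased drift and can misfire on boilerplate shared across sections, so it is at best a triage signal ahead of human or model review.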

02 / All Profiles

Each profile, standing alone

Profiles are grouped by model so that scores can be read across chunk sizes.

03 / Cross-Cutting Patterns

What the data shows

Patterns confirmed across all scored pairs.

Off-section drift is the dominant failure

Every model at every larger chunk size shows it. The model gives a confident, specific answer about a neighboring section — correct about the bill, wrong about this chunk. Accuracy and Completeness drop; Specificity stays high because the answer IS specific, just misattributed.

Qwen additionally emits structural nulls

At chunk_size=3000, Qwen returns parse_status=OK and missing=0 while all 11 fields contain the string "None." Chunk 2 of the Qwen 2000 vs 3000 pair is the clearest single case: fully valid JSON, substantively empty. Parse-OK cannot see this.
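A structural null of this kind is cheap to detect after parsing. The sketch below is one possible check, not part of the pipeline described here. The field names in the example are hypothetical, and the set of strings treated as empty is an assumption that would need tuning against real outputs.

```python
import json

# Strings treated as "no information" (an assumed list, compared
# case-insensitively after stripping whitespace).
NULL_MARKERS = {"", "none", "none.", "null", "n/a"}


def _is_empty(value):
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in NULL_MARKERS


def is_structural_null(raw_json, min_null_fraction=1.0):
    """True if the text parses as a JSON object whose fields are (almost)
    all empty markers. Unparseable text returns False: that failure is
    already visible to parse-OK and is not the case being detected here."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or not record:
        return False
    null_count = sum(_is_empty(value) for value in record.values())
    return null_count / len(record) >= min_null_fraction


# Hypothetical two-field record; real extractions have 11 fields.
sample = json.dumps({"summary": "None.", "affected_parties": "None."})
print(is_structural_null(sample))
```

Lowering `min_null_fraction` below 1.0 would also flag mostly-empty records, at the cost of flagging chunks that legitimately have little to extract.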

Completeness carries the gap

In every pair the quality gap is concentrated in Completeness. Accuracy drops slightly; Specificity barely moves (drifted answers stay specific). But Completeness collapses because the displayed chunk's actual content is never extracted.

HB4820 is a cross-model fingerprint

Both qwen_3000 and mistral_3000 drift to the same wrong section on the identical HB4820 general-definitions chunk — a veterans-housing operative provision. The off-section-drift mechanism is shared across model families at the larger chunk size.

[Figure: Quality vs parse-OK across chunk sizes, all models. Axes: substantive quality (0–2 scale) and parse-OK rate (%).]