The first evidence-based model selection before a 155-hour unattended production run on 11,673 Illinois bills. Ollama backend, single inference stream, three candidate models, five chunk sizes. This evaluation selected Mistral 7B at 500 tokens. The May 2026 vLLM re-evaluation later reversed that recommendation — read below for exactly why, and why the original call was correct given what was measurable at the time.
The production run was estimated at 155 hours of unattended, sequential inference on a single GPU. A wrong model choice is not a quick re-run — it is a week of compute and a corpus of low-quality extractions. The decision had to be made on evidence, not on a smoke test.
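As a rough sanity check on what that estimate implies per bill, the arithmetic can be sketched as follows. The two figures come from the text; the variable names are illustrative, and the even per-bill split is an assumption, since real bills vary in length.

```python
# Figures quoted in the text.
TOTAL_HOURS = 155
TOTAL_BILLS = 11_673

# Assumption: inference time is spread evenly across bills.
seconds_per_bill = TOTAL_HOURS * 3600 / TOTAL_BILLS
days_of_compute = TOTAL_HOURS / 24

print(f"{seconds_per_bill:.1f} s per bill, {days_of_compute:.1f} days of compute")
# prints "47.8 s per bill, 6.5 days of compute"
```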
Three candidate models were evaluated across five chunk sizes (500–2,800 tokens): Mistral 7B Instruct v0.3, Qwen 2.5 7B, and Llama 3.2 3B.
This evaluation pitted Llama 3.2 3B, a 3-billion-parameter model, against 7B-class models. The May 2026 re-evaluation upgraded to Llama 3.1 8B for parameter parity. That upgrade, combined with the vLLM migration, moved Llama from last place to first by 20+ percentage points. The original evaluation was not wrong; it was limited by the comparison set available at the time.
Hardware: Single NVIDIA L4, 24 GB VRAM (about 23 GB reported as usable). One GPU, one inference process, no horizontal scale-out.
Backend: Ollama, serving one request at a time with no continuous batching. Time to first token (TTFT) and inter-token latency (ITL) were not measurable on Ollama. This became a documented limitation and motivated the vLLM migration.
Pipeline: Bronze (ingest) → Silver (chunk → extract) → Gold (standardize). Each config writes to a profile-isolated Silver path — no shared state between cells.
Idempotency: Extractor flushes in batches and resumes after a kill — a precondition for multi-night unattended runs.
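A minimal sketch of the flush-and-resume pattern, assuming a JSON Lines output file keyed by a `chunk_id`. The function names, the file format, and the batch size are all hypothetical; the source does not describe how the extractor is actually implemented.

```python
import json
import os
from pathlib import Path


def load_done_ids(out_path: Path) -> set:
    """Collect the chunk ids that an earlier run already flushed to disk."""
    done = set()
    if not out_path.exists():
        return done
    with out_path.open() as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["chunk_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # torn or foreign line: that chunk is re-extracted
    return done


def run_extraction(chunks, extract, out_path: Path, batch_size: int = 32) -> int:
    """Extract every chunk not yet on disk; return how many records were written."""
    done = load_done_ids(out_path)

    # If a kill left a torn final line, terminate it so new records start cleanly.
    if out_path.exists() and out_path.stat().st_size > 0:
        with out_path.open("rb+") as fh:
            fh.seek(-1, os.SEEK_END)
            if fh.read(1) != b"\n":
                fh.write(b"\n")

    buffer, written = [], 0

    def flush():
        nonlocal written
        if not buffer:
            return
        with out_path.open("a") as fh:
            fh.writelines(json.dumps(rec) + "\n" for rec in buffer)
            fh.flush()
            os.fsync(fh.fileno())  # make the batch durable before moving on
        written += len(buffer)
        buffer.clear()

    for chunk_id, text in chunks:
        if chunk_id in done:
            continue  # already extracted by a previous run
        buffer.append({"chunk_id": chunk_id, "result": extract(text)})
        if len(buffer) >= batch_size:
            flush()
    flush()
    return written
```

With this shape, a kill loses at most one unflushed batch, and rerunning the same command picks up where the last flush ended.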
Not one experiment — a deliberate escalation. Each stage scoped to answer a question the previous stage raised. You don't spend a 15-hour run on a question a 10-minute run can answer.
The smoke test pointed at Qwen, but a larger sample later overturned that result. This is why smoke tests don't make model decisions.
With 10× the sample, the model ranking flipped: Mistral overtook Qwen. It is a textbook case for why you need a meaningful sample before drawing conclusions.
The final stage ran unattended over a weekend. A warmup phase kept cold-start bias out of the measurements. This is the primary dataset for the decision.
Qualitative scoring was blind to parse status: a structurally partial parse with strong substance outscores a clean parse with shallow answers. It measures whether the model understood the bill, not just whether it produced valid JSON.
| Model | 500 tok | 700 tok | 900 tok | 1200 tok | 2800 tok | Peak |
|---|---|---|---|---|---|---|
| Mistral 7B | 54.9% | ~48% | ~42% | ~35% | ~8% | 54.9% @ 500 |
| Qwen 2.5 7B | ~38% | ~44% | ~52% | ~35% | ~10% | ~52% @ 900 |
| Llama 3.2 3B | ~28% | ~25% | ~22% | ~18% | ~5% | ~28% @ 500 |
Parse-OK rates; values marked ~ are approximate, taken from the evaluation report. The 2,800-token collapse was model-independent: all three models lost 10–11 of 15 fields, with 60–67% EMPTY parse rates.
Every model collapsed at long context. At 2,800 tokens, EMPTY-parse rates hit 60–67% across all three models. No cell was tested between 1,200 and 2,800 tokens, so the ~2,000-token threshold is an estimate. Because the degradation is model-independent, the chunk-size decision is separable from the model decision: it is a property of small (3B to 7B) models on dense legal text at long context, not any one model's quirk.
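The separability argument can be checked directly against the table. The sketch below copies the table's values (most of them approximate), finds each model's peak chunk size, and confirms that the longest chunk is the worst cell for every model. The dictionary layout and names are illustrative.

```python
# Rates (%) copied from the table above, keyed by chunk size in tokens.
RATES = {
    "Mistral 7B":   {500: 54.9, 700: 48, 900: 42, 1200: 35, 2800: 8},
    "Qwen 2.5 7B":  {500: 38, 700: 44, 900: 52, 1200: 35, 2800: 10},
    "Llama 3.2 3B": {500: 28, 700: 25, 900: 22, 1200: 18, 2800: 5},
}


def peak(by_chunk):
    """Return (chunk_size, rate) for the best-scoring chunk size."""
    size = max(by_chunk, key=by_chunk.get)
    return size, by_chunk[size]


peaks = {model: peak(by_chunk) for model, by_chunk in RATES.items()}

# Model-independent collapse: the 2,800-token cell is the minimum for every model.
collapse_is_uniform = all(min(b, key=b.get) == 2800 for b in RATES.values())

for model, (size, rate) in peaks.items():
    print(f"{model}: peak {rate}% at {size} tokens")
print("worst cell is 2,800 tokens for every model:", collapse_is_uniform)
```

The peaks differ by model (500 versus 900 tokens) while the collapse does not, which is what lets the upper bound on chunk size be set before the model is chosen.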
| Model | Chunk size | Overall (0–2) | Notes |
|---|---|---|---|
| Qwen 2.5 7B | 900 | 1.87 | Highest single score in review |
| Mistral 7B | 500 | 1.68 | Winner at 500 tokens; lowest variance — decision basis |
| Mistral 7B | 900 | 1.58 | Close second |
| Llama 3.2 3B | 500 | ~1.2 | Last; recurring policy-domain mis-tag |
Mistral 7B: cross-configuration variance of ±0.10 between chunk sizes. No catastrophic failure mode. No saturation signature at any chunk size. For a 155-hour unattended run on a shifting corpus, predictability was weighted above peak score.
Qwen 2.5 7B: peak score of 1.87 at chunk_size=900, but cross-configuration variance of ±0.35, making it 3.5× more fragile than Mistral. Known "None." failure mode at certain chunk sizes. Retained as a candidate for later revisit.
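One way to make "predictability over peak" concrete is to compare pessimistic scores instead of best-case scores. The sketch below uses the peak and spread figures from the text; treating the ± spread as a worst-case deduction is an assumption made here for illustration, not the method the evaluation used.

```python
# Figures from the text: peak rubric score (0-2 scale) and cross-configuration spread.
CANDIDATES = {
    "Mistral 7B": {"peak": 1.68, "spread": 0.10},
    "Qwen 2.5 7B": {"peak": 1.87, "spread": 0.35},
}


def pessimistic_score(candidate):
    # Assumption: subtract the full spread from the peak as a worst case.
    return round(candidate["peak"] - candidate["spread"], 2)


fragility_ratio = round(
    CANDIDATES["Qwen 2.5 7B"]["spread"] / CANDIDATES["Mistral 7B"]["spread"], 2
)
floors = {name: pessimistic_score(c) for name, c in CANDIDATES.items()}
most_predictable = max(floors, key=floors.get)

print(f"Qwen spread is {fragility_ratio}x Mistral's")
print(floors, "->", most_predictable)
```

Under this reading Qwen wins on peak but Mistral wins on the floor, which matches the direction of the original decision.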
The USE method (Utilization, Saturation, Errors) applied to the L4 GPU. USE characterizes infrastructure health, not output quality. These are separate axes.
The L4 ran effectively saturated on a single inference stream. Pre-flight checks confirmed the GPU was idle (0% utilization) before each run. The concurrency headroom that a single stream leaves unmeasured was the primary motivation for the vLLM migration.
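A pre-flight idle check of this kind could be sketched with `nvidia-smi`'s query interface. The source does not say how its checks were implemented, so the function names and the zero-percent threshold here are assumptions; the `run` parameter exists only so the parsing can be exercised without a GPU.

```python
import subprocess

# nvidia-smi prints one utilization value per GPU with these flags.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def gpu_utilization(run=subprocess.run):
    """Return the utilization percentage reported for each visible GPU."""
    proc = run(QUERY, capture_output=True, text=True, check=True)
    return [int(line) for line in proc.stdout.splitlines() if line.strip()]


def gpu_is_idle(run=subprocess.run, max_percent=0):
    """True when every GPU reports utilization at or below max_percent."""
    return all(u <= max_percent for u in gpu_utilization(run))
```

A run script might call `gpu_is_idle()` and refuse to start when it returns `False`, so that a leftover process cannot contaminate the measurements.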
The Qwen 2.5 × 1,200-token cell is a textbook saturation signature: p99 latency 137s vs 13–16s for healthy cells, throughput collapsed 7–9×, 32 of 165 chunks lost. Reproduced three times. p99 detaching from median while throughput collapses = work arriving faster than the resource can drain it.
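The signature described above can be turned into a simple check. This sketch compares a cell against a healthy baseline cell instead of against its own median, because the text does not report medians. The p99 values are taken from the text; the throughput values and both thresholds are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    name: str
    p99_s: float           # 99th-percentile latency in seconds
    chunks_per_min: float  # throughput


def looks_saturated(cell, healthy, p99_factor=5.0, throughput_factor=3.0):
    """Flag a cell whose tail latency blew up while its throughput collapsed."""
    tail_blown = cell.p99_s >= p99_factor * healthy.p99_s
    throughput_collapsed = cell.chunks_per_min * throughput_factor <= healthy.chunks_per_min
    return tail_blown and throughput_collapsed


# p99 figures from the text; throughput numbers are hypothetical (an 8x collapse).
healthy = CellStats("healthy cell", p99_s=16.0, chunks_per_min=8.0)
suspect = CellStats("Qwen 2.5 @ 1,200 tokens", p99_s=137.0, chunks_per_min=1.0)

print(suspect.name, "saturated:", looks_saturated(suspect, healthy))
```

Requiring both conditions avoids flagging a cell that is merely slow but still draining work at a steady rate.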
Outside the Qwen 1,200-token saturation cell, errors were dominated by parse failures, not crashes. Parse status (OK / PARTIAL / EMPTY) was promoted to a first-class signal so this class of error is counted, not absorbed.
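Treating parse status as a first-class signal might look like the sketch below. The three status names come from the text; the classification rules and the field names are assumptions, since the source does not define where OK ends and PARTIAL begins.

```python
import json
from collections import Counter
from enum import Enum


class ParseStatus(Enum):
    OK = "OK"
    PARTIAL = "PARTIAL"
    EMPTY = "EMPTY"


# Hypothetical field names; the real schema is not listed in the text.
EXPECTED_FIELDS = ("title", "sponsor", "policy_domain")


def classify(raw, expected=EXPECTED_FIELDS):
    """Assumed rules: all fields filled is OK, some is PARTIAL, none is EMPTY."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ParseStatus.EMPTY
    if not isinstance(data, dict):
        return ParseStatus.EMPTY
    filled = [f for f in expected if data.get(f) not in (None, "", [], {})]
    if len(filled) == len(expected):
        return ParseStatus.OK
    return ParseStatus.PARTIAL if filled else ParseStatus.EMPTY


def status_counts(raw_outputs):
    """Count every output by status so parse failures are reported, not hidden."""
    return Counter(classify(raw) for raw in raw_outputs)
```

For example, `status_counts` over a cell's raw model outputs gives the OK, PARTIAL, and EMPTY tallies that the parse-rate tables are built from.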
TTFT, ITL, KV cache utilization, and per-stream decode throughput were not measurable on Ollama's single-stream architecture. These metrics — used extensively in the May 2026 vLLM evaluation — were explicitly flagged as a known limitation at evaluation time.
A characterization is only as credible as its statement of what it did not prove.
Several fields came back empty for every model at every chunk size. A uniform failure points at the prompt or parser, not any model. Flagged as the highest-value pre-production action item. The May 2026 evaluation refined the extraction schema to 15 fields with clearer separation between binary and substantive fields.
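The uniform-failure test described above is easy to automate. A minimal sketch, assuming per-cell fill rates are available as a mapping from `(model, chunk_size)` to per-field fractions; the data layout, field names, and sample values are hypothetical.

```python
def uniformly_empty_fields(fill_rates):
    """Return fields that are empty in every (model, chunk_size) cell.

    fill_rates maps (model, chunk_size) -> {field: fraction of chunks with a value}.
    """
    cells = list(fill_rates.values())
    if not cells:
        return set()
    fields = set().union(*cells)
    return {f for f in fields if all(cell.get(f, 0.0) == 0.0 for cell in cells)}


# Hypothetical sample: "fiscal_note" never fills, whatever the model or chunk size.
sample = {
    ("Mistral 7B", 500): {"title": 0.9, "sponsor": 0.7, "fiscal_note": 0.0},
    ("Qwen 2.5 7B", 900): {"title": 0.8, "sponsor": 0.0, "fiscal_note": 0.0},
    ("Llama 3.2 3B", 500): {"title": 0.6, "sponsor": 0.4, "fiscal_note": 0.0},
}

print(uniformly_empty_fields(sample))
```

A field that one model fills and another does not is a model difference; a field in this set is a prompt or parser defect and is worth fixing before comparing models at all.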
Throughput numbers from this evaluation are a valid floor, not a capacity model. The May 2026 concurrency sweep found that vLLM at c=24 delivers 9× Ollama's throughput — a gain completely invisible here. Mistral's p99 explosion under concurrency (15s → 74s) was also invisible on Ollama.
This evaluation compared Llama 3.2 3B against Mistral and Qwen at 7B. A 3B model finishing behind 7B models is expected and should not be interpreted as a statement about the Llama model family. When the May 2026 evaluation upgraded to Llama 3.1 8B for parameter parity, Llama scored 37+ percentage points above Mistral's original winning result.
Rubric scores are expert judgments, not measurements against a gold standard. The May 2026 evaluation addressed this with two independent Opus 4.8 passes (high effort then max effort), finding 60/60 agreement and 0 revisions.
Change 1 (backend): Ollama → vLLM. Ollama cannot expose per-stream TTFT, ITL, KV cache utilization, or real concurrency behavior. Moving to vLLM 0.6.6 made those measurements possible for the first time. The most consequential discovery: Mistral's p99 latency grows roughly 5× under concurrency (15s → 74s), driven by KV cache head-of-line blocking from 3,000-token prompts. This disqualifying failure mode was completely invisible to the Ollama evaluation.
Change 2 (model): Llama 3.2 3B → Llama 3.1 8B. Upgrading to a parameter-parity model changed the answer. Llama 3.1 8B at chunk_size=1,200 achieves 92.5% parse-OK, 37.6pp above Mistral's original winning score of 54.9%. The original evaluation was not wrong to rank Llama last, because a 3B model losing to 7B models is expected. It would have been wrong to draw conclusions about the Llama family from that result.
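The two headline figures from these changes follow directly from the numbers quoted above. A short check, with illustrative variable names:

```python
# Figures quoted in the text.
llama_parse_ok = 92.5            # Llama 3.1 8B, chunk_size=1,200, May 2026
mistral_original = 54.9          # Mistral 7B, 500 tokens, original evaluation
p99_before_s, p99_after_s = 15, 74

gain_pp = round(llama_parse_ok - mistral_original, 1)
p99_growth = round(p99_after_s / p99_before_s, 1)

print(f"gain: {gain_pp} pp, p99 growth: {p99_growth}x")
# prints "gain: 37.6 pp, p99 growth: 4.9x"
```

Note that the gain compares scores from two different evaluations on two different backends, so it describes how far the answer moved, not a head-to-head margin.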
Neither change corrected a methodology failure; both lifted resource constraints that applied at evaluation time. The lesson: the answer you get depends on the question you can ask, and the question you can ask depends on your measurement stack.
The complete May 2026 vLLM evaluation (six experiments, 126 cells, qualitative review, SLO analysis) is documented in the main case study. Current recommendation: Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24.