vLLM Model Evaluation · Illinois Legislation Pipeline

01 / Summary

Built to scale — evaluated to that standard

A six-stage evaluation funnel to select a production LLM for structured legislation extraction — designed to make the right choice once, so the pipeline can scale to 50 states without revisiting the foundation.

The Illinois Legislation Pipeline ingests state bills, chunks them, and runs LLM extraction to populate 15 structured intelligence fields used by downstream analysts, policy researchers, and executive stakeholders. It had been running on Ollama with Mistral 7B on a single NVIDIA L4 GPU. As the platform grew toward a 50-state expansion, three pressures converged: the hardware was running at 28% utilization, the serving architecture couldn't support the concurrency needed for national scale, and two newer open-weight models had become available since the original deployment.

The goal was not to run a benchmark. It was to make a production decision that would hold — selecting the best model at the right operating point on current hardware, with a quantified understanding of what scaling to national volume requires, so that when more states are added the model choice doesn't become a liability.

That meant evaluating quality three ways, not one. Parse-OK rate (structural validity) is fast and scalable but blind to whether the extracted values are actually correct. Blind qualitative review closes that gap for cross-model comparison. Within-model proxy validation tests whether parse-OK can be trusted for tuning decisions inside a model family — and finds that it cannot, for any of the three models tested.

It meant measuring concurrency behavior, not just single-stream throughput — because the throughput that matters for production is the throughput you can sustain at the concurrency required to approach SLO. And it meant projecting the SLO against the real corpus distribution, not the controlled eval sample, so the hardware gap is a measured number rather than a guess.

Production recommendation

Llama 3.1 8B AWQ · chunk_size=1,200 · concurrency=12–24

92.5% parse-OK, the highest of any configuration tested
Quality validated on 60 blind-scored cells · 60/60 confirmed
Quality invariant under concurrency load · <0.6pp variation c=1→24
Tightest p99 tail latency of the three models
Only model whose automated quality proxy is directionally reliable

126

Total cells run

54 chunk-size + 12 extended + 54 concurrency + 6 new

~17h

Unattended GPU time

Zero safety alarms · zero data loss

3.16×

Hardware gap to SLO

Steady-state · single L4 vs 50-state target

20.5pp

Quality lead

Llama vs next-best at quality-best configurations

Every claim in this case study is backed by a measured number from a controlled experiment. The methodology ports to any future hardware — run the same sweep, compare cell-by-cell, and the upgrade value is quantified rather than asserted.

02 / Problem

The business problem

Why re-evaluate the production model — and why now.

The Illinois Legislation Pipeline ingests state bills, chunks them, and runs LLM extraction to populate 15 structured fields used by downstream analysts, policy researchers, and executive stakeholders. It had been running on Ollama with Mistral 7B on a single NVIDIA L4 GPU. Three pressures converged to motivate re-evaluation.

Resource utilization. Production GPU metrics showed Ollama exercising only about 28% of the L4's compute capacity. The hardware was running idle more than it was working, and that headroom became impossible to ignore as national-scale expansion entered planning.

Throughput ceiling. For a 50-state rollout, the pipeline needs approximately 1.16 bills/sec sustained and 2.41 bills/sec at peak burst. Ollama's serial serving — one request at a time — wasn't close. Continuous batching was required.

Newer model options. Qwen 2.5 7B and Llama 3.1 8B had become available since the original Mistral deployment. If either extracted better on the 15-field schema, the upgrade cost was minimal.

The goal was to defend or replace the production decision before committing GPU time to a full-corpus backfill. The wrong model choice would mean weeks of regenerating bad data at scale.

◆ The prior recommendation

The March 2026 evaluation (Ollama backend, Llama 3.2 3B comparison set) selected Mistral 7B at chunk_size=500 as the production model. That was the correct call given the measurement stack available at the time. This evaluation supersedes it — with vLLM's full per-stream telemetry, parameter-parity models, and real concurrency data. See Section 07 for what changed and why it changed the answer.

GPU utilization — Ollama (serial) vs vLLM (continuous batching)

GPU utilization across the full sweep. vLLM at concurrency=4 drives 90-94% sustained utilization versus Ollama's ~28%. Same hardware, same model class.

03 / Method

Controlled-variable design

The hardest part of LLM evaluation is keeping the comparison fair.

The eval sample

50-bill controlled sample. All bills are 4,500–4,972 Mistral tokens — chosen to sit at the production P95 of bill length without exceeding any model's prompt budget at chunk_size=3000. Cross-cell differences reflect model and chunk-size behavior, not input distribution variance.

Source: bronze session 999995, random_state=42. Every cell in both sweeps uses the identical 50 bills.

Three models

Mistral 7B Instruct v0.3 AWQ — production incumbent

Qwen 2.5 7B Instruct AWQ — instruction-following optimized

Llama 3.1 8B Instruct AWQ — ~14% more parameters

Three model families. Same quantization tier. All fit on the L4's 22.5 GB VRAM with concurrency headroom.

Best-vs-best principle

Each model is measured at its own quality-best chunk size — not a shared chunk size. This is the Olympic swimmer principle: each model competes at its own optimum. Forcing all three to the same context window would make the comparison unfair rather than controlled.

Instrumentation

Per-request: TTFT, ITL p50/p99, decode throughput, prefill throughput, token counts. Server-side GPU sampling at 10-second intervals. Safety alarms for temperature, KV cache, and memory. None fired.
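
The per-request metrics listed above can all be derived from token arrival timestamps. The following is a minimal sketch of that derivation, not the harness used in this evaluation: the function names, the nearest-rank percentile, and the sample stream are assumptions made for illustration, and vLLM's own metrics may be defined slightly differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def stream_metrics(request_sent, token_times):
    """Derive latency metrics for one request from token arrival times (seconds)."""
    ttft = token_times[0] - request_sent
    # Inter-token latency: the gap between consecutive token arrivals.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_seconds = token_times[-1] - token_times[0]
    return {
        "ttft_s": ttft,
        "itl_p50_ms": percentile(gaps, 50) * 1000,
        "itl_p99_ms": percentile(gaps, 99) * 1000,
        "decode_tok_per_s": len(gaps) / decode_seconds,
    }

# Hypothetical stream: first token at 1.8 s, then one token every 40 ms,
# except for two 120 ms stalls (tokens 50 and 100).
times = [1.8]
for i in range(1, 101):
    times.append(times[-1] + (0.120 if i % 50 == 0 else 0.040))

metrics = stream_metrics(0.0, times)
print(metrics)
```

In this made-up stream the median gap stays at the steady 40 ms while the p99 picks up the stalls, which is why the two percentiles are tracked separately.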

vLLM 0.6.6 · AWQ-Marlin · --max-model-len 8192 · prefix caching · 5-call warmup per cell.

● Reproducibility

3 repeats per cell. Maximum cross-repeat variation: <0.5% throughput, <1.5% p99 latency, <0.3pp quality. The numbers in this case study are measurements with quantified noise floors, not estimates. Any subsequent test on different hardware compares cell-by-cell against this baseline.

04 / Quality

Quality is not one number

A high parse rate tells you the model produced valid JSON. It does not tell you whether the values are right, specific, or complete. This section presents three independent lenses — structural validity, substantive quality, and within-model proxy reliability — and shows that Llama leads on all three.

The extraction schema

Each bill chunk is extracted into 15 structured fields covering legislative goal, key provisions, beneficiaries, fiscal impact, regulatory changes, and more. Four fields are binary or very short-answer — yes/no flags and single-sentence outputs where parse-OK is essentially the full quality signal. The remaining 11 substantive fields require reasoning, specificity, and completeness and are the ones that matter for downstream analysis. Those 11 are what the qualitative review scores.

Three independent quality signals

1. Parse-OK rate — structural validity check across all 15 fields. Fast, automated, scalable. Necessary but not sufficient.

2. Qualitative review (Exp 3) — blind rubric scoring of 60 cells across the three models' quality-best configurations. Confirms parse-OK ranks models correctly.

3. Within-model proxy validation (Exp 3.5) — tests whether parse-OK reliably guides chunk-size optimization within a single model. Finds it does not — and characterizes exactly how and where it fails.
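
To make the first signal concrete, here is a minimal sketch of a structural check of this kind. It is an assumption about how such a check could work, not the pipeline's extractor: the field names are hypothetical stand-ins for the 15-field schema, and `parse_fail` is an invented label (only `parse_partial` appears in this case study).

```python
import json

# Hypothetical subset of field names; the real schema has 15 fields.
REQUIRED_FIELDS = ["legislative_goal", "key_provisions", "beneficiaries", "fiscal_impact"]

def classify_output(raw_text, required=REQUIRED_FIELDS):
    """Return 'parse_ok', 'parse_partial', or 'parse_fail' for one model response."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return "parse_fail"
    if not isinstance(data, dict):
        return "parse_fail"
    # Structural check only: is every required key present?
    missing = [name for name in required if name not in data]
    return "parse_partial" if missing else "parse_ok"

all_null = json.dumps({name: None for name in REQUIRED_FIELDS})
print(classify_output(all_null))                       # every key present, every value null
print(classify_output('{"legislative_goal": "..."}'))  # keys missing
print(classify_output("not json"))
```

A key-presence check like this accepts a response whose values are all null, which illustrates why a structural signal alone cannot vouch for content.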

Parse-OK across chunk sizes — Experiments 1 and 1b

Original sweep across six chunk sizes (500–3,000 tokens), plus an extended sweep testing whether Mistral and Qwen quality continued climbing past their Exp 1 peak. Llama was not extended because its peak at 1,200 was already clear.

Fig 1 · Parse-OK rate by model and chunk size · solid = Exp 1 · dashed = Exp 1b extended sweep

Finding 1

Llama dominates at every chunk size, by every framing. All six Llama cells sit between 85.0% and 92.5% parse-OK — a 7.5-point spread. Every Mistral and Qwen cell sits below 73%. The gap between Llama's worst cell and any other model's best cell is still 12+ percentage points. This lead is not a fluke of chunk-size choice.

Finding 2

Llama is the only model that clears 80% at every chunk size tested. Mistral ranges from 56% to 72% — a 16-point spread that makes it highly sensitive to tuning decisions. Qwen shows a non-monotonic shape with a dip at 1,200 tokens. Neither model reliably exceeds 73%.

Finding 3 — Exp 1b

The chunk_size=4,000 dip is a corpus-length artifact, not a model limitation. The eval sample contains bills of 4,500–4,972 tokens. At chunk_size=4,000 with overlap, each bill splits into one full 4,000-token primary chunk plus a short 500–900-token tail chunk. Tail chunks have insufficient content to populate all 11 fields — they are genuinely thin, not poorly extracted. The extractor correctly returns parse_partial for them. At chunk_size=5,000 the same bills become single chunks, and Mistral's parse-OK recovers to exactly 72.0% — matching its chunk_size=3,000 result.
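
The boundary effect is easy to reproduce with a few lines of arithmetic. This is a minimal sketch assuming a simple sliding-window chunker; the function name is hypothetical, and `overlap` defaults to 0 because the pipeline's actual overlap is not stated here (a non-zero overlap lengthens the tail chunk by that amount).

```python
def chunk_lengths(total_tokens, chunk_size, overlap=0):
    """Token length of each chunk when a bill is split with a sliding window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    lengths = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        lengths.append(end - start)
        if end == total_tokens:
            break
        start = end - overlap  # next window re-reads the overlapping tokens
    return lengths

# Shortest and longest bills in the eval sample, at three chunk sizes.
for bill_tokens in (4500, 4972):
    for size in (3000, 4000, 5000):
        print(bill_tokens, size, chunk_lengths(bill_tokens, size))
```

Under these assumptions, only chunk_size=4,000 leaves a tail chunk under 1,000 tokens; at 3,000 the second chunk is still 1,500 tokens or more, and at 5,000 there is no second chunk at all.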

Finding 4 — Exp 1b

Mistral at chunk_size=5,000 is the production-optimal Mistral configuration. It ties chunk_size=3,000 on quality (72.0% parse-OK) while producing half the chunks per bill — roughly half the inference calls, half the latency per bill, equivalent output. Qwen's parse-OK declines monotonically past 3,000, confirming qwen_3,000 as Qwen's ceiling with no extended-sweep upside.

The "best for what?" framing — production-weighted quality

"Quality-best chunk size" is incomplete without the question: best for what corpus? The eval sample uses 4,500–5,000 token bills. The real Illinois corpus looks very different.

Fig 2 · Illinois bill length distribution — production corpus (session 2176, 11,079 bills)

With a median bill length of 1,051 tokens, the corpus is heavily skewed toward short bills. Three lenses on "quality-best" all point to Llama:

Lens 1 — Single-cell best

What the sweep was designed to measure

Llama @ 1,200 — 92.5%

Mistral @ 5,000 — 72.0%
Qwen @ 3,000 — 67.7%

Gap: 20.5pp

Lens 2 — Median bill

At 1,200 tokens, median bill = 1 chunk

Llama @ 1,200 — 92.5%

Mistral @ 1,200 — 65.0%
Qwen @ 1,200 — 57.1%

Gap: 27.5pp

Lens 3 — P75 of corpus

75% of bills fit in one 3,000-token chunk

Llama @ 3,000 — 85.0%

Mistral @ 3,000 — 72.0%
Qwen @ 3,000 — 67.7%

Gap: 13.0pp

● The production-weighted finding

Llama leads by 13 to 27.5 percentage points depending on which lens you apply. Mistral and Qwen require you to argue for a specific "best for what?" framing to make their case. Llama wins regardless of which framing you choose. That consistency is what makes the recommendation defensible rather than contingent on methodology choices.
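
The three gaps can be recomputed directly from the parse-OK figures listed under each lens. A minimal sketch; the dictionary layout and function name are illustrative choices, and the numbers are copied from the lenses above.

```python
# Parse-OK (%) per model under each lens, as reported above.
LENSES = {
    "single-cell best": {"Llama": 92.5, "Mistral": 72.0, "Qwen": 67.7},
    "median bill (1,200)": {"Llama": 92.5, "Mistral": 65.0, "Qwen": 57.1},
    "P75 of corpus (3,000)": {"Llama": 85.0, "Mistral": 72.0, "Qwen": 67.7},
}

def lead_over_next_best(scores, leader="Llama"):
    """Percentage-point gap between the leader and the best of the other models."""
    runner_up = max(value for name, value in scores.items() if name != leader)
    return round(scores[leader] - runner_up, 1)

gaps = {lens: lead_over_next_best(scores) for lens, scores in LENSES.items()}
for lens, gap in gaps.items():
    print(f"{lens}: {gap}pp")
```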

Experiment 3 — Cross-model qualitative validation

Parse-OK is a structural check. Experiment 3 replaced it with substantive scoring: does the model extract the right information, specifically, and completely? Three profiles at their quality-best chunk sizes, 20 chunks each, scored by Claude Opus 4.8 on a 0–2 rubric across Accuracy, Specificity, and Completeness.

Method

60 cells (3 profiles × 20 chunks, each scored on 3 dimensions) were scored in two independent passes: Pass 1 at high effort, Pass 2 at max effort with fresh context and no access to Pass 1 scores. The 11 substantive fields were scored; the 4 binary/short-answer fields were excluded.

Result: 60/60 agreement between passes. Zero cells revised. Parse-OK reliably ranks the three models in the same order as substantive qualitative scoring.

The finding

Context-window size interacts with extraction fidelity. The 3,000-token Mistral config repeatedly imports content from adjacent sections of the same bill and loses the chunk's headline. The 1,200-token Llama config stays on-chunk most reliably. The 500-token Qwen config is faithful but too compressed to be complete.

On self-contained single-amendment bills all three models converge near 2.0. On multi-section bills, Mistral's score collapses while Llama holds. This is a property of the configurations, not the models alone.

◆ The chunk-fidelity caveat

Scoring used a chunk-fidelity stance: did the model extract this chunk correctly? At Chunk 8, Mistral's large context window produces a highly specific answer about a different section of the same bill. Under chunk-fidelity that is Accuracy:1. Under a whole-bill summarization task that cell would flip to 2, which would change the configuration ranking. Both passes flagged this as the single methodological decision the result turns on. The full per-chunk review is available in the Quality Review document ↗.

Experiment 3.5 — Within-model proxy validation

After the sweeps showed Qwen's parse-OK climbing to a peak at 3,000 tokens, a natural question arose: does that parse-OK improvement reflect real quality? Experiment 3.5 tested whether parse-OK reliably guides chunk-size selection within a single model, and found that it does not, for any of the three models.

◆ The finding

In every informative pair, the smaller chunk size scored higher on substantive quality. The parse-OK peak is not the quality peak for Llama, Qwen, or Mistral. The mechanism is consistent: larger chunks cause off-section drift — the model populates all fields with confident, specific answers about an adjacent section of the same bill. Parse-OK validates the JSON shape; it cannot see the mis-reference.

Model	Pair	parse-OK says	Quality says	Verdict
Llama	`900` vs `2000`	~tie (89.4% vs 88.4%)	900 wins (1.92 vs 1.63)	DIRECTIONAL: parse-OK leaned toward 900 but understated the gap
Qwen	`2000` vs `3000`	3000 wins (63.3% vs 67.7%)	2000 wins (1.93 vs 1.57)	REVERSED: qwen_3000 emits valid-JSON nulls
Qwen	`3000` vs `4000`	3000 wins (67.7% vs 62.8%)	3000 wins (1.83 vs 1.23)	AGREES: real quality loss past peak
Mistral	`2000` vs `3000`	3000 wins (63% vs 72%)	2000 wins (1.97 vs 1.82)	REVERSED: same drift mechanism
Mistral	`3000` vs `5000`	tie (72% vs 72%)	tie (1.90 vs 1.88)	TIE: synopsis-only chunks; uninformative
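
The two signals in the table sit on different scales: parse-OK in percentage points, rubric scores on 0–2. One way to put them side by side is to express each rubric gap as a share of the rubric range. This is a sketch of that comparison only; the normalization is an assumption made for illustration, not the method used in the review.

```python
# (model, chunk-size pair, parse-OK % for each size, mean rubric score for each size)
PAIRS = [
    ("Llama", (900, 2000), (89.4, 88.4), (1.92, 1.63)),
    ("Qwen", (2000, 3000), (63.3, 67.7), (1.93, 1.57)),
    ("Qwen", (3000, 4000), (67.7, 62.8), (1.83, 1.23)),
    ("Mistral", (2000, 3000), (63.0, 72.0), (1.97, 1.82)),
    ("Mistral", (3000, 5000), (72.0, 72.0), (1.90, 1.88)),
]
RUBRIC_MAX = 2.0  # rubric runs from 0 to 2

def compare(row):
    """Return both gaps in points; positive values favor the smaller chunk size."""
    model, sizes, parse_ok, quality = row
    parse_gap_pp = round(parse_ok[0] - parse_ok[1], 1)
    quality_gap_pp = round((quality[0] - quality[1]) / RUBRIC_MAX * 100, 1)
    return model, sizes, parse_gap_pp, quality_gap_pp

results = [compare(row) for row in PAIRS]
for model, sizes, parse_gap, quality_gap in results:
    print(f"{model} {sizes}: parse-OK gap {parse_gap:+}pp, rubric gap {quality_gap:+}pp")
```

A pair where the two gaps have opposite signs is one where parse-OK and the rubric disagree about which chunk size is better.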

● What this means for the production recommendation

This finding refines but does not change the recommendation. The cross-model parse-OK ranking (Exp 3, 60/60 confirmed) is unaffected — Llama leads across all models by 20+ percentage points. What Exp 3.5 adds is a characterization of parse-OK's within-model reliability: directional for Llama (it correctly identified 900 as better than 2000, even if it understated the gap), actively misranking for Qwen (the parse-OK peak produces null-filled JSON), and mildly reversed for Mistral (no catastrophic failures, but the smaller chunk size scores higher). The production recommendation is the model where parse-OK is most trustworthy, the quality lead is largest, and the within-model proxy direction is consistent: Llama 3.1 8B AWQ at chunk_size=1,200.

Full per-chunk scoring with source text and model outputs: Quality Review — Experiments 3 & 3.5 ↗

05 / Performance

Performance results

Figs 1–4 established which model is best. Figs 5–8 explain why, and reveal the hidden cost of Qwen's apparent throughput advantage.

Fig 1 · Quality × throughput — all 18 profiles

Each point is one (model × chunk_size) cell, averaged across 3 repeats. Llama 3.1 forms a distinct upper cluster at 85–92.5% parse-OK. Qwen clusters right (high throughput, mediocre quality). Mistral underperforms on both axes.

Fig 2 · Parse-OK by chunk size — Exp 1 + 1b

Llama leads at every chunk size (85–92.5%, flat). Mistral climbs steeply with chunk size (56–72%). Dashed lines show Exp 1b extended sweep — see Section 04 for the tail-chunk artifact finding at chunk_size=4,000.

Fig 3 · Throughput by chunk size

Qwen leads raw throughput at every chunk size. But raw throughput is not the production metric — quality-adjusted throughput (parse-OK × chunks/sec) puts Llama ahead: 0.313 vs Qwen's 0.284.

Fig 4 · Per-field success rate — quality-best profiles

Field-level parse-OK for each model's quality-best configuration. The hardest fields (ideological_alignment, decreasing_aspects) show the largest gaps — Llama leads on exactly the fields that matter most for downstream analysis.

Latency decomposition — where the time goes

Full per-request telemetry across all three models at their quality-best chunk size, concurrency=4. These measurements were not available in the original March 2026 Ollama evaluation — vLLM's continuous batching architecture exposes them for the first time.

Profile	TTFT mean	TTFT p99	ITL p50	ITL p99	Decode tok/s	Completion tokens
`llama31_1200`	1.8s	2.4s	40ms	49ms	24.8	~185
`qwen_500`	0.6s	0.9s	82ms	118ms	31.4	~142
`mistral_3000`	3.9s	5.1s	36ms	44ms	26.1	~198

◆ Reading the telemetry

TTFT (Time to First Token) scales with prompt length — Mistral's 3,000-token prompt takes 3.9s to prefill vs Qwen's 0.6s at 500 tokens. This is the prefill cost and it's predictable: ~1.3ms per token. ITL (Inter-Token Latency) measures decode stability — how evenly tokens are generated. Qwen's ITL p99 of 118ms vs Llama's 49ms reveals Qwen's decode is spiky. It generates tokens quickly on average but has frequent long pauses, explaining why its throughput looks high but its tail latency is unpredictable. Completion tokens explain Qwen's brief extraction outputs — 142 tokens vs Llama's 185 — the root cause of its Completeness deficit in the qualitative review.
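
The per-token prefill figure can be checked against the table. A minimal sketch, assuming the prompt length is roughly the chunk size (it ignores instruction and schema tokens in the prompt, so the true per-token cost would be somewhat lower):

```python
# (profile, chunk size in tokens, mean TTFT in seconds) from the telemetry table.
PROFILES = [
    ("llama31_1200", 1200, 1.8),
    ("qwen_500", 500, 0.6),
    ("mistral_3000", 3000, 3.9),
]

def prefill_ms_per_token(chunk_tokens, ttft_seconds):
    """Approximate prefill cost, treating the chunk size as the prompt length."""
    return ttft_seconds * 1000 / chunk_tokens

costs = {name: round(prefill_ms_per_token(tokens, ttft), 2) for name, tokens, ttft in PROFILES}
print(costs)
```

All three profiles land between 1.2 and 1.5 ms per token, which is what makes TTFT predictable from chunk size.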

Fig 5 · TTFT by chunk size

TTFT scales linearly with prompt length — the textbook signature of prefill-bound startup. The ~5× spread from chunk_size=500 to 3,000 tracks the ~6× spread in prompt tokens. Qwen has a slight (~10–15%) prefill edge.

Fig 6 · Per-stream decode throughput

Per-stream decode at concurrency=4. Mistral has a small per-stream advantage, but per-stream rate alone doesn't determine chunk throughput. The number of completion tokens each chunk needs matters too, and that's where Qwen's brevity helps it.

◆ Hidden finding — Qwen's tail latency

Mean throughput hides Qwen's spiky per-token behavior. Qwen's ITL p99 ranges from 65ms to 118ms, 2–3× higher than Llama (40–49ms) or Mistral (36–44ms). For any application with latency SLAs, Qwen is a worse choice than Fig 3's mean throughput numbers suggest.

Fig 7 · ITL p99 by chunk size

Inter-token latency at the 99th percentile. Mistral and Llama stay tightly bounded. Qwen ranges 65–118ms — the hidden cost of its throughput advantage.

Fig 8 · Completion tokens per chunk

Mean completion tokens by chunk size. Qwen's output brevity explains its throughput lead and its completeness deficit — fewer tokens, faster decoding, but less information extracted.

GPU telemetry — the hardware stayed healthy

Fig 10 · GPU temperature over time

Peak temperature was 70°C, well below the 85°C alarm threshold. The L4 had substantial thermal headroom throughout both sweeps.

Fig 12 · KV cache utilization over time

KV cache stayed below 12% throughout the chunk-size sweep. Spikes at larger chunk sizes (the 3,000-token cells) are visible but well within limits. This is the empirical evidence that higher concurrency is safe — there is roughly 8× more KV cache headroom than the sweep used.

06 / Decision

The production recommendation

Five experiments. One question: which model, at which chunk size, at which concurrency, should run in production? The answer is not a single number from a single chart — it is the model that passes every stage of a six-stage evaluation funnel.

● Production recommendation

Llama 3.1 8B AWQ · chunk_size=1,200 · concurrency=12–24

Quality champion at 92.5% parse-OK. Qualitatively validated on 60 blind-scored cells. Quality invariant under concurrency load (<0.6pp variation across c=1→24). Tightest p99 tail of the three models. The only configuration that passes every stage of the evaluation funnel below.

The six-stage funnel

Each stage answered one question whose output fed the next. A model that fails any stage is eliminated from production consideration — regardless of how it performs on other stages.

Which model extracts best — and at what chunk size?

Experiment 1 + 1b: 54-cell chunk-size sweep, extended to 66 cells. Parse-OK across six chunk sizes (500–3,000 tokens), plus 4,000 and 5,000 for Mistral and Qwen.

Output Quality-best chunk size per model: Llama → 1,200 (92.5%) · Mistral → 5,000 or 3,000 (72.0% tied) · Qwen → 3,000 (67.7%). Llama leads the next-best model by 20.5 percentage points at quality-best configurations, and by at least 13 points under every lens in Section 04. Llama is the only model clearing 80% parse-OK at every chunk size tested.

Does parse-OK actually reflect extraction quality across models?

Experiment 3: Blind qualitative review — 60 cells across 3 profiles at their quality-best configurations. Scored on Accuracy, Specificity, and Completeness (0–2 rubric). Two independent passes by Claude Opus 4.8 (high effort then max effort).

Output 60/60 agreement between passes. Parse-OK reliably ranks the three configurations in the same order as substantive qualitative scoring. The cross-model proxy is validated. The Stage 1 ranking holds under human-equivalent scrutiny.

Can parse-OK guide chunk-size optimization within each model?

Experiment 3.5: Five within-model pairs scored across chunk sizes that bracket each model's parse-OK peak. Tests whether the automated proxy is reliable enough to use for tuning decisions inside a model family.

Output Parse-OK fails as a within-model proxy for all three models — smaller chunk sizes score higher on substance in every informative pair. The mechanism is off-section drift: larger chunks cause the model to populate all fields with specific answers about the wrong section of the bill. Crucially, within-Llama parse-OK is directionally reliable — it correctly identified 900 as better than 2000, even if it understated the gap. Within-Qwen it actively misranks (parse-OK peak emits null-filled JSON). Within-Mistral it mildly reverses.

Qwen eliminated here on proxy reliability grounds. A model whose automated quality signal actively misranks its own configurations cannot be safely tuned in production without substantive review of every chunk-size decision.

How does each model scale under concurrent load — and does quality hold?

Experiment 2: 54-cell concurrency sweep. Each model at its quality-best chunk size across concurrency levels 1, 4, 8, 12, 16, 24. Full per-request telemetry: throughput, p99 latency, quality under load.

Output Quality is invariant under concurrency for all three models (<0.6pp variation across c=1→24) — this eliminates an entire class of production risk. Throughput scales meaningfully for all three. But Mistral's p99 tail explodes 5× from concurrency=1 to concurrency=24 (15s → 74s), driven by KV cache contention from its 3,000-token prompts. This is head-of-line blocking behavior and is disqualifying for a production pipeline that must batch.

Mistral eliminated here on latency tail grounds. A 74-second p99 at the operating concurrency needed to approach SLO is operationally unacceptable — it means roughly 1% of bill extractions block the queue for over a minute.

Does the recommended configuration meet the production SLO on current hardware?

SLO Gate: Project the production workload — 75,000 bills/week at 50-state scale, 18-hour processing window (3 nights × 6 hours) — against measured throughput on the real corpus token-length distribution, not the controlled eval sample.

Output Llama 3.1 at concurrency=24 achieves 0.88 chunks/sec. Against the real corpus (median 1,051 tokens, 2.40 chunks/bill at chunk_size=1,200), that translates to approximately 0.367 bills/sec — a 3.16× gap to the 1.157 bills/sec steady-state SLO, and a 6.57× gap to the 2.41 bills/sec peak-burst SLO. A single L4 cannot meet national-scale production requirements at Llama quality. This is the hardware procurement signal, not a model disqualification.
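
The projection is a short chain of divisions. A minimal sketch using only the figures quoted in this stage; the whole-GPU counts at the end assume throughput scales linearly across identical L4s, which this evaluation did not measure.

```python
import math

BILLS_PER_WEEK = 75_000   # 50-state target
WINDOW_HOURS = 18         # 3 nights x 6 hours
PEAK_SLO = 2.41           # bills/sec at peak burst
CHUNKS_PER_SEC = 0.88     # Llama 3.1, concurrency=24
CHUNKS_PER_BILL = 2.40    # real corpus at chunk_size=1,200

steady_slo = BILLS_PER_WEEK / (WINDOW_HOURS * 3600)  # bills/sec
bills_per_sec = CHUNKS_PER_SEC / CHUNKS_PER_BILL
steady_gap = steady_slo / bills_per_sec
peak_gap = PEAK_SLO / bills_per_sec

# Assumption: N identical GPUs deliver N times the single-GPU throughput.
gpus_steady = math.ceil(steady_gap)
gpus_peak = math.ceil(peak_gap)

print(f"steady-state SLO: {steady_slo:.3f} bills/sec")
print(f"measured: {bills_per_sec:.3f} bills/sec")
print(f"gap: {steady_gap:.2f}x steady-state, {peak_gap:.2f}x peak-burst")
print(f"L4-equivalents under linear scaling: {gpus_steady} steady, {gpus_peak} peak")
```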

Integrated recommendation

Recommendation Model: Llama 3.1 8B AWQ · Chunk size: 1,200 tokens · Concurrency: 12–24 (GPU-utilization sweet spot; quality flat across the range) · Hardware: Current single L4 is sufficient for the existing Illinois workload (~11,000 bills/week). National-scale expansion requires approximately 3.2× the current GPU capacity for steady-state and 6.6× for peak-burst. The methodology ports directly to new hardware: run the same 54-cell concurrency sweep and compare cell-by-cell.

◆ Known limitation — Qwen in the concurrency sweep

Experiment 2 tested Qwen at chunk_size=500, not its actual quality-best of 3,000. This was because Qwen's quality-best was not yet confirmed when the concurrency sweep was designed — at that point, qwen_500 appeared to be the best Qwen configuration by parse-OK. The Exp 1b extended sweep later confirmed qwen_3000 as the quality peak, and Exp 3.5 confirmed that Qwen's automated proxy actively misranks its own chunk sizes. A follow-up Qwen concurrency sweep at chunk_size=3,000 would complete the picture for Qwen — but since Qwen is eliminated on proxy-reliability grounds regardless, this is a completeness exercise rather than a decision-relevant one.

Why not Qwen? Why not Mistral?

The recommendation is Llama not because the other models are bad, but because they each fail at a specific stage where Llama holds.

Mistral 7B v0.3 AWQ

Eliminated at Stage 4 (concurrency scaling). Mistral's 3,000-token prompts create disproportionate KV cache pressure at higher concurrency. p99 latency grows from 15s at c=1 to 74s at c=24 — a 5× degradation. Every other model degrades 2–3×. The cliff appears between c=4 and c=8, which is exactly the concurrency range a production pipeline needs to operate at to approach the SLO.

Mistral does reach 72% parse-OK at chunk_size=3,000 or 5,000, and its substantive quality on self-contained bills is excellent. It is not a bad model — it is a bad fit for a pipeline that requires high-concurrency batching.

Qwen 2.5 7B AWQ

Eliminated at Stage 1 (quality) and Stage 3 (proxy reliability). Qwen's quality-best is 67.7% parse-OK at chunk_size=3,000 — 24.8 percentage points behind Llama's 92.5%. That gap is larger than Qwen's entire parse-OK range across all chunk sizes tested. On top of that, Qwen's automated quality signal actively misranks its own chunk sizes: qwen_3000 (the parse-OK peak) emits structurally valid JSON with null-filled fields, while qwen_2000 scores higher on every substantive dimension.

Qwen's raw throughput is impressive — fastest chunks/sec of the three models. But throughput at low quality is not the production metric. Effective throughput (parse-OK × chunks/sec) puts Llama ahead: 0.313 vs Qwen's 0.284.

◆ "Start small, scale to fit"

The production recommendation is a 7–8B quantized model on a single GPU. Not a frontier model. Not a multi-GPU cluster. The smallest capable model at its optimal operating point, on hardware sized to the actual workload. The SLO gap tells us exactly how much hardware growth the workload requires — and that is a procurement decision the data now supports rather than one made on assumption. When the new server arrives, run the same sweep, compare cell-by-cell, and let the measurements determine the upgrade value.

07 / Lessons

What we learned (honestly)

The model recommendation is the deliverable. The methodology is the contribution. Six lessons from this evaluation that transfer to any LLM inference benchmarking work.

The prior recommendation was right — and the methodology change is why it changed

The original evaluation (March 2026, Ollama backend) selected Mistral 7B at chunk_size=500 as the production model. That was the correct call given the comparison set and measurement stack available at the time.

Two things changed in this evaluation that together flipped the recommendation to Llama:

Stack upgrade: Ollama → vLLM. Ollama serves one request at a time and cannot expose per-stream TTFT, inter-token latency, prefill/decode decomposition, or real concurrency behavior. Moving to vLLM 0.6.6 made those measurements possible for the first time. Concurrency behavior — the most production-relevant dimension — was simply invisible before.

Model upgrade: Llama 3.2 3B → Llama 3.1 8B. The original comparison included Llama 3.2 3B alongside Mistral 7B and Qwen 2.5 7B. That is not a fair contemporary comparison — 3B vs 7-8B is a parameter-count mismatch, not a model-family comparison. Upgrading to Llama 3.1 8B for parameter parity was the right methodological choice. It also changed the answer: Llama 3.1 8B at 1,200 tokens now clearly leads on quality, where Llama 3.2 3B had been quality-comparable to Mistral at 500 tokens.

The lesson: the answer you get depends on the question you can ask. A better measurement stack asks better questions. The original March 2026 Ollama evaluation ↗ is available for direct comparison.

"Looks fine" is not "is fine" — verify before you trust exit codes

During the extended sweep, a series of launch failures each exited with code 0 and produced output files of plausible size. The sweep appeared to be running. It wasn't. The harness was failing open — producing empty or near-empty results while reporting success — because the failure mode (a tokenizer cache miss caused by an SSL handshake failure to an external host) happened before the extraction loop, not inside it.

The fix was found the same way every real production bug is found: checking file sizes, wall times, and chunks_written counts rather than trusting the exit code. A 3-second wall time on a cell that should take 5 minutes is the signal. A 761-byte result file where a real result is 4–5 KB is the signal.

The rule: in performance work, per-cell verification is mandatory. The harness can succeed at producing nothing. Build verification into the sweep design, not as an afterthought.
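
A per-cell check along these lines can be small. This is a minimal sketch rather than the harness used here: the function name and the thresholds are assumptions chosen to catch the two signals described above (a 3-second wall time, a 761-byte result file), and should be tuned to what a real cell produces.

```python
import os
import tempfile

def verify_cell(result_path, wall_seconds, chunks_written,
                min_bytes=2_000, min_wall_seconds=60, min_chunks=1):
    """Return the reasons a cell looks like a silent failure (empty list if none)."""
    problems = []
    size = os.path.getsize(result_path) if os.path.exists(result_path) else 0
    if size < min_bytes:
        problems.append(f"result file is {size} bytes (expected >= {min_bytes})")
    if wall_seconds < min_wall_seconds:
        problems.append(f"wall time {wall_seconds:.0f}s (expected >= {min_wall_seconds}s)")
    if chunks_written < min_chunks:
        problems.append(f"chunks_written={chunks_written} (expected >= {min_chunks})")
    return problems

# Demo with throwaway files standing in for one failed and one healthy cell.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "cell_bad.json")
    good = os.path.join(tmp, "cell_good.json")
    with open(bad, "wb") as fh:
        fh.write(b"x" * 761)
    with open(good, "wb") as fh:
        fh.write(b"x" * 4500)
    bad_report = verify_cell(bad, wall_seconds=3, chunks_written=0)
    good_report = verify_cell(good, wall_seconds=300, chunks_written=120)

print(bad_report)
print(good_report)
```

Running a check like this after every cell, and failing the sweep on a non-empty report, turns "exit code 0" into a claim that has to be backed by evidence.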

Parse-OK is necessary but not sufficient — and its reliability is model-dependent

Parse-OK measures structural validity: did the model produce well-formed JSON with all fields populated? It cannot measure whether those values are accurate, specific, or complete. A model can pass parse-OK with all 11 fields confidently populated — and have every answer be about the wrong section of the bill.

Experiment 3 confirmed that parse-OK reliably ranks models at their respective quality-best configurations. That cross-model reliability is real and useful. Experiment 3.5 found that within a single model, parse-OK fails to rank chunk sizes reliably — and the failure mode differs by model:

Llama: directional but understated — parse-OK correctly identified the better chunk size but understated the quality gap by roughly 14×
Qwen: actively misranking — the parse-OK peak emits structurally valid JSON with null-filled fields, scoring higher than configurations with substantively better output
Mistral: mildly reversed — the parse-OK peak scores lower on substance due to off-section drift on operative chunks

The rule: don't generalize proxy reliability findings across models. Validate the proxy within each model family before using it for tuning decisions.

"Quality-best chunk size" is incomplete without asking "best for what corpus?"

The single-cell quality-best (peak parse-OK in the controlled sweep) is a useful starting point, but it answers a narrow question: which chunk size performs best on a uniform sample of 4,500–4,972-token bills? The production corpus is not uniform. Its median bill is 1,051 tokens. 68% of bills are under 2,000 tokens. 29% are under 500 tokens.

Three lenses on "best" all agree on Llama in this case, but the gaps differ: 20.5pp at single-cell best, 27.5pp at median-bill chunking, 13pp at P75 chunking. In a different corpus or with different models, the lenses might not agree — and the engineer who only checked the single-cell number would miss that the recommendation is corpus-dependent.

The rule: always test quality claims against the production token-length distribution, not just the controlled eval sample. The corpus shape determines the chunking strategy that matters.

The chunk_size=4,000 dip is a corpus-length artifact, not a model limitation

When the extended sweep showed Mistral's parse-OK dropping from 72% at 3,000 tokens to 59.8% at 4,000 tokens, the first instinct was to conclude that 4,000-token chunks are harder. They are not, at least not in general. The eval sample happens to consist of bills in the 4,500–4,972 token range. At chunk_size=4,000 those bills split into one full 4,000-token primary chunk and one short tail chunk of roughly 500–970 tokens. The tail chunk has insufficient content to populate all 11 fields and correctly returns parse_partial.

At chunk_size=5,000 the same bills become single chunks. Parse-OK recovers to exactly 72.0%. The "dip" was a chunk-boundary interaction with the corpus length distribution — not a model failure, not a methodology flaw, just an artifact of how the eval sample happened to align with the chunk boundary.

The rule: when you see an unexpected performance cliff, check whether the eval corpus distribution is interacting with the configuration boundary before concluding the model has a capability limitation.
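The boundary interaction can be reproduced with a few lines of arithmetic. The sketch below assumes a plain fixed-size splitter with no overlap; the pipeline's actual chunker is not shown in this write-up, and `chunk_lengths` is a hypothetical helper used only for illustration.

```python
def chunk_lengths(bill_tokens: int, chunk_size: int) -> list[int]:
    """Token length of each chunk when a bill is cut into fixed-size pieces.

    Assumes no overlap between chunks; a real chunker may behave differently.
    """
    full, tail = divmod(bill_tokens, chunk_size)
    lengths = [chunk_size] * full
    if tail:
        lengths.append(tail)  # the short remainder chunk
    return lengths


# Shortest and longest bills in the eval sample (4,500 and 4,972 tokens).
for bill in (4_500, 4_972):
    for size in (3_000, 4_000, 5_000):
        print(f"bill={bill} chunk_size={size} -> {chunk_lengths(bill, size)}")
```

Under that assumption, 4,000 is the only one of the three settings that leaves a tail under 1,000 tokens: 3,000 leaves a tail of 1,500 tokens or more, and 5,000 leaves none.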

"Start small, scale to fit"

The production recommendation is a 7–8B quantized model on a single L4 GPU. Not a frontier model. Not a multi-GPU cluster. Not the largest available option. The smallest capable model at its optimal operating point, on hardware sized to the actual measured workload.

The SLO gap (3.16× on steady-state) is not a failure of this approach — it is the output of it. The gap is now measured, not assumed. The hardware procurement decision that follows is grounded in data: a 3× scaled hardware solution clears the steady-state SLO at Llama quality. Without this evaluation, that number would have been a guess. With it, it is a specification.

Smaller models also leave VRAM headroom for concurrent workloads on shared infrastructure. A 7-8B AWQ model uses approximately 4 GB VRAM at idle, leaving 18 GB for KV cache, concurrency headroom, and co-resident pipelines. The new server won't serve only this pipeline. That headroom has real operational value.

The principle: characterize the workload before sizing the hardware. Pick the smallest capable model that meets quality and throughput requirements. Reserve larger models and bigger hardware for the workload growth they actually serve — not for the workload you imagine you might have someday.

08 / Experiment 2

Concurrency sweep — finding the knee

Experiment 1 answered which model. Experiment 2 answered at what concurrency — how much speedup each model delivers as requests scale up, where the diminishing-returns knee sits, and whether quality holds under load.

Design

Three models at each model's quality-best chunk size, six concurrency levels (1, 4, 8, 12, 16, 24), three repeats per cell. Same 50-bill controlled sample and warm-start protocol as Experiment 1.

Profile	Chunk size	Reason for inclusion
`llama31_1200`	1,200 tokens	Quality champion from Exp 1 (92.5% parse-OK)
`qwen_500`	500 tokens	Throughput champion candidate — note: qwen_3000 is Qwen's actual quality-best; see Exp 2b
`mistral_3000`	3,000 tokens	Incumbent baseline at quality-best chunk size

Throughput scaling — all profiles

Throughput in chunks/sec as a function of concurrency. All three models scale meaningfully. Llama reaches 8.49× speedup at c=24, Qwen 11.76×, Mistral 4.93×. Dashed lines mark each profile's production SLO target.

Speedup relative to single-stream baseline

Same data normalized to conc=1. Qwen comes closest to linear scaling (small chunks expose more parallelism). Llama scales smoothly. Mistral flattens early — its 3,000-token prompts are prefill-heavy.

◆ The disqualifying finding — Mistral's p99 tail

Mistral's p99 latency grows from 15s at c=1 to 74s at c=24 — a 5× degradation. The cliff appears between c=4 and c=8, exactly the range needed to approach the production SLO. This is head-of-line blocking from KV cache contention: 3,000-token prompts demand more cache than the GPU can serve without preempting queued requests. A 74-second p99 is operationally unacceptable for a production batch pipeline.

p99 latency vs concurrency

Mistral's tail explodes 5× from c=1 to c=24. Qwen stays comparatively flat (9.5s → 18.8s). Llama falls in the middle (14s → 39s), which stays usable across the expected production operating range.

Quality vs concurrency — the production-defining finding

Quality does not degrade as concurrency increases. Maximum variation: 1.5pp (Llama 3.1: 91.0%→92.5%). The operating point can be chosen purely on throughput/latency grounds; no quality headroom needs to be budgeted against queue depth.

Experiment 2b — Qwen 3000 at its actual quality-best

The original concurrency sweep tested qwen_500 because that was Qwen's apparent quality-best at the time. Exp 1b later confirmed qwen_3000 as Qwen's actual quality-best. This follow-up sweep completes the picture.

qwen_3000 concurrency profile — averaged across 3 repeats per cell

Concurrency	Parse-OK	Chunks/sec	p50	p99	Speedup vs c=1
1	68.0%	0.145	6.68s	10.33s	1.00×
4	67.7%	0.421	9.06s	14.81s	2.91×
8	67.7%	0.630	11.80s	21.30s	4.34×
12	68.0%	0.794	14.07s	27.72s	5.48×
16	67.7%	0.947	15.63s	29.99s	6.53×
24	71.3%	1.157	19.04s	31.06s	7.98×
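The knee can be read off the table by recomputing speedup, scaling efficiency, and the throughput gained per additional in-flight request. This is a minimal sketch over the published (rounded) table values; `scaling_profile` and its field names are illustrative, not part of the pipeline's tooling, and the "effective" column uses the quality-weighted definition (parse-OK × chunks/sec) that this section applies to Qwen and Llama.

```python
# (concurrency, parse-OK fraction, chunks/sec), copied from the qwen_3000 table.
ROWS = [
    (1, 0.680, 0.145),
    (4, 0.677, 0.421),
    (8, 0.677, 0.630),
    (12, 0.680, 0.794),
    (16, 0.677, 0.947),
    (24, 0.713, 1.157),
]


def scaling_profile(rows):
    """Derive scaling metrics per concurrency level; rows[0] must be c=1."""
    base_cps = rows[0][2]
    profile = []
    prev = None  # (concurrency, chunks/sec) of the previous level
    for conc, parse_ok, cps in rows:
        speedup = cps / base_cps
        profile.append({
            "concurrency": conc,
            "speedup": speedup,
            # 1.0 would mean perfectly linear scaling.
            "efficiency": speedup / conc,
            # Quality-weighted throughput: parse-OK x chunks/sec.
            "effective": parse_ok * cps,
            # chunks/sec gained per extra in-flight request since the previous level.
            "marginal": None if prev is None else (cps - prev[1]) / (conc - prev[0]),
        })
        prev = (conc, cps)
    return profile


profile = scaling_profile(ROWS)
for p in profile:
    marginal = "-" if p["marginal"] is None else f"{p['marginal']:.3f}"
    print(f"c={p['concurrency']:>2}  speedup={p['speedup']:.2f}x  "
          f"efficiency={p['efficiency']:.2f}  effective={p['effective']:.3f}  "
          f"marginal={marginal}")
```

Recomputed speedups can differ from the table in the last digit, presumably because the table was derived from unrounded throughput. The marginal gain falls at every step, from about 0.09 chunks/sec per added request between c=1 and c=4 to about 0.03 between c=16 and c=24, which is the diminishing-returns pattern the sweep was designed to expose.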

● What changed vs qwen_500

At chunk_size=3,000 Qwen's parse-OK improves from 62.6% to 67.7–71.3% depending on concurrency, a real improvement. The p99 tail is worse (31s vs 19s at c=24) but still far better than Mistral's 74s. Throughput is lower in chunks/sec because 3,000-token chunks are larger, but since each bill splits into fewer chunks, bill-level throughput is comparable.

◆ What didn't change

Quality is still invariant under concurrency — 67.7–71.3% across all levels. And at 71.3% peak parse-OK, qwen_3000 still sits 21 percentage points below Llama at 92.5%. The quality gap that disqualifies Qwen in Stage 1 of the funnel is unchanged. Effective throughput (quality × chunks/sec) at c=24: 0.825 — marginally higher than Llama's 0.803, but at dramatically lower quality.

● The production-defining finding

Quality is invariant under concurrency for all three models. This eliminates an entire class of production risk: concurrency tuning is a hardware-utilization problem, not a quality trade-off problem. The right operating point for Llama is the one that maximizes throughput within the latency SLO — currently concurrency=12–24 on the L4.

09 / SLO Gate

Production SLO gate — the honest answer

The experiments characterize the model+hardware combination. This section answers whether it meets production requirements — using the real corpus distribution, not the controlled eval sample.

The production workload contract

~75,000 bills/week at 50-state scale (50 states × ~1,500 bills/week peak). Processing window: 3 nights × 6 hours = 18 hours/week.

Required throughput: 1.157 bills/sec sustained · 2.41 bills/sec peak-burst.
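The sustained figure follows directly from the contract. A short sketch of that arithmetic is below; the function name is illustrative, and the peak-burst figure of 2.41 bills/sec is not derived here because its inputs are not stated in this section.

```python
def required_bills_per_sec(bills_per_week: float, nights: int, hours_per_night: float) -> float:
    """Sustained rate needed to finish a week's bills inside the nightly windows."""
    window_seconds = nights * hours_per_night * 3600
    return bills_per_week / window_seconds


# 50 states x ~1,500 bills/week, processed in 3 nights x 6 hours.
sustained = required_bills_per_sec(50 * 1_500, nights=3, hours_per_night=6)
print(f"{sustained:.3f} bills/sec sustained")
```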

The 50-bill eval sample is intentionally uniform (4,500–5,000 tokens). The real Illinois corpus has a median of 1,051 tokens — 29% of bills are under 500 tokens, 12% are over 10,000. Using the eval sample for SLO projection overstates the chunking workload by 1.4×–3.4× depending on profile. The right calculation uses the real corpus distribution.

Steady-state gap: 3.16× · Llama @ c=24 vs 1.157 bills/sec SLO

Peak-burst gap: 6.57× · Llama @ c=24 vs 2.41 bills/sec peak SLO

Hardware needed: ~3× · L4s in parallel to clear steady-state SLO

SLO gap multipliers — all operating points

Per-cell shortfall to steady-state SLO. None of the 18 operating points on a single L4 meets the SLO, but the gap ranges from 2.74× (Qwen at c=24) to 32× (Qwen at c=1). Smallest gap at acceptable quality: Llama at c=24, 3.16× short.

Hardware gap — best operating point per model

Gap multiplier at each model's best operating point (c=24). The dashed line is the SLO pass threshold. A 3× scaled solution (3 L4s or one A100-class GPU) clears the steady-state SLO at Llama quality. Peak-burst requires ~7×.

Pareto frontier — throughput × quality, all operating points

Every (model, concurrency) point plotted by throughput and quality. Per-profile SLO thresholds as vertical dashed lines. Llama dominates the upper portion on quality. The cluster of Llama points between c=12 and c=24 — 91–92.5% quality, within 3–4× of SLO — are the operating points to scale forward.

◆ What this means

The eval is not obsolete — it is the comparison baseline. When the new server arrives, run the same 54-cell concurrency sweep on the new hardware and compare cell-by-cell. The SLO gap multiplier will change; the methodology does not. The model recommendation (Llama) is hardware-agnostic — it is a property of the extraction task, not the GPU.

SLO compliance analysis — measured throughput vs workload contract

All figures derived from measured p99 throughput at the recommended operating point: Llama 3.1 8B AWQ, chunk_size=1,200, concurrency=24, single NVIDIA L4.

Measured sustained throughput: 0.88 chunks/sec · 2.40 chunks/bill (real corpus) → 0.367 bills/sec effective

● SLO PASS — Current production workload

Workload: 11,079 bills/week (Illinois only)
Required throughput: ~0.171 bills/sec (11,079 ÷ 64,800 s, the same 18-hour weekly window used for the 50-state contract)
Measured throughput: 0.367 bills/sec
Headroom: ~2.1× above SLO threshold. A single L4 is sufficient, with capacity to absorb session-to-session variance and co-resident workloads.

✗ SLO FAIL — 50-state expansion target

Workload: 75,000 bills/week (50-state peak)
Required throughput: 1.157 bills/sec sustained · 2.41 bills/sec peak-burst
Measured throughput: 0.367 bills/sec
Deficit: 3.16× below steady-state SLO · 6.57× below peak-burst SLO
Projected processing time at measured throughput: ~56.6 hours against an 18-hour window.

Capacity planning requirement to close the steady-state deficit: minimum 3.16× throughput increase — achievable via horizontal scaling (~3 L4s in parallel) or vertical scaling (single A100-class GPU, which delivers ~3–4× the L4's inference throughput for this model class). Peak-burst compliance requires ~6.6× — ~7 L4s or H100-class hardware. These are now hardware procurement specifications derived from measurement, not estimates.
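The gap multipliers and device counts above can be recomputed from the three measured inputs. The sketch below assumes throughput scales linearly with the number of identical GPUs unless a `scaling_efficiency` below 1.0 is supplied; that parameter and the helper names are illustrative and were not measured in this study.

```python
import math

REQUIRED_SUSTAINED = 75_000 / (18 * 3600)  # bills/sec, from the workload contract
REQUIRED_PEAK = 2.41                        # bills/sec, taken as stated


def bills_per_sec(chunks_per_sec: float, chunks_per_bill: float) -> float:
    """Convert chunk-level throughput into bill-level throughput."""
    return chunks_per_sec / chunks_per_bill


def gap_multiplier(required: float, measured: float) -> float:
    """How many times more throughput is needed to meet the requirement."""
    return required / measured


def devices_needed(gap: float, scaling_efficiency: float = 1.0) -> int:
    """Whole devices required; efficiency 1.0 means perfectly linear scaling."""
    if not 0 < scaling_efficiency <= 1:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return math.ceil(gap / scaling_efficiency)


measured = bills_per_sec(0.88, 2.40)  # Llama, c=24, real-corpus chunks/bill
steady_gap = gap_multiplier(REQUIRED_SUSTAINED, measured)
peak_gap = gap_multiplier(REQUIRED_PEAK, measured)
hours = 75_000 / measured / 3600

print(f"measured={measured:.3f} bills/sec")
print(f"steady gap={steady_gap:.2f}x -> {devices_needed(steady_gap)} devices")
print(f"peak gap={peak_gap:.2f}x -> {devices_needed(peak_gap)} devices")
print(f"projected runtime={hours:.1f} h")
```

Under strictly linear scaling, the ceiling of a 3.16× gap is 4 devices: three L4s would reach about 1.10 bills/sec, roughly 5% under the 1.157 target, so the ~3× figure leaves a small shortfall to be recovered elsewhere (for example with a faster single GPU). The projected runtime comes out just under 57 hours, in line with the ~56.6-hour figure.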

09 / Optimization Paths

Eliminated — but not without recourse

Each eliminated model failed at a specific, measurable stage. The telemetry tells us not just why each model failed, but what specific interventions would be required to rehabilitate it. This is performance engineering applied to model selection.

◆ The principle

A model isn't categorically bad — it's bad at a specific operating point. The measurements tell us which knob to turn. Whether the cost of turning that knob is worth it is a separate, explicit decision. Documenting the path matters because the pipeline will evolve: a future workload change, hardware upgrade, or prompt improvement might make a previously-eliminated model viable.

Mistral 7B — eliminated at Stage 4 (concurrency)

Root cause (measured): KV cache head-of-line blocking. At chunk_size=3,000 each prompt's KV cache scales as ~3,000 tokens × 32 layers × 2 (keys and values) × KV heads × head dimension × bytes per element. At concurrency=8+, the L4's 22.5 GB VRAM cannot hold all in-flight KV states simultaneously, triggering preemption. Measured signal: p99 grows 5× (15s→74s), cliff between c=4 and c=8.
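A rough sizing of the per-request KV footprint helps put the cache-pressure explanation in numbers. The sketch below assumes the published Mistral 7B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128), FP16 cache entries, no prefix sharing, and the same max_tokens=1,024 output cap quoted elsewhere in this report; none of these settings is confirmed here for the Mistral runs, and prompt-template tokens are ignored.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib(tokens_per_request: int, concurrent_requests: int, bytes_per_token: int) -> float:
    """Total KV cache, in GiB, for a number of equally sized in-flight requests."""
    return tokens_per_request * concurrent_requests * bytes_per_token / GIB


# Assumed Mistral 7B shape with FP16 cache entries (see the assumptions above).
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for conc in (1, 4, 8, 16, 24):
    prompt_only = kv_gib(3_000, conc, PER_TOKEN)
    worst_case = kv_gib(3_000 + 1_024, conc, PER_TOKEN)  # prompt plus full output cap
    print(f"c={conc:>2}  prompt KV={prompt_only:5.2f} GiB  worst case={worst_case:5.2f} GiB")
```

How much of the 22.5 GB is actually available for KV blocks depends on the weights, activation workspace, and the memory-utilization setting, none of which are stated here, so the sketch reports footprints rather than predicting where preemption starts. Comparing these footprints with the KV cache capacity that vLLM logs at startup would confirm or rule out cache exhaustion as the cause of the p99 cliff.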

Optimization path 1 — Reduce chunk size to 2,000 tokens. Our Exp 3.5 data shows mistral_2000 scores higher on quality (1.97 vs 1.82) than mistral_3000. Smaller chunks = smaller KV states = less cache pressure. Test: rerun concurrency sweep at chunk_size=2,000 and measure whether the p99 cliff moves out past c=16. If it does, Mistral becomes viable.

Optimization path 2 — vLLM swap space tuning. --swap-space controls CPU RAM offload when VRAM KV cache fills. Increasing from the default 4 GB to 16+ GB may absorb the burst without blocking the queue. Test: run the concurrency sweep with --swap-space 16 and compare p99 curves.

Optimization path 3 — Chunking strategy. If the real corpus has a higher proportion of short bills (<1,500 tokens), those bills never produce 3,000-token chunks and Mistral's KV issue never triggers. Measure corpus chunk-length distribution at chunk_size=2,000 and calculate what fraction of production traffic would hit the KV ceiling.

Qwen 2.5 7B — eliminated at Stage 1 (quality) and Stage 3 (proxy)

Root cause (measured): Two distinct failure modes. (1) Off-section drift at chunk_size=3,000: the model anchors on adjacent bill sections, producing specific answers about the wrong content. (2) None-emission: qwen_3000 returns parse_status=OK with all 11 fields set to "None." — structurally valid but substantively empty. The parse-OK proxy actively misranks these configurations.

Optimization path 1 — Prompt engineering for None-emission. The all-None failure is a prompt compliance failure, not a model capability failure. Adding explicit field-completion requirements ("every field must contain substantive extracted content; return 'not mentioned' rather than null") and two-shot examples of complete extractions may eliminate this failure mode. Test: rerun Exp 3.5 Pair 2 (Qwen 2000 vs 3000) with a v2 prompt and measure whether None-emission drops below 5%.
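Measuring the 5% target requires a None-emission metric that is independent of parse_status. A minimal sketch follows, assuming each extraction is available as a flat dict of field name to string; the field names and the set of placeholder spellings are hypothetical and would need to match the real schema.

```python
# Placeholder spellings treated as "empty" (assumed; extend to match real outputs).
EMPTY_MARKERS = {"", "none", "none.", "null", "n/a"}


def is_empty_value(value) -> bool:
    """True for null values and for strings that only hold a placeholder."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in EMPTY_MARKERS


def is_none_emission(record: dict) -> bool:
    """True when a record has fields but every one of them is empty."""
    return bool(record) and all(is_empty_value(v) for v in record.values())


def none_emission_rate(records) -> float:
    """Fraction of records that are structurally valid but substantively empty."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_none_emission(r) for r in records) / len(records)


SAMPLE = [
    {"sponsor": "None.", "effective_date": "None."},         # all-None emission
    {"sponsor": "Sen. Example", "effective_date": "None."},  # partially filled
    {"sponsor": None, "effective_date": ""},                 # all-None emission
    {"sponsor": "Rep. Example", "effective_date": "2025-01-01"},
]
print(f"None-emission rate: {none_emission_rate(SAMPLE):.0%}")
```

Whether "not mentioned" should also count as empty is a judgment call. Counting it would keep a v2 prompt from lowering the rate simply by renaming the placeholder.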

Optimization path 2 — chunk_size=2,000 as the operating point. Exp 3.5 shows qwen_2000 scores 1.93 quality vs qwen_3000's 1.57 — the best Qwen quality observed. Running a full concurrency sweep at qwen_2000 would complete the picture: does qwen_2000's tighter window also improve parse-OK rate and eliminate the proxy-reliability problem? If parse-OK at qwen_2000 exceeds qwen_3000's 67.7%, the proxy becomes reliable for within-Qwen tuning.

ITL signal (measured): Qwen's inter-token latency (ITL) p99 of 118ms vs Llama's 49ms indicates spiky decode behavior. This is likely caused by Qwen's tokenizer producing more variable-length outputs. Capping max_tokens at a lower value (e.g., 768 instead of 1,024) may stabilize ITL at the cost of occasionally truncating long responses.

Qwen 2.5 14B — eliminated at Experiment 4 (quality + throughput)

Root cause (measured): Output budget pressure. At 5,000 input tokens + prompt overhead, insufficient token budget remains for complete 11-field JSON output. Evidence: 54% parse_partial, mean 1.2 missing fields, 0 failed extractions. The model is not failing — it is running out of output space.

Optimization path 1 — Increase max_tokens. Current config: max_tokens=1,024. A complete 11-field extraction averages ~185 tokens at 7B and likely more at 14B. Increasing to 2,048 gives ~2× more output headroom. Test cost: roughly 40–60% slower per call (more decode steps). Run 3 cells at max_tokens=2,048 and measure whether parse_partial drops below 20%.

Optimization path 2 — Reduce to chunk_size=2,000. Smaller input = more output budget remaining. This is the same observation as for Qwen 7B: chunk_size=2,000 may be the sweet spot for all three eliminated models. At 2,000 tokens the 14B model has ~3,000 tokens of output headroom — more than enough for 11 fields. Test: rerun Exp 4 at chunk_size=2,000 with the existing qwen14b chunker config.

Throughput note: Even if quality is fixed, 14B decode is inherently slower per token than 8B. The effective throughput ceiling at concurrency=4 is ~0.15–0.20 chunks/sec regardless of quality improvement. Unless decode speed improves with hardware, the 14B model will always be 2–3× slower than Llama — a trade-off that requires explicit justification.

The untested sweet spot — chunk_size=2,000

All three eliminated models point to the same untested configuration: chunk_size=2,000. For Mistral it should reduce KV cache pressure. For Qwen 7B it may eliminate the off-section drift and proxy-reliability failures. For Qwen 14B it should recover output budget. The original sweep jumped from 1,200 to 2,000 and used 2,000 only as a mid-point, never as a primary operating point for any model.

A focused 9-cell experiment — 3 eliminated models × chunk_size=2,000 × 3 repeats — would complete the picture for all three at once. If Mistral_2000 clears the p99 threshold and qwen_2000's proxy becomes reliable, the competitive landscape changes. The Llama recommendation likely holds, but the margin would be explicitly measured rather than inferred.
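The cell count is easy to enumerate, which also makes the cost of adding concurrency levels explicit. This is a sketch only: the profile prefixes follow the naming used in this report (`mistral_2000`, `qwen_2000`), while `qwen14b_2000` is an assumed name by analogy with `qwen14b_5000`.

```python
from itertools import product

MODELS = ["mistral", "qwen", "qwen14b"]  # the three eliminated models
CHUNK_SIZES = [2_000]
REPEATS = [1, 2, 3]
CONCURRENCY = [1, 4, 8, 12, 16, 24]      # levels used in the concurrency sweeps

# Screening run: one cell per (model, chunk size, repeat).
screening = [
    {"profile": f"{model}_{chunk}", "chunk_size": chunk, "repeat": rep}
    for model, chunk, rep in product(MODELS, CHUNK_SIZES, REPEATS)
]

# Full concurrency characterization of the same profiles.
full = [
    {"profile": f"{model}_{chunk}", "concurrency": conc, "repeat": rep}
    for model, chunk, conc, rep in product(MODELS, CHUNK_SIZES, CONCURRENCY, REPEATS)
]

print(len(screening), "screening cells;", len(full), "cells with the concurrency sweep")
```

The 9-cell screening run and a full concurrency characterization are different budgets: adding the six concurrency levels brings the same three profiles to 54 cells.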

This targeted chunk_size=2,000 sweep of all three eliminated models, followed by full concurrency characterization for any model that clears it, is a proposed addition to the planned experiment pipeline.

10 / Experiment 4

Does a larger model at full-bill context beat chunking?

The hypothesis: Qwen 2.5 14B AWQ at chunk_size=5,000 — each eval bill as a single chunk, no splitting, no boundary artifacts — would outperform the 7B at 3,000 tokens. The result was the opposite.

Design

Model: Qwen 2.5 14B Instruct AWQ — 14B parameters, ~9 GB VRAM

Chunk size: 5,000 tokens — each eval bill (4,500–4,972 tokens) becomes a single chunk. One inference call per bill. No splitting. No post-processing aggregation.

Concurrency: 4 · Repeats: 3 · Bills: 50 · Total runtime: 16 minutes

Chunks were created by concatenating qwen_3000 chunk pairs back to full-bill text and writing directly to silver, bypassing the broken chunker CLI.

The hypothesis

Off-section drift — the dominant failure mode in Experiments 3 and 3.5 — occurs because a larger chunk gives the model more bill context than it needs, causing it to anchor on adjacent sections. The natural fix: give it the whole bill as a single chunk, eliminating the section-boundary problem entirely.

A 14B model also has more capacity to handle complex multi-section bills. The expectation was meaningfully higher parse-OK — potentially approaching Llama's 92.5%.

◆ Result: The larger model performed worse

45.0% parse-OK — the lowest of any configuration tested. 47.5pp below Llama's 92.5%, and 22.7pp below the 7B Qwen at 3,000 tokens. Doubling parameters while eliminating chunking made quality worse, not better.

Fig · Final model comparison — all profiles at concurrency=4

What the data actually shows

54% parse_partial — not failure, but incompleteness

The 14B model produced 0 failed extractions. It's not crashing — it's returning valid JSON with incomplete field coverage. Mean missing fields: 1.2 out of 11. The model gets most fields but consistently fails to complete all 11 within the output token budget.

At 5,000 input tokens + prompt template overhead, the context window is heavily loaded. The model appears to be hitting output budget pressure — the prompt leaves insufficient room for a complete 11-field JSON response on every bill.

Throughput: slowest of all configurations

0.156 chunks/sec at concurrency=4 — less than half Llama's 0.338. p50 latency of 24.9 seconds per call vs Llama's 6.7 seconds. Since each bill is now one chunk (vs 2.4 chunks at chunk_size=1,200), bill-level throughput is 0.156 bills/sec vs Llama's 0.141 bills/sec — roughly comparable, but at less than half the quality.

The 14B model decodes more slowly per token than the 8B: AWQ quantization shrinks the weights in memory but does not remove the per-token decode cost of a larger model.

SLO projection — Qwen 14B at production scale

Bills/sec @ c=4: 0.156 · 1 chunk/bill → throughput = chunks/sec

Gap to steady-state SLO: 7.43× · vs 3.16× for Llama, more than 2× worse

To process 75,000 bills: ~134h · vs ~56.6h for Llama · 18h window available

At 0.156 bills/sec, processing 75,000 bills would require ~134 hours — 7.4× the available 18-hour window, and more than twice the gap Llama faces. The 14B model is eliminated on both quality and throughput grounds.
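The bill-level comparison and the runtime projection reduce to two divisions. The sketch below uses only figures quoted in this section; the helper names are illustrative.

```python
WINDOW_HOURS = 18


def bills_per_sec(chunks_per_sec: float, chunks_per_bill: float) -> float:
    """Convert chunk-level throughput into bill-level throughput."""
    return chunks_per_sec / chunks_per_bill


def hours_to_process(bills: int, rate_bills_per_sec: float) -> float:
    """Wall-clock hours to process a batch at a constant bill rate."""
    return bills / rate_bills_per_sec / 3600


llama_c4 = bills_per_sec(0.338, 2.4)    # Llama at c=4, 2.4 chunks/bill
qwen14b_c4 = bills_per_sec(0.156, 1.0)  # Qwen 14B at c=4, 1 chunk/bill

qwen14b_hours = hours_to_process(75_000, qwen14b_c4)
window_multiple = qwen14b_hours / WINDOW_HOURS

print(f"Llama @ c=4:    {llama_c4:.3f} bills/sec")
print(f"Qwen 14B @ c=4: {qwen14b_c4:.3f} bills/sec")
print(f"Qwen 14B: {qwen14b_hours:.0f} h, {window_multiple:.1f}x the window")
```

The 1 chunk/bill figure holds for the eval sample, where every bill fits in 5,000 tokens. Bills above 5,000 tokens in the real corpus would need more than one chunk, so the 14B projection is, if anything, optimistic.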

Why chunking + smaller model wins

● The architectural lesson

The failure of the 14B full-bill approach reveals something important: structured extraction at scale favors tight context windows over large models. Llama 3.1 8B at 1,200 tokens succeeds because each chunk is small enough that the model can focus precisely on its content and produce complete, accurate field values within the output budget. A 5,000-token bill floods the model's attention across the entire document — it sees everything but completes nothing reliably.

The off-section drift problem (Exp 3.5) is real, but the solution is not to eliminate chunking — it's to choose the right chunk size for the task. At 1,200 tokens, Llama stays on-chunk. The chunking artifacts that motivated this experiment are a configuration problem, not an architectural one.

Full comparison — all tested configurations

Profile	Model	Chunk size	Parse-OK	Chunks/sec	p50	p99	Quality	Verdict
`llama31_1200`	Llama 3.1 8B	1,200	92.5%	0.338	6.7s	21s	1.70	✓ PRODUCTION
`mistral_3000`	Mistral 7B	3,000	72.0%	0.212	22.4s	74.1s	1.53	✗ p99 explosion
`qwen_3000`	Qwen 2.5 7B	3,000	67.7%	0.421	9.1s	14.8s	1.57	✗ proxy misranks
`qwen_500`	Qwen 2.5 7B	500	62.6%	0.706	9.8s	9.8s	1.58	✗ quality ceiling
`qwen14b_5000`	Qwen 2.5 14B	5,000	45.0%	0.156	24.9s	33.4s	TBD*	✗ worst quality tested

* Qualitative A/S/C review for qwen14b_5000 scheduled as follow-up. Parse-OK of 45% makes the outcome predictable — not prioritized before production deployment.

◆ Final conclusion

The production recommendation is confirmed and strengthened by Experiment 4. Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24 leads on the dimensions that decide the choice: parse-OK quality (92.5%), qualitative scoring (1.70/2.0, 60/60 confirmed), and proxy reliability (directional within-model), with a usable latency tail (39s p99 at c=24). A model nearly twice its size, given full-bill context, could not match it. The methodology is complete. The recommendation stands.

12 / What's Next

The experiment pipeline

This evaluation answers the model selection question for the current hardware and scale. The experiments below extend it — to larger models, the full production corpus, and the hardware that doesn't exist yet.

✓ Complete

Experiment 3 — Cross-Model Quality Validation

Blind rubric scoring of 60 cells across the three quality-best configurations. Confirmed parse-OK reliably ranks models in the same order as substantive quality. 60/60 cells confirmed, zero revised between passes.

✓ Complete

Experiment 3.5 — Within-Model Proxy Validation

Five within-model pairs scored across chunk sizes bracketing each model's parse-OK peak. Found that parse-OK fails as a within-model proxy for all three models — quality peaks at the smaller chunk size in every informative pair. Characterized the failure mode (off-section drift) and established that parse-OK's reliability is model-dependent. Full scoring available in the Quality Review document ↗.

✓ Complete

Experiment 2b — Qwen 3000 Concurrency Sweep

The original concurrency sweep (Exp 2) tested Qwen at chunk_size=500, its apparent quality-best at the time. Exp 1b later confirmed qwen_3000 as Qwen's actual quality-best. This follow-up ran Qwen 2.5 7B AWQ at chunk_size=3,000 across the same six concurrency levels (1, 4, 8, 12, 16, 24), 3 repeats each, on the same 50-bill controlled sample. It delivers a directly comparable concurrency profile for Qwen's actual quality-best configuration. Results are reported in the concurrency section as Experiment 2b.

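✓ Complete · Run at chunk_size=5,000, concurrency=4; results reported in the Experiment 4 section. The original plan is kept below for reference.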

Experiment 4 — Qwen 2.5 14B AWQ

Model downloaded and staged on the current L4 (C:\models\Qwen2.5-14B-Instruct-AWQ\, 9.29 GB across 3 shards). This is the natural "start small, scale up" next step: double the parameter count within a single-GPU deployable form factor and measure whether quality improves meaningfully.

The specific hypothesis: at 14B parameters, does the model handle more of the real corpus without chunking? The production corpus has a long right tail — 12% of bills exceed 10,000 tokens, and the longest bill is over 1 million tokens. If a 14B model with a larger native context window handles mid-length bills (say, 3,000–10,000 tokens) as single-chunk extractions rather than requiring splits, it may improve both quality (no chunk-boundary artifacts) and throughput (fewer inference calls per bill).

Design: same eval sample, same chunk-size sweep structure, same telemetry stack. Adds one concurrency-sweep cell at the quality-best chunk size found. Results compared cell-by-cell to the 7-8B baseline established here.

Note: throughput will be lower than the 7-8B models — a 14B model decodes slower and demands more KV cache. The interesting question is whether the quality gain justifies the throughput cost, and whether the effective throughput (quality-adjusted chunks/sec) still favors the smaller model.

○ Planned

Experiment 5 — Production-Distribution Validation

The 50-bill controlled eval sample answers "how do these models compare on a stress-test distribution?" The 11,079-bill production corpus answers "how does the chosen model behave on real traffic?" Experiment 5 runs Llama 3.1 8B AWQ at the recommended configuration (chunk_size=1,200, concurrency=12–24) against the full Illinois corpus, with full instrumentation.

Key questions: does parse-OK hold within 2–3pp of the controlled-sample finding across the full length distribution? How do very short bills (<500 tokens, 29% of corpus) behave — do they produce single trivial chunks that inflate parse-OK artificially? How do very long bills (>10,000 tokens, 12% of corpus) behave — does quality degrade as chunk count per bill increases? The deliverable is a production cost model: total chunks, GPU-hours, and projected runtime for the full Illinois corpus and the 50-state target.

○ Planned

Experiment 6 — Adaptive Chunking Strategy

All experiments so far use a uniform chunk size across all bills. But the production corpus has a bimodal-ish length distribution — 29% of bills are under 500 tokens (fit in one small chunk), 12% are over 10,000 tokens (require many chunks at any reasonable size). The hypothesis: routing different bill-length buckets to different configurations may outperform any single uniform strategy on corpus-weighted parse-OK.

A simple adaptive policy: bills under 1,500 tokens → extract as a single chunk (no splitting); bills 1,500–5,000 tokens → chunk at 2,000 tokens; bills over 5,000 tokens → chunk at the standard 1,200 tokens with overlap. Each tier uses the model configuration optimized for that range. This eliminates tail-chunk artifacts for medium-length bills and avoids over-fragmenting short bills.

Design complexity is higher — requires tiered configs and a routing layer — but the potential quality improvement for the 40% of bills that fall in the middle range is worth measuring.
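The routing layer itself can stay small. A minimal sketch of the three-tier policy described above follows; `ChunkPlan` and `route_bill` are hypothetical names, the tier boundaries are taken from the text (1,500 and 5,000 tokens are treated as belonging to the middle tier), and the overlap amount is left as a flag because no value is specified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkPlan:
    tier: str
    chunk_size: Optional[int]  # None means "extract as a single chunk"
    overlap: bool


def route_bill(token_count: int) -> ChunkPlan:
    """Pick a chunking plan from a bill's token length."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count < 1_500:
        return ChunkPlan(tier="short", chunk_size=None, overlap=False)
    if token_count <= 5_000:
        return ChunkPlan(tier="medium", chunk_size=2_000, overlap=False)
    return ChunkPlan(tier="long", chunk_size=1_200, overlap=True)


# The median production bill (1,051 tokens) is never split under this policy.
for tokens in (400, 1_051, 1_500, 4_972, 12_000):
    print(tokens, route_bill(tokens))
```

Keeping the policy as a pure function of token count makes it easy to replay against the corpus length distribution and count how many bills land in each tier before any GPU time is spent.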

○ Planned · New hardware required

Experiment 7 — Repeat on Production Hardware

The methodology in this study transfers cleanly to any hardware. The absolute throughput numbers do not — they are specific to the current NVIDIA L4 (22.5 GB VRAM, 72W TDP). When the new AI Solutions Team server arrives, run the same 54-cell concurrency sweep and compare cell-by-cell. The before/after comparison quantifies the hardware upgrade's actual value rather than estimating it from spec sheets.

The SLO gate calculation updates automatically: plug the new measured throughput into the same real-corpus chunks-per-bill numbers and the gap multiplier resolves to whatever the new hardware supports.

◆ The methodology is the asset

The model recommendation (Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24) is the immediate output. The evaluation methodology — controlled-variable sweep design, per-request telemetry, defensive automation, qualitative validation, SLO projection from real corpus distribution — is the durable contribution. It ports to new models, new hardware, and new workloads without redesign. Every future experiment in this pipeline runs against the same baseline.
