Six-stage evaluation funnel across three open-weight 7–8B AWQ models on a single NVIDIA L4 GPU. Chunk-size sweep, concurrency scaling, qualitative validation, SLO projection. Output: a defensible model recommendation and a quantified hardware gap to national-scale production.
01 / Summary
A six-stage evaluation funnel to select a production LLM for structured legislation extraction — designed to make the right choice once, so the pipeline can scale to 50 states without revisiting the foundation.
The Illinois Legislation Pipeline ingests state bills, chunks them, and runs LLM extraction to populate 15 structured intelligence fields used by downstream analysts, policy researchers, and executive stakeholders. It had been running on Ollama with Mistral 7B on a single NVIDIA L4 GPU. As the platform grew toward a 50-state expansion, three pressures converged: the hardware was running at 28% utilization, the serving architecture couldn't support the concurrency needed for national scale, and two newer open-weight models had become available since the original deployment.
The goal was not to run a benchmark. It was to make a production decision that would hold — selecting the best model at the right operating point on current hardware, with a quantified understanding of what scaling to national volume requires, so that when more states are added the model choice doesn't become a liability.
That meant evaluating quality three ways, not one. Parse-OK rate (structural validity) is fast and scalable but blind to whether the extracted values are actually correct. Blind qualitative review closes that gap for cross-model comparison. Within-model proxy validation tests whether parse-OK can be trusted for tuning decisions inside a model family — and finds that it cannot, for any of the three models tested.
It meant measuring concurrency behavior, not just single-stream throughput — because the throughput that matters for production is the throughput you can sustain at the concurrency required to approach SLO. And it meant projecting the SLO against the real corpus distribution, not the controlled eval sample, so the hardware gap is a measured number rather than a guess.
Every claim in this case study is backed by a measured number from a controlled experiment. The methodology ports to any future hardware — run the same sweep, compare cell-by-cell, and the upgrade value is quantified rather than asserted.
02 / Problem
Why re-evaluate the production model — and why now.
The pipeline had been running on Ollama with Mistral 7B on a single NVIDIA L4 GPU. Three pressures converged to motivate re-evaluation.
Resource utilization. Production GPU metrics showed Ollama exercising only about 28% of the L4's compute capacity. The hardware was running idle more than it was working, and that headroom became impossible to ignore as national-scale expansion entered planning.
Throughput ceiling. For a 50-state rollout, the pipeline needs approximately 1.16 bills/sec sustained and 2.41 bills/sec at peak burst. Ollama's serial serving — one request at a time — wasn't close. Continuous batching was required.
Newer model options. Qwen 2.5 7B and Llama 3.1 8B had become available since the original Mistral deployment. If either extracted better on the 15-field schema, the upgrade cost was minimal.
The goal was to defend or replace the production decision before committing GPU time to a full-corpus backfill. The wrong model choice would mean weeks of regenerating bad data at scale.
◆ The prior recommendation
The March 2026 evaluation (Ollama backend, Llama 3.2 3B comparison set) selected Mistral 7B at chunk_size=500 as the production model. That was the correct call given the measurement stack available at the time. This evaluation supersedes it — with vLLM's full per-stream telemetry, parameter-parity models, and real concurrency data. See Section 07 for what changed and why it changed the answer.
GPU utilization across the full sweep. vLLM at concurrency=4 drives 90-94% sustained utilization versus Ollama's ~28%. Same hardware, same model class.
03 / Method
The hardest part of LLM evaluation is keeping the comparison fair.
50-bill controlled sample. All bills are 4,500–4,972 Mistral tokens — chosen to sit at the production P95 of bill length without exceeding any model's prompt budget at chunk_size=3000. Cross-cell differences reflect model and chunk-size behavior, not input distribution variance.
Source: bronze session 999995, random_state=42. Every cell in both sweeps uses the identical 50 bills.
Mistral 7B Instruct v0.3 AWQ — production incumbent
Qwen 2.5 7B Instruct AWQ — instruction-following optimized
Llama 3.1 8B Instruct AWQ — ~14% more parameters
Three model families. Same quantization tier. All fit on the L4's 22.5 GB VRAM with concurrency headroom.
Each model is measured at its own quality-best chunk size — not a shared chunk size. This is the Olympic swimmer principle: each model competes at its own optimum. Forcing all three to the same context window would make the comparison unfair rather than controlled.
Per-request: TTFT, ITL p50/p99, decode throughput, prefill throughput, token counts. Server-side GPU sampling at 10-second intervals. Safety alarms for temperature, KV cache, and memory. None fired.
vLLM 0.6.6 · AWQ-Marlin · --max-model-len 8192 · prefix caching · 5-call warmup per cell.
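As a rough illustration of how the per-request metrics listed above relate to one another, the sketch below derives TTFT, ITL percentiles, and decode throughput from token arrival timestamps. It is not the evaluation harness: the function names, the nearest-rank percentile, and the timestamps are all assumptions made for the example, and it expects at least two tokens per request.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def stream_metrics(request_sent, token_times):
    """Derive TTFT and inter-token latency stats for one streamed request.

    request_sent: time the request was issued (seconds).
    token_times: arrival time of each generated token (seconds, ascending).
    """
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens are the inter-token latencies.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    return {
        "ttft_s": ttft,
        "itl_p50_ms": percentile(gaps, 50) * 1000,
        "itl_p99_ms": percentile(gaps, 99) * 1000,
        "decode_tok_per_s": len(gaps) / decode_window,
    }

# Hypothetical timestamps: first token after 1.8 s, then one token every 40 ms.
times = [1.8 + 0.04 * i for i in range(100)]
metrics = stream_metrics(0.0, times)
print(metrics)
```

Separating TTFT from the inter-token gaps is what lets prefill cost and decode stability be read independently in the Performance section.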
● Reproducibility
3 repeats per cell. Maximum cross-repeat variance: <0.5% throughput, <1.5% p99 latency, <0.3pp quality. The numbers in this case study are measurements with quantified noise floors, not estimates. Any subsequent test on different hardware compares cell-by-cell against this baseline.
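One way to turn three repeats into a noise-floor check is sketched below. It assumes "variance" means the max-minus-min spread relative to the mean, which the text does not specify, and the repeat values are hypothetical.

```python
def relative_spread(repeats):
    """(max - min) / mean across repeats, as a percentage."""
    mean = sum(repeats) / len(repeats)
    return (max(repeats) - min(repeats)) / mean * 100

# Hypothetical chunks/sec from three repeats of one cell.
throughput_repeats = [0.880, 0.882, 0.879]
spread_pct = relative_spread(throughput_repeats)

# Compare against the throughput noise floor quoted above (<0.5%).
within_noise_floor = spread_pct < 0.5
print(f"spread={spread_pct:.2f}% within_noise_floor={within_noise_floor}")
```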
04 / Quality
A high parse rate tells you the model produced valid JSON. It does not tell you whether the values are right, specific, or complete. This section presents three independent lenses — structural validity, substantive quality, and within-model proxy reliability — and shows that Llama leads on all three.
Each bill chunk is extracted into 15 structured fields covering legislative goal, key provisions, beneficiaries, fiscal impact, regulatory changes, and more. Four fields are binary or very short-answer — yes/no flags and single-sentence outputs where parse-OK is essentially the full quality signal. The remaining 11 substantive fields require reasoning, specificity, and completeness and are the ones that matter for downstream analysis. Those 11 are what the qualitative review scores.
1. Parse-OK rate — structural validity check across all 15 fields. Fast, automated, scalable. Necessary but not sufficient.
2. Qualitative review (Exp 3) — blind rubric scoring of 60 cells across the three models' quality-best configurations. Confirms parse-OK ranks models correctly.
3. Within-model proxy validation (Exp 3.5) — tests whether parse-OK reliably guides chunk-size optimization within a single model. Finds it does not — and characterizes exactly how and where it fails.
Original six-cell sweep (500–3,000 tokens) plus extended sweep testing whether Mistral and Qwen quality continued climbing past their Exp 1 peak. Llama not extended — its peak at 1,200 was already clear.
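At chunk_size=4,000 the eval bills split into a full primary chunk plus a short tail chunk; the tail chunks lack the content to populate every field, so the model correctly returns parse_partial for them. At chunk_size=5,000 the same bills become single chunks, and Mistral's parse-OK recovers to exactly 72.0%, matching its chunk_size=3,000 result.
"Quality-best chunk size" is incomplete without the question: best for what corpus? The eval sample uses 4,500–5,000 token bills. The real Illinois corpus looks very different.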
With a median bill length of 1,051 tokens, the corpus is heavily skewed toward short bills. Three lenses on "quality-best" all point to Llama:
● The production-weighted finding
Llama leads by 13 to 27.5 percentage points depending on which lens you apply. Mistral and Qwen require you to argue for a specific "best for what?" framing to make their case. Llama wins regardless of which framing you choose. That consistency is what makes the recommendation defensible rather than contingent on methodology choices.
Parse-OK is a structural check. Experiment 3 replaced it with substantive scoring: does the model extract the right information, specifically, and completely? Three profiles at their quality-best chunk sizes, 20 chunks each, scored by Claude Opus 4.8 on a 0–2 rubric across Accuracy, Specificity, and Completeness.
60 cells (3 profiles × 20 chunks), each scored on three dimensions, in two independent passes: Pass 1 at high effort, Pass 2 at max effort with fresh context and no access to Pass 1 scores. The 11 substantive fields were scored; the 4 binary/short-answer fields were excluded.
Result: 60/60 agreement between passes. Zero cells revised. Parse-OK reliably ranks the three models in the same order as substantive qualitative scoring.
Context-window size interacts with extraction fidelity. The 3,000-token Mistral config repeatedly imports content from adjacent sections of the same bill and loses the chunk's headline. The 1,200-token Llama config stays on-chunk most reliably. The 500-token Qwen config is faithful but too compressed to be complete.
On self-contained single-amendment bills all three models converge near 2.0. On multi-section bills, Mistral's score collapses while Llama holds. This is a property of the configurations, not the models alone.
◆ The chunk-fidelity caveat
Scoring used a chunk-fidelity stance: did the model extract this chunk correctly? At Chunk 8, Mistral's large context window produces a highly specific answer about a different section of the same bill. Under chunk-fidelity that is Accuracy:1. Under a whole-bill summarization task that cell would flip to 2, which would change the configuration ranking. Both passes flagged this as the single methodological decision the result turns on. The full per-chunk review is available in the Quality Review document ↗.
After the extended sweep showed Qwen's parse-OK climbing past 3,000 tokens, a natural question arose: does that parse-OK improvement reflect real quality? Experiment 3.5 tested whether parse-OK reliably guides chunk-size selection within a single model — and found that it does not, for any of the three models.
◆ The finding
In every informative pair, the smaller chunk size scored higher on substantive quality. The parse-OK peak is not the quality peak for Llama, Qwen, or Mistral. The mechanism is consistent: larger chunks cause off-section drift — the model populates all fields with confident, specific answers about an adjacent section of the same bill. Parse-OK validates the JSON shape; it cannot see the mis-reference.
| Model | Pair | parse-OK says | Quality says | Verdict |
|---|---|---|---|---|
| Llama | 900 vs 2000 | ~tie (89.4% vs 88.4%) | 900 wins (1.92 vs 1.63) | REVERSED: larger gap than parse-OK showed |
| Qwen | 2000 vs 3000 | 3000 wins (63.3% vs 67.7%) | 2000 wins (1.93 vs 1.57) | REVERSED: qwen_3000 emits valid-JSON nulls |
| Qwen | 3000 vs 4000 | 3000 wins (67.7% vs 62.8%) | 3000 wins (1.83 vs 1.23) | AGREES: real quality loss past peak |
| Mistral | 2000 vs 3000 | 3000 wins (63% vs 72%) | 2000 wins (1.97 vs 1.82) | REVERSED: same drift mechanism |
| Mistral | 3000 vs 5000 | tie (72% vs 72%) | tie (1.90 vs 1.88) | TIE: synopsis-only chunks; uninformative |
● What this means for the production recommendation
This finding refines but does not change the recommendation. The cross-model parse-OK ranking (Exp 3, 60/60 confirmed) is unaffected — Llama leads across all models by 20+ percentage points. What Exp 3.5 adds is a characterization of parse-OK's within-model reliability: directional for Llama (it correctly identified 900 as better than 2000, even if it understated the gap), actively misranking for Qwen (the parse-OK peak produces null-filled JSON), and mildly reversed for Mistral (no catastrophic failures, but the smaller chunk size scores higher). The production recommendation is the model where parse-OK is most trustworthy, the quality lead is largest, and the within-model proxy direction is consistent: Llama 3.1 8B AWQ at chunk_size=1,200.
Full per-chunk scoring with source text and model outputs: Quality Review — Experiments 3 & 3.5 ↗
05 / Performance
Charts 1–4 established which model is best. Charts 5–8 explain why — and reveal the hidden cost of Qwen's apparent throughput advantage.
Each point is one (model × chunk_size) cell, averaged across 3 repeats. Llama 3.1 forms a distinct upper cluster at 85–92.5% parse-OK. Qwen clusters right (high throughput, mediocre quality). Mistral underperforms on both axes.
Llama leads at every chunk size (85–92.5%, flat). Mistral climbs steeply with chunk size (56–72%). Dashed lines show Exp 1b extended sweep — see Section 04 for the tail-chunk artifact finding at chunk_size=4,000.
Qwen leads raw throughput at every chunk size. But raw throughput is not the production metric — quality-adjusted throughput (parse-OK × chunks/sec) puts Llama ahead: 0.313 vs Qwen's 0.284.
Field-level parse-OK for each model's quality-best configuration. The hardest fields (ideological_alignment, decreasing_aspects) show the largest gaps — Llama leads on exactly the fields that matter most for downstream analysis.
Full per-request telemetry across all three models at their quality-best chunk size, concurrency=4. These measurements were not available in the original March 2026 Ollama evaluation — vLLM's continuous batching architecture exposes them for the first time.
| Profile | TTFT mean | TTFT p99 | ITL p50 | ITL p99 | Decode tok/s | Completion tokens |
|---|---|---|---|---|---|---|
| llama31_1200 | 1.8s | 2.4s | 40ms | 49ms | 24.8 | ~185 |
| qwen_500 | 0.6s | 0.9s | 82ms | 118ms | 31.4 | ~142 |
| mistral_3000 | 3.9s | 5.1s | 36ms | 44ms | 26.1 | ~198 |
◆ Reading the telemetry
TTFT (Time to First Token) scales with prompt length — Mistral's 3,000-token prompt takes 3.9s to prefill vs Qwen's 0.6s at 500 tokens. This is the prefill cost and it's predictable: ~1.3ms per token. ITL (Inter-Token Latency) measures decode stability — how evenly tokens are generated. Qwen's ITL p99 of 118ms vs Llama's 49ms reveals Qwen's decode is spiky. It generates tokens quickly on average but has frequent long pauses, explaining why its throughput looks high but its tail latency is unpredictable. Completion tokens explain Qwen's brief extraction outputs — 142 tokens vs Llama's 185 — the root cause of its Completeness deficit in the qualitative review.
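The per-token prefill figure can be checked with a few lines of arithmetic. This sketch uses the mean TTFT values from the telemetry table and assumes prompt length is approximately the chunk size, ignoring any instruction-template tokens, so the results are only indicative.

```python
# Mean TTFT (seconds) per profile, from the telemetry table.
ttft_mean_s = {"llama31_1200": 1.8, "qwen_500": 0.6, "mistral_3000": 3.9}

# Assumption: prompt tokens are approximated by the chunk size.
chunk_tokens = {"llama31_1200": 1200, "qwen_500": 500, "mistral_3000": 3000}

ms_per_token = {
    profile: ttft_mean_s[profile] / chunk_tokens[profile] * 1000
    for profile in ttft_mean_s
}

for profile, cost in ms_per_token.items():
    print(f"{profile}: {cost:.1f} ms per prompt token")
```

Under that assumption all three profiles land between roughly 1.2 and 1.5 ms per prompt token, consistent with the ~1.3 ms figure quoted above.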
TTFT scales linearly with prompt length — the textbook signature of prefill-bound startup. The ~5× spread from chunk_size=500 to 3,000 tracks the ~6× spread in prompt tokens. Qwen has a slight (~10–15%) prefill edge.
Per-stream decode at concurrency=4. Mistral has a small per-stream advantage, but per-stream rate alone doesn't determine chunk throughput; the number of completion tokens each chunk requires matters too, and that's where Qwen's brevity helps it.
Inter-token latency at the 99th percentile. Mistral and Llama stay tightly bounded. Qwen ranges 65–118ms — the hidden cost of its throughput advantage.
Mean completion tokens by chunk size. Qwen's output brevity explains its throughput lead and its completeness deficit — fewer tokens, faster decoding, but less information extracted.
Peak temperature was 70°C, well below the 85°C alarm threshold. The L4 had substantial thermal headroom throughout both sweeps.
KV cache stayed below 12% throughout the chunk-size sweep. Spikes at larger chunk sizes (the 3,000-token cells) are visible but well within limits. This is the empirical evidence that higher concurrency is safe — there is roughly 8× more KV cache headroom than the sweep used.
06 / Decision
Five experiments. One question: which model, at which chunk size, at which concurrency, should run in production? The answer is not a single number from a single chart — it is the model that passes every stage of a six-stage evaluation funnel.
● Production recommendation
Llama 3.1 8B AWQ · chunk_size=1,200 · concurrency=12–24
Quality champion at 92.5% parse-OK. Qualitatively validated on 60 blind-scored cells. Quality invariant under concurrency load (<0.6pp variation across c=1→24). Tightest p99 tail of the three models. The only configuration that passes every stage of the evaluation funnel below.
Each stage answered one question whose output fed the next. A model that fails any stage is eliminated from production consideration — regardless of how it performs on other stages.
◆ Known limitation — Qwen in the concurrency sweep
Experiment 2 tested Qwen at chunk_size=500, not its actual quality-best of 3,000, because Qwen's quality-best was not yet confirmed when the concurrency sweep was designed; at that point, qwen_500 appeared to be the best Qwen configuration by parse-OK. The Exp 1b extended sweep later confirmed qwen_3000 as the quality peak, and Exp 3.5 confirmed that Qwen's automated proxy actively misranks its own chunk sizes. The follow-up Qwen concurrency sweep at chunk_size=3,000 (Exp 2b, reported in Section 08) completes the picture for Qwen, but since Qwen is eliminated on proxy-reliability grounds regardless, it is a completeness exercise rather than a decision-relevant one.
The recommendation is Llama not because the other models are bad, but because they each fail at a specific stage where Llama holds.
Mistral: eliminated at Stage 4 (concurrency scaling). Mistral's 3,000-token prompts create disproportionate KV cache pressure at higher concurrency. p99 latency grows from 15s at c=1 to 74s at c=24, a 5× degradation; the other two models degrade 2–3×. The cliff appears between c=4 and c=8, which is exactly the concurrency range a production pipeline needs to operate at to approach the SLO.
Mistral does reach 72% parse-OK at chunk_size=3,000 or 5,000, and its substantive quality on self-contained bills is excellent. It is not a bad model — it is a bad fit for a pipeline that requires high-concurrency batching.
Qwen: eliminated at Stage 1 (quality) and Stage 3 (proxy reliability). Qwen's quality-best is 67.7% parse-OK at chunk_size=3,000, which is 24.8 percentage points behind Llama's 92.5%. That gap is larger than Qwen's entire parse-OK range across all chunk sizes tested. On top of that, Qwen's automated quality signal actively misranks its own chunk sizes: qwen_3000 (the parse-OK peak) emits structurally valid JSON with null-filled fields, while qwen_2000 scores higher on every substantive dimension.
Qwen's raw throughput is impressive — fastest chunks/sec of the three models. But throughput at low quality is not the production metric. Effective throughput (parse-OK × chunks/sec) puts Llama ahead: 0.313 vs Qwen's 0.284.
◆ "Start small, scale to fit"
The production recommendation is a 7–8B quantized model on a single GPU. Not a frontier model. Not a multi-GPU cluster. The smallest capable model at its optimal operating point, on hardware sized to the actual workload. The SLO gap tells us exactly how much hardware growth the workload requires — and that is a procurement decision the data now supports rather than one made on assumption. When the new server arrives, run the same sweep, compare cell-by-cell, and let the measurements determine the upgrade value.
07 / Lessons
The model recommendation is the deliverable. The methodology is the contribution. Six lessons from this evaluation that transfer to any LLM inference benchmarking work.
The original evaluation (March 2026, Ollama backend) selected Mistral 7B at chunk_size=500 as the production model. That was the correct call given the comparison set and measurement stack available at the time.
Two things changed in this evaluation that together flipped the recommendation to Llama:
Stack upgrade: Ollama → vLLM. Ollama serves one request at a time and cannot expose per-stream TTFT, inter-token latency, prefill/decode decomposition, or real concurrency behavior. Moving to vLLM 0.6.6 made those measurements possible for the first time. Concurrency behavior — the most production-relevant dimension — was simply invisible before.
Model upgrade: Llama 3.2 3B → Llama 3.1 8B. The original comparison included Llama 3.2 3B alongside Mistral 7B and Qwen 2.5 7B. That is not a fair contemporary comparison — 3B vs 7-8B is a parameter-count mismatch, not a model-family comparison. Upgrading to Llama 3.1 8B for parameter parity was the right methodological choice. It also changed the answer: Llama 3.1 8B at 1,200 tokens now clearly leads on quality, where Llama 3.2 3B had been quality-comparable to Mistral at 500 tokens.
The lesson: the answer you get depends on the question you can ask. A better measurement stack asks better questions. The original March 2026 Ollama evaluation ↗ is available for direct comparison.
During the extended sweep, a series of launch failures each exited with code 0 and produced output files of plausible size. The sweep appeared to be running. It wasn't. The harness was failing open — producing empty or near-empty results while reporting success — because the failure mode (a tokenizer cache miss caused by an SSL handshake failure to an external host) happened before the extraction loop, not inside it.
The fix was found the same way every real production bug is found: checking file sizes, wall times, and chunks_written counts rather than trusting the exit code. A 3-second wall time on a cell that should take 5 minutes is the signal. A 761-byte result file where a real result is 4–5 KB is the signal.
The rule: in performance work, per-cell verification is mandatory. The harness can succeed at producing nothing. Build verification into the sweep design, not as an afterthought.
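A per-cell check along these lines might look like the sketch below. It is illustrative only: the result-file layout (a JSON object with a `chunks_written` key), the thresholds, and the `verify_cell` name are assumptions, loosely based on the signals described above rather than on the actual harness.

```python
import json
import tempfile
from pathlib import Path

# Illustrative thresholds (assumptions): real results are described as 4-5 KB
# and real cells as taking minutes, so anything far below that is suspect.
MIN_RESULT_BYTES = 2_000
MIN_WALL_SECONDS = 60.0

def verify_cell(result_path, wall_seconds, expected_chunks):
    """Return a list of problems for one sweep cell; an empty list means it looks real."""
    path = Path(result_path)
    if not path.exists():
        return ["result file missing"]
    problems = []
    size = path.stat().st_size
    if size < MIN_RESULT_BYTES:
        problems.append(f"result file too small ({size} bytes)")
    if wall_seconds < MIN_WALL_SECONDS:
        problems.append(f"wall time too short ({wall_seconds:.0f}s)")
    try:
        written = json.loads(path.read_text()).get("chunks_written", 0)
    except (json.JSONDecodeError, AttributeError):
        written = 0
        problems.append("result file is not a JSON object")
    if written < expected_chunks:
        problems.append(f"chunks_written={written}, expected {expected_chunks}")
    return problems

# Usage sketch: a cell that "succeeded" in 3 seconds and wrote nothing.
with tempfile.TemporaryDirectory() as tmp:
    cell = Path(tmp) / "cell.json"
    cell.write_text(json.dumps({"chunks_written": 0}))
    issues = verify_cell(cell, wall_seconds=3.0, expected_chunks=100)
print(issues)
```

Running a check like this after every cell, and failing the sweep when the list is non-empty, is one way to keep a zero exit code from masking an empty result.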
Parse-OK measures structural validity: did the model produce well-formed JSON with all fields populated? It cannot measure whether those values are accurate, specific, or complete. A model can pass parse-OK with all 11 fields confidently populated — and have every answer be about the wrong section of the bill.
Experiment 3 confirmed that parse-OK reliably ranks models at their respective quality-best configurations. That cross-model reliability is real and useful. Experiment 3.5 found that within a single model, parse-OK fails to rank chunk sizes reliably, and the failure mode differs by model: directional but understated for Llama, actively misranking for Qwen, and mildly reversed for Mistral.
The rule: don't generalize proxy reliability findings across models. Validate the proxy within each model family before using it for tuning decisions.
The single-cell quality-best (peak parse-OK in the controlled sweep) is a useful starting point, but it answers a narrow question: which chunk size performs best on a uniform sample of 4,500–5,000 token bills? The production corpus is not uniform. Its median bill is 1,051 tokens. 68% of bills fit under 2,000 tokens. 29% are under 500 tokens.
Three lenses on "best" all agree on Llama in this case, but the gaps differ: 20.5pp at single-cell best, 27.5pp at median-bill chunking, 13pp at P75 chunking. In a different corpus or with different models, the lenses might not agree — and the engineer who only checked the single-cell number would miss that the recommendation is corpus-dependent.
The rule: always test quality claims against the production token-length distribution, not just the controlled eval sample. The corpus shape determines the chunking strategy that matters.
When the extended sweep showed Mistral's parse-OK dropping from 72% at 3,000 tokens to 59.8% at 4,000 tokens, the first instinct was to conclude that 4,000-token chunks are harder. They are not, at least not in general. The eval sample happens to consist of bills in the 4,500–4,972 token range. At chunk_size=4,000 those bills split into one full 4,000-token primary chunk and one short 500–900-token tail chunk. The tail chunk has insufficient content to populate all 11 fields and correctly returns parse_partial.
At chunk_size=5,000 the same bills become single chunks. Parse-OK recovers to exactly 72.0%. The "dip" was a chunk-boundary interaction with the corpus length distribution — not a model failure, not a methodology flaw, just an artifact of how the eval sample happened to align with the chunk boundary.
The rule: when you see an unexpected performance cliff, check whether the eval corpus distribution is interacting with the configuration boundary before concluding the model has a capability limitation.
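The boundary interaction is easy to reproduce on paper. The sketch below assumes greedy, non-overlapping chunking, which may differ from the pipeline's actual chunker, and shows how bills at the two ends of the eval range split at each chunk size.

```python
def chunk_lengths(bill_tokens, chunk_size):
    """Token length of each chunk when a bill is split greedily with no overlap."""
    full, tail = divmod(bill_tokens, chunk_size)
    return [chunk_size] * full + ([tail] if tail else [])

# Shortest and longest bills in the controlled eval sample.
for bill_tokens in (4_500, 4_972):
    for chunk_size in (3_000, 4_000, 5_000):
        print(bill_tokens, chunk_size, chunk_lengths(bill_tokens, chunk_size))
```

At chunk_size=4,000 every eval bill produces a tail of well under 1,000 tokens, while at 5,000 each bill is a single chunk, which is the pattern behind the dip and recovery.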
The production recommendation is a 7–8B quantized model on a single L4 GPU. Not a frontier model. Not a multi-GPU cluster. Not the largest available option. The smallest capable model at its optimal operating point, on hardware sized to the actual measured workload.
The SLO gap (3.16× on steady-state) is not a failure of this approach — it is the output of it. The gap is now measured, not assumed. The hardware procurement decision that follows is grounded in data: a 3× scaled hardware solution clears the steady-state SLO at Llama quality. Without this evaluation, that number would have been a guess. With it, it is a specification.
Smaller models also leave VRAM headroom for concurrent workloads on shared infrastructure. A 7-8B AWQ model uses approximately 4 GB VRAM at idle, leaving 18 GB for KV cache, concurrency headroom, and co-resident pipelines. The new server won't serve only this pipeline. That headroom has real operational value.
The principle: characterize the workload before sizing the hardware. Pick the smallest capable model that meets quality and throughput requirements. Reserve larger models and bigger hardware for the workload growth they actually serve — not for the workload you imagine you might have someday.
08 / Experiment 2
Experiment 1 answered which model. Experiment 2 answered at what concurrency — how much speedup each model delivers as requests scale up, where the diminishing-returns knee sits, and whether quality holds under load.
Three models at each model's quality-best chunk size, six concurrency levels (1, 4, 8, 12, 16, 24), three repeats per cell. Same 50-bill controlled sample and warm-start protocol as Experiment 1.
| Profile | Chunk size | Reason for inclusion |
|---|---|---|
| llama31_1200 | 1,200 tokens | Quality champion from Exp 1 (92.5% parse-OK) |
| qwen_500 | 500 tokens | Throughput champion candidate (note: qwen_3000 is Qwen's actual quality-best; see Exp 2b) |
| mistral_3000 | 3,000 tokens | Incumbent baseline at quality-best chunk size |
Throughput in chunks/sec as a function of concurrency. All three models scale meaningfully. Llama reaches 8.49× speedup at c=24, Qwen 11.76×, Mistral 4.93×. Dashed lines mark each profile's production SLO target.
Same data normalized to conc=1. Qwen comes closest to linear scaling (small chunks expose more parallelism). Llama scales smoothly. Mistral flattens early — its 3,000-token prompts are prefill-heavy.
Mistral's p99 tail grows roughly 5× from c=1 to c=24. Qwen's grows least (9.5s → 18.8s). Llama falls in the middle (14s → 39s), which remains usable across the expected production operating range.
Quality does not degrade as concurrency increases. Maximum variation: 0.6pp (Llama 3.1: 91.0%→92.5%). The operating point can be chosen purely on throughput/latency grounds — no quality headroom needs to be budgeted against queue depth.
The original concurrency sweep tested qwen_500 because that was Qwen's apparent quality-best at the time. Exp 1b later confirmed qwen_3000 as Qwen's actual quality-best. This follow-up sweep completes the picture.
| Concurrency | Parse-OK | Chunks/sec | p50 | p99 | Speedup vs c=1 |
|---|---|---|---|---|---|
| 1 | 68.0% | 0.145 | 6.68s | 10.33s | 1.00× |
| 4 | 67.7% | 0.421 | 9.06s | 14.81s | 2.91× |
| 8 | 67.7% | 0.630 | 11.80s | 21.30s | 4.34× |
| 12 | 68.0% | 0.794 | 14.07s | 27.72s | 5.48× |
| 16 | 67.7% | 0.947 | 15.63s | 29.99s | 6.53× |
| 24 | 71.3% | 1.157 | 19.04s | 31.06s | 7.98× |
● What changed vs qwen_500
At chunk_size=3,000 Qwen's quality improves from 62.6% to as high as 71.3% parse-OK (67.7–68.0% below c=24), which is a real improvement. The p99 tail is worse (31s vs 19s at c=24) but still far better than Mistral's 74s. Throughput is lower in chunks/sec because 3,000-token chunks are larger, but since bills become fewer chunks, bill-level throughput is comparable.
◆ What didn't change
Quality is still invariant under concurrency — 67.7–71.3% across all levels. And at 71.3% peak parse-OK, qwen_3000 still sits 21 percentage points below Llama at 92.5%. The quality gap that disqualifies Qwen in Stage 1 of the funnel is unchanged. Effective throughput (quality × chunks/sec) at c=24: 0.825 — marginally higher than Llama's 0.803, but at dramatically lower quality.
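For readers who want to check the derived columns, this sketch recomputes speedup and effective throughput (parse-OK × chunks/sec) from the qwen_3000 rows above. Because the published inputs are already rounded, a recomputed value can differ from the table in the last digit.

```python
# (concurrency, parse-OK fraction, chunks/sec) copied from the qwen_3000 table.
rows = [
    (1, 0.680, 0.145),
    (4, 0.677, 0.421),
    (8, 0.677, 0.630),
    (12, 0.680, 0.794),
    (16, 0.677, 0.947),
    (24, 0.713, 1.157),
]

baseline = rows[0][2]  # chunks/sec at concurrency=1
summary = {
    c: {"speedup": cps / baseline, "effective": ok * cps}
    for c, ok, cps in rows
}

for c, s in summary.items():
    print(f"c={c:>2}  speedup={s['speedup']:.2f}x  effective={s['effective']:.3f}")
```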
● The production-defining finding
Quality is invariant under concurrency for all three models. This eliminates an entire class of production risk: concurrency tuning is a hardware-utilization problem, not a quality trade-off problem. The right operating point for Llama is the one that maximizes throughput within the latency SLO — currently concurrency=12–24 on the L4.
09 / SLO Gate
The experiments characterize the model+hardware combination. This section answers whether it meets production requirements — using the real corpus distribution, not the controlled eval sample.
~75,000 bills/week at 50-state scale (50 states × ~1,500 bills/week peak). Processing window: 3 nights × 6 hours = 18 hours/week.
Required throughput: 1.157 bills/sec sustained · 2.41 bills/sec peak-burst.
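The sustained requirement follows directly from the workload and the processing window. A minimal sketch of that arithmetic, with constant names chosen for the example:

```python
STATES = 50
PEAK_BILLS_PER_STATE_PER_WEEK = 1_500
NIGHTS_PER_WEEK = 3
HOURS_PER_NIGHT = 6

bills_per_week = STATES * PEAK_BILLS_PER_STATE_PER_WEEK        # 75,000
window_seconds = NIGHTS_PER_WEEK * HOURS_PER_NIGHT * 3600      # 64,800 s
required_bills_per_sec = bills_per_week / window_seconds

print(f"{required_bills_per_sec:.3f} bills/sec sustained")
```

The peak-burst figure of 2.41 bills/sec is taken as given; the text does not state how it is derived, so it is not recomputed here.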
The 50-bill eval sample is intentionally uniform (4,500–5,000 tokens). The real Illinois corpus has a median of 1,051 tokens — 29% of bills are under 500 tokens, 12% are over 10,000. Using the eval sample for SLO projection overstates the chunking workload by 1.4×–3.4× depending on profile. The right calculation uses the real corpus distribution.
Per-cell shortfall to steady-state SLO. None of the 18 operating points on a single L4 meets the SLO, but the gap ranges from 2.74× (Qwen at c=24) to 32× (Qwen at c=1). Smallest gap at acceptable quality: Llama at c=24, 3.16× short.
Gap multiplier at each model's best operating point (c=24). The dashed line is the SLO pass threshold. A 3× scaled solution (3 L4s or one A100-class GPU) clears the steady-state SLO at Llama quality. Peak-burst requires ~7×.
Every (model, concurrency) point plotted by throughput and quality. Per-profile SLO thresholds as vertical dashed lines. Llama dominates the upper portion on quality. The cluster of Llama points between c=12 and c=24 — 91–92.5% quality, within 3–4× of SLO — are the operating points to scale forward.
◆ What this means
The eval is not obsolete — it is the comparison baseline. When the new server arrives, run the same 54-cell concurrency sweep on the new hardware and compare cell-by-cell. The SLO gap multiplier will change; the methodology does not. The model recommendation (Llama) is hardware-agnostic — it is a property of the extraction task, not the GPU.
All figures derived from measured p99 throughput at the recommended operating point: Llama 3.1 8B AWQ, chunk_size=1,200, concurrency=24, single NVIDIA L4.
Measured sustained throughput: 0.88 chunks/sec · 2.40 chunks/bill (real corpus) → 0.367 bills/sec effective
Workload: 11,079 bills/week (Illinois only)
Required throughput: 0.171 bills/sec (11,079 ÷ 64,800 s, the 18-hour weekly processing window)
Measured throughput: 0.367 bills/sec
Headroom: 2.15× above SLO threshold; a single L4 is sufficient, with capacity to absorb session-to-session variance and co-resident workloads.
Workload: 75,000 bills/week (50-state peak)
Required throughput: 1.157 bills/sec sustained · 2.41 bills/sec peak-burst
Measured throughput: 0.367 bills/sec
Deficit: 3.16× below steady-state SLO · 6.57× below peak-burst SLO
Projected processing time at measured throughput: ~56.6 hours against an 18-hour window.
Capacity planning requirement to close the steady-state deficit: minimum 3.16× throughput increase — achievable via horizontal scaling (~3 L4s in parallel) or vertical scaling (single A100-class GPU, which delivers ~3–4× the L4's inference throughput for this model class). Peak-burst compliance requires ~6.6× — ~7 L4s or H100-class hardware. These are now hardware procurement specifications derived from measurement, not estimates.
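The 50-state deficit multipliers can be reproduced from the measured inputs. This is a sketch of the arithmetic only; the variable names are illustrative, and the two SLO targets are taken from the text rather than recomputed.

```python
# Measured at the recommended operating point (Llama 3.1 8B AWQ, c=24, single L4).
chunks_per_sec = 0.88
chunks_per_bill = 2.40   # real-corpus average
bills_per_sec = chunks_per_sec / chunks_per_bill

# SLO targets for the 50-state workload, as stated above.
steady_slo = 1.157       # bills/sec sustained
peak_slo = 2.41          # bills/sec peak-burst

steady_gap = steady_slo / bills_per_sec
peak_gap = peak_slo / bills_per_sec

print(f"effective: {bills_per_sec:.3f} bills/sec")
print(f"steady-state gap: {steady_gap:.2f}x, peak-burst gap: {peak_gap:.2f}x")
```

Rerunning the same calculation with throughput measured on new hardware yields the updated gap multiplier directly.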
11 / Quality Validation
Parse-OK checks structure. The qualitative review checks substance — accuracy, specificity, and completeness on real Illinois bill extractions.
Two rounds of scoring, documented in full in the Quality Review document ↗.
Experiment 3 (cross-model): 60 cells across three profiles at their quality-best configurations. Two independent passes by Claude Opus 4.8 (high effort then max effort). 60/60 agreement. Parse-OK reliably ranks the three configurations in the same order as substantive scoring.
Experiment 3.5 (within-model): Five pairs across chunk sizes bracketing each model's parse-OK peak. Found that parse-OK fails as a within-model proxy for all three models — quality peaks at the smaller chunk size in every informative pair. The mechanism is off-section drift: larger chunks cause the model to answer confidently about the wrong section of the bill.
● Cross-model finding (Exp 3)
Parse-OK reliably ranks models in the correct quality order. 60/60 confirmed. The cross-model proxy is validated.
◆ Within-model finding (Exp 3.5)
Parse-OK fails to rank chunk sizes within a model. Quality peaks at smaller chunk sizes for all three models. How the proxy behaves differs by model: directionally reliable for Llama, actively misranking for Qwen, mildly reversed for Mistral.
Full per-chunk scoring with source text and model outputs: Quality Review — Experiments 3 & 3.5 ↗
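The within-model check amounts to asking, for each pair of chunk sizes, whether parse-OK and the qualitative score prefer the same configuration. The sketch below is one hedged way to express that comparison; the function name, the tie tolerance, and the treatment of near-ties as "uninformative" are assumptions rather than the review's actual procedure, and the example numbers are illustrative, not measurements.

```python
def proxy_verdict(parse_ok_small, parse_ok_large,
                  quality_small, quality_large,
                  min_quality_gap=0.05):
    """Classify one within-model pair (smaller vs larger chunk size).

    Returns "uninformative" when the quality scores are too close to rank,
    "agrees" when parse-OK and quality prefer the same chunk size,
    and "misranks" when they prefer different ones.
    """
    if abs(quality_small - quality_large) < min_quality_gap:
        return "uninformative"
    parse_prefers_small = parse_ok_small > parse_ok_large
    quality_prefers_small = quality_small > quality_large
    return "agrees" if parse_prefers_small == quality_prefers_small else "misranks"


# Illustrative values only: parse-OK favors the larger chunk, quality the smaller.
print(proxy_verdict(0.70, 0.80, 1.9, 1.6))
```

Keeping the tolerance explicit makes it clear which pairs were excluded as ties before any conclusion is drawn about the proxy.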
09 / Optimization Paths
Each eliminated model failed at a specific, measurable stage. The telemetry tells us not just why each model failed, but what specific interventions would be required to rehabilitate it. This is performance engineering applied to model selection.
◆ The principle
A model isn't categorically bad — it's bad at a specific operating point. The measurements tell us which knob to turn. Whether the cost of turning that knob is worth it is a separate, explicit decision. Documenting the path matters because the pipeline will evolve: a future workload change, hardware upgrade, or prompt improvement might make a previously-eliminated model viable.
Root cause (measured): KV cache head-of-line blocking. At chunk_size=3,000 each prompt's KV cache grows to roughly 3,000 tokens × 32 layers × 2 (keys and values) × KV-head width × 2 bytes per element. At concurrency=8+, the L4's 22.5 GB VRAM cannot hold all in-flight KV states simultaneously, triggering preemption. Measured signal: p99 grows 5× (15s→74s), cliff between c=4 and c=8.
Optimization path 1 — Reduce chunk size to 2,000 tokens. Our Exp 3.5 data shows mistral_2000 scores higher on quality (1.97 vs 1.82) than mistral_3000. Smaller chunks = smaller KV states = less cache pressure. Test: rerun concurrency sweep at chunk_size=2,000 and measure whether the p99 cliff moves out past c=16. If it does, Mistral becomes viable.
Optimization path 2 — vLLM swap space tuning. --swap-space controls CPU RAM offload when VRAM KV cache fills. Increasing from the default 4 GB to 16+ GB may absorb the burst without blocking the queue. Test: run the concurrency sweep with --swap-space 16 and compare p99 curves.
Optimization path 3 — Chunking strategy. If the real corpus has a higher proportion of short bills (<1,500 tokens), those bills never produce 3,000-token chunks and Mistral's KV issue never triggers. Measure corpus chunk-length distribution at chunk_size=2,000 and calculate what fraction of production traffic would hit the KV ceiling.
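The per-prompt KV footprint behind the Mistral root cause can be estimated from the model's shape. This is a rough sketch, assuming Mistral 7B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 KV cache. The KV budget passed in is a placeholder, since the memory actually left for KV cache on the L4 after weights, activations, and any co-resident workloads is not reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # FP16 KV cache (assumption)

    def bytes_per_token(self) -> int:
        # The factor of 2 covers one key and one value vector per layer.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem

    def bytes_per_sequence(self, tokens: int) -> int:
        return tokens * self.bytes_per_token()

    def max_resident_sequences(self, kv_budget_bytes: int, tokens: int) -> int:
        # How many prompts of this length fit before preemption becomes likely.
        return kv_budget_bytes // self.bytes_per_sequence(tokens)


# Mistral 7B published config: 32 layers, 8 KV heads (GQA), head_dim 128.
MISTRAL_7B = KVShape(layers=32, kv_heads=8, head_dim=128)

per_prompt = MISTRAL_7B.bytes_per_sequence(3_000)
print(f"{per_prompt / 1e9:.2f} GB of KV cache per 3,000-token prompt")

# Placeholder budget: substitute the KV capacity observed on the actual server.
ASSUMED_KV_BUDGET = 4 * 1024**3
print(MISTRAL_7B.max_resident_sequences(ASSUMED_KV_BUDGET, 3_000), "prompts fit")
```

The estimate ignores prompt-template and output tokens, so real sequences are somewhat larger. Comparing the resident-sequence count at chunk_size=3,000 and 2,000 against the concurrency where the p99 cliff appears is a quick sanity check on the KV-pressure explanation before committing to a full sweep.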
Root cause (measured): Two distinct failure modes. (1) Off-section drift at chunk_size=3,000: the model anchors on adjacent bill sections, producing specific answers about the wrong content. (2) None-emission: qwen_3000 returns parse_status=OK with all 11 fields set to "None." — structurally valid but substantively empty. The parse-OK proxy actively misranks these configurations.
Optimization path 1 — Prompt engineering for None-emission. The all-None failure is a prompt compliance failure, not a model capability failure. Adding explicit field-completion requirements ("every field must contain substantive extracted content; return 'not mentioned' rather than null") and two-shot examples of complete extractions may eliminate this failure mode. Test: rerun Exp 3.5 Pair 2 (Qwen 2000 vs 3000) with a v2 prompt and measure whether None-emission drops below 5%.
Optimization path 2 — chunk_size=2,000 as the operating point. Exp 3.5 shows qwen_2000 scores 1.93 quality vs qwen_3000's 1.57 — the best Qwen quality observed. Running a full concurrency sweep at qwen_2000 would complete the picture: does qwen_2000's tighter window also improve parse-OK rate and eliminate the proxy-reliability problem? If parse-OK at qwen_2000 exceeds qwen_3000's 67.7%, the proxy becomes reliable for within-Qwen tuning.
ITL signal (measured): Qwen's ITL p99 of 118ms vs Llama's 49ms indicates spiky decode behavior. This is likely caused by Qwen's tokenizer producing more variable-length outputs. Capping max_tokens at a lower value (e.g., 768 instead of 1,024) may stabilize ITL at the cost of occasionally truncating long responses.
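None-emission is invisible to a structural check, so measuring it needs a separate pass over the parsed output. The sketch below is a minimal detector, assuming each extraction is a flat JSON object of field name to value; the field names, the set of placeholder strings, and the function names are made up for illustration.

```python
import json

# Strings treated as "no content"; extend to match what the model actually emits.
EMPTY_MARKERS = {"", "none", "none.", "null", "n/a"}


def is_empty_value(value) -> bool:
    """Treat JSON null and placeholder strings such as "None." as empty."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in EMPTY_MARKERS


def empty_field_ratio(raw_json: str) -> float:
    """Fraction of top-level fields that carry no extracted content."""
    fields = json.loads(raw_json)
    if not fields:
        return 1.0
    return sum(is_empty_value(v) for v in fields.values()) / len(fields)


def is_none_emission(raw_json: str, threshold: float = 1.0) -> bool:
    """Flag outputs that parse cleanly but are substantively empty."""
    return empty_field_ratio(raw_json) >= threshold


# Hypothetical field names, not the pipeline's real schema.
empty = json.dumps({"sponsor": "None.", "summary": "None.", "effective_date": "None."})
filled = json.dumps({"sponsor": "Rep. Example", "summary": "Amends the Code.",
                     "effective_date": "None."})
print(is_none_emission(empty), is_none_emission(filled))
```

Counting flagged outputs over a run gives the None-emission rate that the proposed v2 prompt test would compare against its 5% target.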
Root cause (measured): Output budget pressure. At 5,000 input tokens + prompt overhead, insufficient token budget remains for complete 11-field JSON output. Evidence: 54% parse_partial, mean 1.2 missing fields, 0 failed extractions. The model is not failing — it is running out of output space.
Optimization path 1 — Increase max_tokens. Current config: max_tokens=1,024. A complete 11-field extraction averages ~185 tokens at 7B and likely more at 14B. Increasing to 2,048 gives ~2× more output headroom. Test cost: roughly 40–60% slower per call (more decode steps). Run 3 cells at max_tokens=2,048 and measure whether parse_partial drops below 20%.
Optimization path 2 — Reduce to chunk_size=2,000. Smaller input = more output budget remaining. This is the same observation as for Qwen 7B: chunk_size=2,000 may be the sweet spot for all three eliminated models. At 2,000 tokens the 14B model has ~3,000 tokens of output headroom — more than enough for 11 fields. Test: rerun Exp 4 at chunk_size=2,000 with the existing qwen14b chunker config.
Throughput note: Even if quality is fixed, 14B decode is inherently slower per token than 8B. The effective throughput ceiling at concurrency=4 is ~0.15–0.20 chunks/sec regardless of quality improvement. Unless decode speed improves with hardware, the 14B model will always be 2–3× slower than Llama — a trade-off that requires explicit justification.
All three eliminated models point to the same configuration that has not yet been run through a concurrency sweep: chunk_size=2,000. For Mistral it reduces KV cache pressure. For Qwen 7B it is expected to reduce the off-section drift and proxy-reliability failures. For Qwen 14B it recovers output budget. The original sweep skipped from 1,200 to 2,000 and used 2,000 only as a mid-point, never as a primary operating point for any model.
A focused 9-cell experiment — 3 eliminated models × chunk_size=2,000 × 3 repeats — would complete the picture for all three at once. If Mistral_2000 clears the p99 threshold and qwen_2000's proxy becomes reliable, the competitive landscape changes. The Llama recommendation likely holds, but the margin would be explicitly measured rather than inferred.
This is Experiment 5 in the planned pipeline: targeted chunk_size=2,000 sweep, all three eliminated models, full concurrency characterization.
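The cell list for this experiment is small enough to enumerate directly. The sketch below is illustrative: the profile-name pattern mirrors names such as mistral_3000 used in this report, the six concurrency levels are the ones from the existing sweep, and the single-concurrency value of 4 for the focused version is an assumption.

```python
from itertools import product

MODELS = ["mistral", "qwen", "qwen14b"]      # profile prefixes, assumed naming
CHUNK_SIZE = 2_000
REPEATS = 3
CONCURRENCY_LEVELS = [1, 4, 8, 12, 16, 24]   # levels used in the existing sweep


def build_cells(concurrency_levels):
    """One dict per (model, concurrency, repeat) cell at chunk_size=2,000."""
    return [
        {"profile": f"{model}_{CHUNK_SIZE}", "concurrency": c, "repeat": r}
        for model, c, r in product(MODELS, concurrency_levels, range(1, REPEATS + 1))
    ]


focused = build_cells([4])             # 3 models x 1 level x 3 repeats
full = build_cells(CONCURRENCY_LEVELS)  # full concurrency characterization
print(len(focused), len(full))
```

Generating the grid from one function keeps the focused and full versions consistent, so the focused cells can be reused as part of the full sweep.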
10 / Experiment 4
The hypothesis: Qwen 2.5 14B AWQ at chunk_size=5,000 — each eval bill as a single chunk, no splitting, no boundary artifacts — would outperform the 7B at 3,000 tokens. The result was the opposite.
Model: Qwen 2.5 14B Instruct AWQ — 14B parameters, ~9 GB VRAM
Chunk size: 5,000 tokens — each eval bill (4,500–4,972 tokens) becomes a single chunk. One inference call per bill. No splitting. No post-processing aggregation.
Concurrency: 4 · Repeats: 3 · Bills: 50 · Total runtime: 16 minutes
Chunks created by concatenating qwen_3000 chunk pairs back to full-bill text and writing directly to silver — bypassing the broken chunker CLI.
Off-section drift — the dominant failure mode in Experiments 3 and 3.5 — occurs because a larger chunk gives the model more bill context than it needs, causing it to anchor on adjacent sections. The natural fix: give it the whole bill as a single chunk, eliminating the section-boundary problem entirely.
A 14B model also has more capacity to handle complex multi-section bills. The expectation was meaningfully higher parse-OK — potentially approaching Llama's 92.5%.
◆ Result: The larger model performed worse
45.0% parse-OK — the lowest of any configuration tested. 47.5pp below Llama's 92.5%, and 22.7pp below the 7B Qwen at 3,000 tokens. Doubling parameters while eliminating chunking made quality worse, not better.
The 14B model produced 0 failed extractions. It's not crashing — it's returning valid JSON with incomplete field coverage. Mean missing fields: 1.2 out of 11. The model gets most fields but consistently fails to complete all 11 within the output token budget.
At 5,000 input tokens + prompt template overhead, the context window is heavily loaded. The model appears to be hitting output budget pressure — the prompt leaves insufficient room for a complete 11-field JSON response on every bill.
0.156 chunks/sec at concurrency=4 — less than half Llama's 0.338. p50 latency of 24.9 seconds per call vs Llama's 6.7 seconds. Since each bill is now one chunk (vs 2.4 chunks at chunk_size=1,200), bill-level throughput is 0.156 bills/sec vs Llama's 0.141 bills/sec — roughly comparable, but at less than half the quality.
The 14B model decodes more slowly per token than the 8B. AWQ quantization shrinks the weight footprint, but every decoded token still has to pass through a larger network.
At 0.156 bills/sec, processing 75,000 bills would require ~134 hours — 7.4× the available 18-hour window, and more than twice the gap Llama faces. The 14B model is eliminated on both quality and throughput grounds.
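Comparing a one-chunk-per-bill configuration with a multi-chunk one only works after normalizing to bills per second. The sketch below redoes that normalization with the figures quoted in this section; the helper names are illustrative.

```python
WINDOW_HOURS = 18
TARGET_BILLS = 75_000


def bills_per_sec(chunks_per_sec: float, chunks_per_bill: float) -> float:
    """Convert chunk throughput into bill throughput."""
    return chunks_per_sec / chunks_per_bill


def window_multiple(rate: float):
    """Hours to process the 50-state target, and that time as a multiple of the window."""
    hours = TARGET_BILLS / rate / 3600
    return hours, hours / WINDOW_HOURS


llama_c4 = bills_per_sec(0.338, 2.4)     # Llama at chunk_size=1,200, concurrency=4
qwen14b_c4 = bills_per_sec(0.156, 1.0)   # Qwen 14B, one chunk per bill

hours, multiple = window_multiple(qwen14b_c4)
print(f"Llama {llama_c4:.3f} vs 14B {qwen14b_c4:.3f} bills/sec")
print(f"14B: {hours:.1f} h, {multiple:.1f}x the window")
```

Both rates here are at concurrency=4, so the comparison isolates the effect of chunking and model size rather than of concurrency.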
● The architectural lesson
The failure of the 14B full-bill approach reveals something important: structured extraction at scale favors tight context windows over large models. Llama 3.1 8B at 1,200 tokens succeeds because each chunk is small enough that the model can focus precisely on its content and produce complete, accurate field values within the output budget. A 5,000-token bill floods the model's attention across the entire document — it sees everything but completes nothing reliably.
The off-section drift problem (Exp 3.5) is real, but the solution is not to eliminate chunking — it's to choose the right chunk size for the task. At 1,200 tokens, Llama stays on-chunk. The chunking artifacts that motivated this experiment are a configuration problem, not an architectural one.
| Profile | Model | Chunk size | Parse-OK | Chunks/sec | p50 | p99 | Quality | Verdict |
|---|---|---|---|---|---|---|---|---|
| llama31_1200 | Llama 3.1 8B | 1,200 | 92.5% | 0.338 | 6.7s | 21s | 1.70 | ✓ PRODUCTION |
| mistral_3000 | Mistral 7B | 3,000 | 72.0% | 0.212 | 22.4s | 74.1s | 1.53 | ✗ p99 explosion |
| qwen_3000 | Qwen 2.5 7B | 3,000 | 67.7% | 0.421 | 9.1s | 14.8s | 1.57 | ✗ proxy misranks |
| qwen_500 | Qwen 2.5 7B | 500 | 62.6% | 0.706 | 9.8s | 9.8s | 1.58 | ✗ quality ceiling |
| qwen14b_5000 | Qwen 2.5 14B | 5,000 | 45.0% | 0.156 | 24.9s | 33.4s | TBD* | ✗ worst quality tested |
* Qualitative A/S/C review for qwen14b_5000 scheduled as follow-up. Parse-OK of 45% makes the outcome predictable — not prioritized before production deployment.
◆ Final conclusion
The production recommendation is confirmed and strengthened by Experiment 4. Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24 leads on every dimension that matters: parse-OK quality (92.5%), qualitative scoring (1.70/2.0, 60/60 confirmed), latency tail (39s p99 at c=24), and proxy reliability (directional within-model). A model nearly twice its size, given full-bill context, could not match it. The methodology is complete. The recommendation stands.
12 / What's Next
This evaluation answers the model selection question for the current hardware and scale. The experiments below extend it — to larger models, the full production corpus, and the hardware that doesn't exist yet.
Blind rubric scoring of 60 cells across the three quality-best configurations. Confirmed parse-OK reliably ranks models in the same order as substantive quality. 60/60 cells confirmed, zero revised between passes.
Five within-model pairs scored across chunk sizes bracketing each model's parse-OK peak. Found that parse-OK fails as a within-model proxy for all three models — quality peaks at the smaller chunk size in every informative pair. Characterized the failure mode (off-section drift) and established that parse-OK's reliability is model-dependent. Full scoring available in the Quality Review document ↗.
The original concurrency sweep (Exp 2) tested Qwen at chunk_size=500 — its apparent quality-best at the time. Exp 1b later confirmed qwen_3000 as Qwen's actual quality-best. This follow-up runs Qwen 2.5 7B AWQ at chunk_size=3,000 across the same six concurrency levels (1, 4, 8, 12, 16, 24), 3 repeats each, on the same 50-bill controlled sample. Delivers a directly comparable concurrency profile for Qwen's actual quality-best configuration. Estimated runtime: ~2–3 hours. Results will be incorporated into the concurrency section when complete.
Model downloaded and staged on the current L4 (C:\models\Qwen2.5-14B-Instruct-AWQ\, 9.29 GB across 3 shards). This is the natural "start small, scale up" next step: double the parameter count within a single-GPU deployable form factor and measure whether quality improves meaningfully.
The specific hypothesis: at 14B parameters, does the model handle more of the real corpus without chunking? The production corpus has a long right tail — 12% of bills exceed 10,000 tokens, and the longest bill is over 1 million tokens. If a 14B model with a larger native context window handles mid-length bills (say, 3,000–10,000 tokens) as single-chunk extractions rather than requiring splits, it may improve both quality (no chunk-boundary artifacts) and throughput (fewer inference calls per bill).
Design: same eval sample, same chunk-size sweep structure, same telemetry stack. Adds one concurrency-sweep cell at the quality-best chunk size found. Results compared cell-by-cell to the 7-8B baseline established here.
Note: throughput will be lower than the 7-8B models — a 14B model decodes slower and demands more KV cache. The interesting question is whether the quality gain justifies the throughput cost, and whether the effective throughput (quality-adjusted chunks/sec) still favors the smaller model.
The 50-bill controlled eval sample answers "how do these models compare on a stress-test distribution?" The 11,079-bill production corpus answers "how does the chosen model behave on real traffic?" Experiment 5 runs Llama 3.1 8B AWQ at the recommended configuration (chunk_size=1,200, concurrency=12–24) against the full Illinois corpus, with full instrumentation.
Key questions: does parse-OK hold within 2–3pp of the controlled-sample finding across the full length distribution? How do very short bills (<500 tokens, 29% of corpus) behave — do they produce single trivial chunks that inflate parse-OK artificially? How do very long bills (>10,000 tokens, 12% of corpus) behave — does quality degrade as chunk count per bill increases? The deliverable is a production cost model: total chunks, GPU-hours, and projected runtime for the full Illinois corpus and the 50-state target.
All experiments so far use a uniform chunk size across all bills. But the production corpus has a bimodal-ish length distribution — 29% of bills are under 500 tokens (fit in one small chunk), 12% are over 10,000 tokens (require many chunks at any reasonable size). The hypothesis: routing different bill-length buckets to different configurations may outperform any single uniform strategy on corpus-weighted parse-OK.
A simple adaptive policy: bills under 1,500 tokens → extract as a single chunk (no splitting); bills 1,500–5,000 tokens → chunk at 2,000 tokens; bills over 5,000 tokens → chunk at the standard 1,200 tokens with overlap. Each tier uses the model configuration optimized for that range. This eliminates tail-chunk artifacts for medium-length bills and avoids over-fragmenting short bills.
Design complexity is higher — requires tiered configs and a routing layer — but the potential quality improvement for the 40% of bills that fall in the middle range is worth measuring.
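The routing layer for the adaptive policy can be very small. The sketch below encodes the three tiers exactly as described; the class and function names are hypothetical, and treating the 1,500 and 5,000 token boundaries as belonging to the middle tier is an assumption where the description leaves them open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkPlan:
    tier: str
    chunk_size: Optional[int]  # None means extract the bill as a single chunk
    overlap: bool


def route_bill(token_count: int) -> ChunkPlan:
    """Pick a chunking plan from the bill's token count."""
    if token_count < 1_500:
        return ChunkPlan(tier="short", chunk_size=None, overlap=False)
    if token_count <= 5_000:
        return ChunkPlan(tier="medium", chunk_size=2_000, overlap=False)
    return ChunkPlan(tier="long", chunk_size=1_200, overlap=True)


for tokens in (400, 3_200, 12_000):
    print(tokens, route_bill(tokens))
```

Keeping the thresholds in one function makes it straightforward to sweep them later, for example to measure corpus-weighted parse-OK as the tier boundaries move.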
The methodology in this study transfers cleanly to any hardware. The absolute throughput numbers do not — they are specific to the current NVIDIA L4 (22.5 GB VRAM, 72W TDP). When the new AI Solutions Team server arrives, run the same 54-cell concurrency sweep and compare cell-by-cell. The before/after comparison quantifies the hardware upgrade's actual value rather than estimating it from spec sheets.
The SLO gate calculation updates automatically: plug the new measured throughput into the same real-corpus chunks-per-bill numbers and the gap multiplier resolves to whatever the new hardware supports.
◆ The methodology is the asset
The model recommendation (Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24) is the immediate output. The evaluation methodology — controlled-variable sweep design, per-request telemetry, defensive automation, qualitative validation, SLO projection from real corpus distribution — is the durable contribution. It ports to new models, new hardware, and new workloads without redesign. Every future experiment in this pipeline runs against the same baseline.