Tayler Erbe · Performance Engineering Case Study · 2026

Archival Image Pipeline: vLLM vs Ollama & Backend Selection

A controlled performance characterization of a production image-classification pipeline on a single NVIDIA L4 GPU. Diagnosing request serialization in Ollama and then migrating to vLLM continuous batching delivered a 6.22× throughput improvement on the same hardware (3.6× with the model architecture held constant), taking a 12,125-image archival corpus from 22 hours to 3.7 hours.

Workload: LLaVA Multimodal Classification
Hardware: NVIDIA L4 (23 GB VRAM, 72W)
Status: Evaluation Complete · Production Migration Planned
Role: Senior Data Scientist · Lead Engineer

Throughput Speedup at C=8: 6.22×
Corpus Processing Time: 22h → 3.7h
vLLM GPU Utilization: 97%
Images in Production Corpus: 12,125

The Problem

A production LLaVA image-classification pipeline at the University of Illinois System was running at 9 images per minute on a dedicated NVIDIA L4 GPU. At that rate, the full 12,125-image archival corpus required roughly 22 hours per pass — a constraint on iteration speed for prompt design, taxonomy refinement, and any operational re-run.

The pipeline was using Ollama 0.3.14 to serve llava-llama3, called sequentially from a Python orchestrator. The question driving this work was: is the bottleneck the hardware, the model, or the serving stack? The answer dictates radically different responses — hardware upgrade, model substitution, or backend migration.

This case study documents the controlled benchmark methodology used to isolate the bottleneck and the engineering choices that recovered the unused capacity.

Baseline Observation

At production settings the L4 was at 28–31% GPU utilization with 6.1 GB of 23 GB VRAM in use — and pulling 27W of its 72W TDP. The hardware was idle the majority of the time. A faster GPU would not have helped.

Investigation Framework

Three controlled experiments were designed to separate three independent variables: prompt design (call structure, output schema), client concurrency (1 / 2 / 4 / 8 in-flight requests), and serving stack (Ollama vs vLLM with the same model class). Holding two variables fixed while sweeping the third isolates the cause of any observed difference.

Constraints

Single dedicated L4 (no horizontal scaling option). On-prem environment with TLS-level block on HuggingFace (model artifacts pre-staged). Sole engineer. Production pipeline could not be paused during evaluation, requiring a parallel sandbox.

Methodology

Three experiments were run against the same image corpus and recorded with sub-percent precision. Each experiment isolates one variable; the benchmark harness sampled GPU utilization, VRAM, and power at one-second intervals concurrently with the inference run so that throughput and resource numbers come from the same window of clock time, not separate runs averaged after the fact.
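As a rough illustration of same-window sampling, the sketch below polls nvidia-smi once per second on a background thread while an inference run executes. The query fields are standard nvidia-smi ones; the names `parse_sample`, `read_gpu`, and `GpuSampler` are hypothetical and not taken from the actual harness. It assumes every queried field comes back numeric, which may not hold on every device.

```python
import subprocess
import threading
import time

# Standard nvidia-smi query fields: utilization (%), memory used (MiB), power draw (W).
QUERY = "utilization.gpu,memory.used,power.draw"


def parse_sample(line: str) -> dict:
    """Parse one CSV line such as '28, 6246, 27.31' (noheader, nounits format)."""
    util, mem_mib, power_w = (field.strip() for field in line.split(","))
    return {
        "gpu_util_pct": float(util),
        "vram_mib": float(mem_mib),
        "power_w": float(power_w),
    }


def read_gpu() -> dict:
    """Take one reading from the first GPU. Requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


class GpuSampler:
    """Collects timestamped readings on a background thread until stopped."""

    def __init__(self, interval_s: float = 1.0, read=read_gpu):
        self.interval_s = interval_s
        self.read = read  # injectable so the sampler can be exercised without a GPU
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append({"t": time.time(), **self.read()})
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Wrapping the benchmark call in `with GpuSampler() as gpu:` leaves `gpu.samples` holding readings whose timestamps fall inside the same wall-clock window as the latency measurements, so both can be joined on time afterwards.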

01
Experiment 1 — Prompt Design (C-v4)
Compared a three-call split prompt (offensive flag, category/reason, description) against a single combined prompt with structured JSON output across all six fields. The combined prompt (C-v4) delivered 1.61× speedup (6.86s vs 11.05s per image, median) while holding parse rate at 100% on offensive/category/reason/confidence/description and 94% on title. Locked C-v4 as production prompt before any backend work began — controlling for prompt cost in subsequent experiments.
02
Experiment 2 — Ollama Concurrency Sweep
With C-v4 locked and llava-llama3 serving on Ollama, ran the same 50-image batch at client concurrency C = 1, 2, 4, 8. Recorded p50/p95 latency, end-to-end throughput, GPU utilization, VRAM, and wall power for each setting. Set OLLAMA_NUM_PARALLEL=4 to rule out a default-config explanation.
03
Experiment 3 — vLLM Concurrency Sweep
Built a parallel sandbox with vLLM 0.6.6 serving llava-1.5-7b in FP16. Repeated the C = 1, 2, 4, 8 sweep with the same image batch, the same prompt, and the same orchestrator code path. On the client side, the only thing that changed between Experiments 2 and 3 was the inference server URL; the served model also differed, a confound the control run below removes.
04
Apples-to-Apples Confirmation
To remove model family as a confound, also re-ran the Ollama sweep with llava:7b, the same architecture vLLM was serving. With architecture and hardware held fixed, the serving stack is the main remaining variable (weight precision still differs: Q4_0 on Ollama, FP16 on vLLM), so the speedup cannot be attributed to a model swap.
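The concurrency sweeps in Experiments 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the actual harness: `send` stands in for whatever coroutine issues one classification request to the server under test, and `run_at_concurrency` and `sweep` are hypothetical names. A semaphore caps the number of in-flight requests at C, and latency is timed from the moment a request is actually sent.

```python
import asyncio
import statistics
import time


async def run_at_concurrency(items, send, concurrency):
    """Send every item through `send`, keeping at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)
    latencies = []

    async def one(item):
        async with gate:
            start = time.perf_counter()
            await send(item)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(item) for item in items))
    wall = time.perf_counter() - wall_start

    # 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "concurrency": concurrency,
        "p50_s": statistics.median(latencies),
        "p95_s": cuts[94],
        "items_per_min": 60 * len(latencies) / wall,
    }


async def sweep(items, send, levels=(1, 2, 4, 8)):
    """Run the same batch once per concurrency level, one level at a time."""
    return [await run_at_concurrency(items, send, c) for c in levels]
```

Because the backend is reached only through `send`, pointing the same sweep at a different server is a one-argument change, which mirrors the "same orchestrator code path" constraint described above.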

The Diagnosis — Request Serialization

The Ollama concurrency sweep produced a signature pattern: throughput stayed flat at ~9 images per minute across all concurrency levels while p50 latency inflated by 8.18× from C=1 to C=8. GPU utilization held at 28–31% throughout. VRAM was constant. Power was flat at 27W.

This is the unmistakable fingerprint of request serialization at the serving layer: the daemon was accepting concurrent connections but processing them one at a time, so additional in-flight requests simply queued and inflated latency without producing any extra throughput. The hardware was never the bottleneck.
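Little's Law gives a quick consistency check on this reading: in a closed loop with C requests always in flight, mean latency is approximately C divided by throughput. The sketch below applies it to the C = 8 figures reported in this study. The function name is illustrative, and the comparison is approximate because the law predicts a mean while the study reports a median.

```python
def expected_latency_s(concurrency: int, throughput_per_min: float) -> float:
    """Little's Law for a closed loop: latency = in-flight requests / throughput."""
    return concurrency / (throughput_per_min / 60.0)


# Throughputs reported in this case study at C = 8.
ollama_pred = expected_latency_s(8, 8.9)   # about 53.9 s; measured p50 was 52.5 s
vllm_pred = expected_latency_s(8, 55.4)    # about 8.7 s; measured p50 was 8.2 s

print(f"Ollama predicted {ollama_pred:.1f} s, vLLM predicted {vllm_pred:.1f} s")
```

When throughput does not move with C, the formula makes latency grow linearly in C, which is the queueing pattern observed on Ollama.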

[Chart: throughput vs. concurrency ("Ollama serializes · vLLM batches"). X-axis: client concurrency (parallel in-flight requests), C = 1, 2, 4, 8. Y-axis: images per minute, 0 to 60. Labeled values at C = 8: vLLM · llava-1.5-7b (FP16) 55.4; Ollama · llava:7b (Q4_0) 15.3; Ollama · llava-llama3 (Q4_0) 8.9.]
Fig. 01 — Throughput scaling with client concurrency. Ollama flat at C=1..8 regardless of model.
[Chart: median GPU utilization by backend ("vLLM saturates the L4 · Ollama leaves it idle"). X-axis: client concurrency. Y-axis: median GPU utilization, 0% to 100%.]
Backend / Model                 C=1   C=2   C=4   C=8
Ollama · llava-llama3 (Q4_0)    28%   26%   27%   31%
Ollama · llava:7b (Q4_0)        54%   54%   54%   54%
vLLM · llava-1.5-7b (FP16)      96%   96%   97%   97%
Fig. 02 — Steady-state GPU utilization by backend. 70 percentage points of idle compute on production.
Why It Was Plausibly Misread as a Hardware Limit

An L4 with 23 GB VRAM running a 7B-parameter multimodal model can look reasonable at 9 img/min if you only look at end-to-end throughput. Without sampling utilization concurrently, an operator could spend a quarter on a faster GPU and recover none of the unused capacity, because the bottleneck is not on the GPU at all.

Apples-to-Apples Control

Re-running Ollama with the same model architecture vLLM was serving (llava:7b) produced 15.3 img/min, flat across concurrency, at 54% GPU utilization. A 3.6× delta versus vLLM on the same architecture and identical hardware points to the serving stack as the cause.

The Solution — vLLM Continuous Batching

vLLM 0.6.6 implements continuous batching: at each forward pass the scheduler can swap completed sequences out and queued sequences in, keeping the GPU saturated regardless of which token any given request is currently emitting. For a workload with variable-length outputs and many in-flight requests — exactly the shape of this image-classification pipeline — this is the right serving primitive.
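A toy model can make the scheduling difference concrete. The sketch below is not vLLM's scheduler: it assumes every forward pass costs one step regardless of batch size and ignores prefill, image encoding, and memory limits. It only counts how many decode steps are needed when sequences of different output lengths are served one at a time versus admitted into free batch slots as soon as another sequence finishes. Function names are illustrative.

```python
from collections import deque


def steps_serialized(lengths):
    """One request at a time: every output token costs its own forward pass."""
    return sum(lengths)


def steps_continuous(lengths, max_batch):
    """Refill free batch slots from the queue before every forward pass."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while queue or running:
        # Admit queued sequences into any free slots.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One forward pass emits one token for every active sequence.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Finished sequences leave immediately, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
    return steps


lengths = [3, 1, 2, 4]  # output lengths in tokens, arbitrary example values
print(steps_serialized(lengths))        # 10
print(steps_continuous(lengths, 2))     # 7
print(steps_continuous(lengths, 4))     # 4
```

The short sequence finishing after one step lets the next queued request start immediately instead of waiting for the longest sequence in its batch, which is why variable-length outputs benefit from this scheduling style.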

With vLLM as the serving stack and no other change, throughput scaled cleanly with concurrency: 13.0 → 22.4 → 37.4 → 55.4 images per minute at C = 1, 2, 4, 8. GPU utilization rose to 96–97%. VRAM moved into the 19–20 GB range — the GPU was finally being used as intended. Wall power climbed to 71–72W, indicating real computational work rather than idle wait.

At C=8 the speedup over the original production configuration is 6.22×; the same-model speedup (eliminating any model-family contribution) is 3.6×. The 3.6× is the pure serving-architecture effect; the additional 1.7× comes from running a model that better fits the L4's compute profile.
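The decomposition is plain ratio arithmetic on the C = 8 throughputs reported in this study, and the two factors multiply back to the total. A short check, with illustrative variable names:

```python
# Images per minute at C = 8, as reported in this case study.
ollama_production = 8.9    # Ollama, llava-llama3
ollama_same_model = 15.3   # Ollama, llava:7b
vllm = 55.4                # vLLM, llava-1.5-7b

total = vllm / ollama_production                    # about 6.22
serving_stack = vllm / ollama_same_model            # about 3.62
model_swap = ollama_same_model / ollama_production  # about 1.72

print(f"total {total:.2f}x = serving {serving_stack:.2f}x * model {model_swap:.2f}x")
```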

Backend / Model         C=1 (img/min)   C=8 (img/min)   GPU %     VRAM
Ollama llava-llama3     9.1             8.9             28–31%    6.1 GB
Ollama llava:7b         14.9            15.3            54%       5.2 GB
vLLM llava-1.5-7b       13.0            55.4            96–97%    19–20 GB
Read the rightmost column. Ollama's VRAM stays at 5–6 GB regardless of concurrency. vLLM's footprint is far larger because it holds FP16 weights and reserves a KV-cache pool sized to keep many sequences in flight at once. That is the architectural difference, made visible in resource numbers.
[Chart: median request latency vs. concurrency ("Linear queueing on Ollama · near-flat on vLLM"). X-axis: client concurrency, C = 1, 2, 4, 8. Y-axis: median request latency, 0 s to 60 s. Labeled values at C = 8: Ollama · llava-llama3 (Q4_0) 52.5 s; Ollama · llava:7b (Q4_0) 31.2 s; vLLM · llava-1.5-7b (FP16) 8.2 s.]
Fig. 03 — p50 latency vs concurrency. Ollama: 8.18× inflation. vLLM: 1.86× inflation. Same hardware.
[Chart: wall-clock time for the full 12,125-image corpus, continuous inference, best concurrency per backend. Axis: wall-clock hours, 0 to 25.]
Ollama · llava-llama3 (Q4_0)    22.2 hrs   9.1 img/min
Ollama · llava:7b (Q4_0)        13.1 hrs   15.4 img/min
vLLM · llava-1.5-7b (FP16)      3.7 hrs    55.4 img/min
Fig. 04 — Wall time to process the 12,125-image archival corpus at production concurrency.
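The wall-clock figures follow from dividing the corpus size by sustained throughput. A small sketch with an illustrative function name, assuming throughput holds steady for the whole pass and ignoring startup and I/O overhead:

```python
def corpus_hours(n_images: int, images_per_min: float) -> float:
    """Hours needed to process n_images at a constant images-per-minute rate."""
    return n_images / images_per_min / 60.0


CORPUS = 12_125

for label, rate in [
    ("Ollama llava-llama3", 9.1),
    ("Ollama llava:7b", 15.4),
    ("vLLM llava-1.5-7b", 55.4),
]:
    print(f"{label}: {corpus_hours(CORPUS, rate):.2f} h")
```

At exactly 55.4 img/min the arithmetic gives roughly 3.65 hours, close to the 3.7 hours reported here; the small gap is what rounding or per-run overhead would be expected to produce.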

Production Validation

vLLM Output Quality · Validated
Parse rate by field:
  • Offensive Flag: 100%
  • Category: 100%
  • Reason: 100%
  • Confidence: 100%
  • Description: 100%
  • Title: 94%
Parse rates were measured on the same C-v4 prompt that ran in production. A two-line schema cleanup brought all six fields to production-acceptable rates. Spot-checks against the original Ollama outputs confirmed equivalent factual quality — and on at least one hero example (image 0000003, Memorial Stadium 1924), vLLM was more conservative, refusing to hallucinate a location the Ollama variant invented as "University of Texas."
vLLM 0.6.6 · llava-1.5-7b · FP16 · Continuous Batching
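A per-field parse rate like the one reported above could be computed along these lines. This is a sketch under assumptions: the JSON key names below are guesses based on the six field names in this study, not the actual C-v4 schema, and "parsed" is taken to mean the key is present with a non-empty value.

```python
import json

# Hypothetical key names; the real C-v4 schema may differ.
FIELDS = ("offensive", "category", "reason", "confidence", "description", "title")


def parse_rates(raw_outputs):
    """Fraction of outputs, per field, where the field is present and non-empty."""
    hits = dict.fromkeys(FIELDS, 0)
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a miss for every field
        if not isinstance(record, dict):
            continue
        for field in FIELDS:
            value = record.get(field)
            # False and 0 are valid values; only missing or empty-string values miss.
            if value is not None and value != "":
                hits[field] += 1
    total = len(raw_outputs) or 1  # avoid dividing by zero on an empty batch
    return {field: hits[field] / total for field in FIELDS}
```

Counting a malformed response as a miss on all six fields keeps the metric conservative: a backend cannot score well on one field while emitting JSON that downstream parsers would reject.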
Same-Model Control · Confound Removed
  • Ollama Model: llava:7b
  • vLLM Model: llava-1.5-7b
  • Architecture: Same
  • Hardware: Same L4
  • Speedup, vLLM vs Ollama: 3.6×
  • Attributable Cause: Serving Stack
The model-architecture confound is a common objection to backend benchmarks. Re-running Ollama with the same model architecture vLLM was serving eliminates it: the 3.6× delta is purely the serving stack. The remaining 1.7× (3.6× → 6.22×) is the production model swap from llava-llama3 to llava-1.5-7b, which fits the L4 better.
Ollama 0.3.14 · llava:7b · Daemon Property
Production Speedup at C=8: 6.22×
Same-Model Backend Speedup: 3.6×
vLLM Latency Inflation at C=8: 1.86×

Engineering Takeaways

The methodology choices that made the diagnosis possible are as important as the result itself. The relevant principles, in order of impact:

01
Sample resource metrics concurrently with the inference run
GPU utilization, VRAM, and power must come from the same window of wall-clock time as throughput and latency. Averaging post-hoc across separate runs hides the serialization signature entirely — high latency with low GPU is the diagnostic pattern, and it only appears if both are sampled in the same window.
02
Sweep concurrency, not just total request count
A flat throughput curve across C = 1, 2, 4, 8 with latency rising roughly in proportion to C is a strong diagnostic for request serialization. Without the sweep, the single data point at the production concurrency setting looks like a hardware limit and is easy to misread as one.
03
Remove confounds before claiming attribution
"vLLM is 6.22× faster than Ollama" is a confounded claim if the model also changed: the speedup cannot be attributed to the backend alone. The same-model rerun (llava:7b on Ollama vs llava-1.5-7b on vLLM, same architecture) decomposes the result into attributable components: 3.6× from serving stack, 1.7× from model fit.
04
Validate output quality before declaring a win
A throughput speedup that breaks downstream parsers or hallucinates new fields is a regression, not a win. Parse-rate measurement on the same C-v4 prompt — and spot-check comparisons against the production baseline — were required gates for the migration recommendation.
05
Cheaper experiments first, then commit
The prompt-design experiment (Experiment 1) was completed before any backend work began. A 1.61× speedup that costs zero infrastructure migration is the highest-ROI win available on most LLM workloads — and it controls for prompt cost in subsequent backend tests.
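The sweep-based diagnostic from the second takeaway can be reduced to a small check on the two end points of a sweep. The sketch below is illustrative only: the function name and both thresholds are hypothetical choices, not values used in this study. The inputs are the C = 1 and C = 8 throughputs and the p50 latency inflation factors reported here.

```python
def diagnose_sweep(throughput_c1, throughput_cn, latency_inflation, n):
    """Label a concurrency sweep from its end points (C = 1 and C = n)."""
    gain = throughput_cn / throughput_c1  # 1.0 means concurrency bought nothing
    efficiency = gain / n                 # 1.0 would be perfect linear scaling
    # Hypothetical thresholds: under 10% throughput gain while latency grows
    # by at least 80% of n is treated as request serialization.
    serialized = gain < 1.10 and latency_inflation >= 0.8 * n
    return {
        "gain": round(gain, 2),
        "efficiency": round(efficiency, 2),
        "verdict": "serialized" if serialized else "scales with concurrency",
    }


ollama = diagnose_sweep(9.1, 8.9, 8.18, n=8)
vllm = diagnose_sweep(13.0, 55.4, 1.86, n=8)
print(ollama["verdict"], "|", vllm["verdict"])
```

On the reported numbers the Ollama sweep is labeled serialized, while the vLLM sweep shows a throughput gain of about 4.26× at C = 8, or roughly 53% of perfect linear scaling.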

Technology Stack

Inference Stack
  • vLLM 0.6.6 (winner)
  • Ollama 0.3.14 (baseline)
  • LLaVA-1.5-7B FP16
  • llava-llama3, llava:7b
Hardware & Environment
  • NVIDIA L4 (23 GB VRAM)
  • CUDA 12.4 · 72W TDP
  • 4 vCPU · 62 GB RAM
  • On-prem Linux server
Benchmark Harness
  • Python · asyncio
  • nvidia-smi sampling
  • pandas / pyarrow
  • matplotlib
Tags: vLLM · Continuous Batching · LLaVA · NVIDIA L4 · GPU Profiling · Performance Engineering · Throughput Characterization · Apples-to-Apples Benchmarking · Python · asyncio