Tayler Erbe · Performance Engineering Case Study · 2026

Archival Image Pipeline
vLLM vs Ollama
& Backend Selection

A controlled performance characterization of a production image-classification pipeline on a single NVIDIA L4 GPU. Diagnosing request serialization in Ollama, then migrating to vLLM continuous batching delivered a 6.22× throughput improvement on the same hardware and same model — taking a 12,125-image archival corpus from 22 hours to 3.7 hours.

View Code & Full Report on GitHub → ← Related: Archival Image Intelligence

Workload

LLaVA Multimodal Classification

Hardware

NVIDIA L4 (23 GB VRAM, 72W)

Status

Evaluation Complete · Production Migration Planned

Role

Senior Data Scientist · Lead Engineer

Code

GitHub Repository →

6.22×

Throughput Speedup at C=8

22h → 3.7h

Corpus Processing Time

97%

vLLM GPU Utilization

12,125

Images in Production Corpus

The Problem

A production LLaVA image-classification pipeline at the University of Illinois System was running at 9 images per minute on a dedicated NVIDIA L4 GPU. At that rate, the full 12,125-image archival corpus required roughly 22 hours per pass — a constraint on iteration speed for prompt design, taxonomy refinement, and any operational re-run.

The pipeline was using Ollama 0.3.14 to serve llava-llama3, called sequentially from a Python orchestrator. The question driving this work was: is the bottleneck the hardware, the model, or the serving stack? The answer dictates radically different responses — hardware upgrade, model substitution, or backend migration.

This case study documents the controlled benchmark methodology used to isolate the bottleneck and the engineering choices that recovered the unused capacity.

Baseline Observation

At production settings the L4 was at 28–31% GPU utilization with 6.1 GB of 23 GB VRAM in use — and pulling 27W of its 72W TDP. The hardware was idle the majority of the time. A faster GPU would not have helped.

Investigation Framework

Three controlled experiments were designed to separate three independent variables: prompt design (call structure, output schema), client concurrency (1 / 2 / 4 / 8 in-flight requests), and serving stack (Ollama vs vLLM with the same model class). Holding two variables fixed while sweeping the third isolates the cause of any observed difference.

Constraints

Single dedicated L4 (no horizontal scaling option). On-prem environment with TLS-level block on HuggingFace (model artifacts pre-staged). Sole engineer. Production pipeline could not be paused during evaluation, requiring a parallel sandbox.

Methodology

Three experiments were run against the same image corpus and recorded with sub-percent precision. Each experiment isolates one variable; the benchmark harness sampled GPU utilization, VRAM, and power at one-second intervals concurrently with the inference run so that throughput and resource numbers come from the same window of clock time, not separate runs averaged after the fact.

Experiment 1 — Prompt Design (C-v4)

Compared a three-call split prompt (offensive flag, category/reason, description) against a single combined prompt with structured JSON output across all six fields. The combined prompt (C-v4) delivered 1.61× speedup (6.86s vs 11.05s per image, median) while holding parse rate at 100% on offensive/category/reason/confidence/description and 94% on title. Locked C-v4 as production prompt before any backend work began — controlling for prompt cost in subsequent experiments.

Experiment 2 — Ollama Concurrency Sweep

With C-v4 locked and llava-llama3 serving on Ollama, ran the same 50-image batch at client concurrency C = 1, 2, 4, 8. Recorded p50/p95 latency, end-to-end throughput, GPU utilization, VRAM, and wall power for each setting. Set OLLAMA_NUM_PARALLEL=4 to rule out a default-config explanation.

Experiment 3 — vLLM Concurrency Sweep

Built a parallel sandbox with vLLM 0.6.6 serving llava-1.5-7b in FP16. Repeated the C = 1, 2, 4, 8 sweep with the same image batch, the same prompt, and the same orchestrator code path. The only thing that changed between Experiment 2 and 3 was the inference server URL.

Apples-to-Apples Confirmation

To remove model-family as a confound, also re-ran the Ollama sweep with llava:7b — the same architecture vLLM was serving. This isolates the serving architecture as the sole variable, confirming the speedup is not attributable to a model swap.

The Diagnosis — Request Serialization

The Ollama concurrency sweep produced a signature pattern: throughput stayed flat at ~9 images per minute across all concurrency levels while p50 latency inflated by 8.18× from C=1 to C=8. GPU utilization held at 28–31% throughout. VRAM was constant. Power was flat at 27W.

This is the unmistakable fingerprint of request serialization at the serving layer: the daemon was accepting concurrent connections but processing them one at a time, so additional in-flight requests simply queued and inflated latency without producing any extra throughput. The hardware was never the bottleneck.

Fig. 01 — Throughput scaling with client concurrency. Ollama flat at C=1..8 regardless of model.

Fig. 02 — Steady-state GPU utilization by backend. 70 percentage points of idle compute on production.

Why It Was Plausibly Misread as a Hardware Limit

An L4 with 23 GB VRAM running a 7B-parameter multimodal model can look reasonable at 9 img/min if you only look at end-to-end throughput. Without sampling utilization concurrently, an operator could spend a quarter on a faster GPU and recover none of the unused capacity, because the bottleneck is not on the GPU at all.

Apples-to-Apples Control

Re-running Ollama with the same model architecture vLLM was serving (llava:7b) produced 15.3 img/min — flat across concurrency, 54% GPU. A 3.6× delta versus vLLM on identical model + identical hardware isolates the serving stack as the cause.

The Solution — vLLM Continuous Batching

vLLM 0.6.6 implements continuous batching: at each forward pass the scheduler can swap completed sequences out and queued sequences in, keeping the GPU saturated regardless of which token any given request is currently emitting. For a workload with variable-length outputs and many in-flight requests — exactly the shape of this image-classification pipeline — this is the right serving primitive.

With vLLM as the serving stack and no other change, throughput scaled cleanly with concurrency: 13.0 → 22.4 → 37.4 → 55.4 images per minute at C = 1, 2, 4, 8. GPU utilization rose to 96–97%. VRAM moved into the 19–20 GB range — the GPU was finally being used as intended. Wall power climbed to 71–72W, indicating real computational work rather than idle wait.

At C=8 the speedup over the original production configuration is 6.22×; the same-model speedup (eliminating any model-family contribution) is 3.6×. The 3.6× is the pure serving-architecture effect; the additional 1.7× comes from running a model that better fits the L4's compute profile.

Backend / Model	C=1	C=8	GPU%	VRAM
Ollama llava-llama3	9.1	8.9	28–31%	6.1 GB
Ollama llava:7b	14.9	15.3	54%	5.2 GB
vLLM llava-1.5-7b	13.0	55.4	96–97%	19–20 GB

Read the rightmost column. Ollama's VRAM is essentially the same regardless of concurrency. vLLM's VRAM grows because it is actually batching multiple sequences in parallel. That is the architectural difference, made visible in resource numbers.

Fig. 03 — p50 latency vs concurrency. Ollama: 8.18× inflation. vLLM: 1.86× inflation. Same hardware.

Fig. 04 — Wall time to process the 12,125-image archival corpus at production concurrency.

Production Validation

vLLM — Output Quality

Validated

Offensive Flag — Parse Rate 100%

Category — Parse Rate 100%

Reason — Parse Rate 100%

Confidence — Parse Rate 100%

Description — Parse Rate 100%

Title — Parse Rate 94%

Parse rates were measured on the same C-v4 prompt that ran in production. A two-line schema cleanup brought all six fields to production-acceptable rates. Spot-checks against the original Ollama outputs confirmed equivalent factual quality — and on at least one hero example (image 0000003, Memorial Stadium 1924), vLLM was more conservative, refusing to hallucinate a location the Ollama variant invented as "University of Texas."

vLLM 0.6.6 llava-1.5-7b FP16 Continuous Batching

Same-Model Control

Confound Removed

Ollama Model llava:7b

vLLM Model llava-1.5-7b

Architecture Same

Hardware Same L4

Speedup vLLM vs Ollama 3.6×

Attributable Cause Serving Stack

The model-architecture confound is a common objection to backend benchmarks. Re-running Ollama with the same model architecture vLLM was serving eliminates it: the 3.6× delta is purely the serving stack. The remaining 1.7× (3.6× → 6.22×) is the production model swap from llava-llama3 to llava-1.5-7b, which fits the L4 better.

Ollama 0.3.14 llava:7b Daemon Property

6.22×

Production Speedup at C=8

3.6×

Same-Model Backend Speedup

1.86×

vLLM Latency Inflation at C=8

Engineering Takeaways

The methodology choices that made the diagnosis possible are as important as the result itself. The relevant principles, in order of impact:

Sample resource metrics concurrently with the inference run

GPU utilization, VRAM, and power must come from the same window of wall-clock time as throughput and latency. Averaging post-hoc across separate runs hides the serialization signature entirely — high latency with low GPU is the diagnostic pattern, and it only appears if both are sampled in the same window.

Sweep concurrency, not just total request count

A flat throughput curve across C = 1, 2, 4, 8 with rising latency is a binary diagnostic for request serialization. Without the sweep, the single data point at the production concurrency setting looks like a hardware limit — and is almost always misread as one.

Remove confounds before claiming attribution

"vLLM is 6.22× faster than Ollama" is an unfalsifiable claim if the model also changed. The same-model rerun (llava:7b on Ollama vs llava-1.5-7b on vLLM, same architecture) decomposes the result into attributable components: 3.6× from serving stack, 1.7× from model fit.

Validate output quality before declaring a win

A throughput speedup that breaks downstream parsers or hallucinates new fields is a regression, not a win. Parse-rate measurement on the same C-v4 prompt — and spot-check comparisons against the production baseline — were required gates for the migration recommendation.

Cheaper experiments first, then commit

The prompt-design experiment (Experiment 1) was completed before any backend work began. A 1.61× speedup that costs zero infrastructure migration is the highest-ROI win available on most LLM workloads — and it controls for prompt cost in subsequent backend tests.

Technology Stack

Inference Stack

› vLLM 0.6.6 (winner)
› Ollama 0.3.14 (baseline)
› LLaVA-1.5-7B FP16
› llava-llama3, llava:7b

Hardware & Environment

› NVIDIA L4 (23 GB VRAM)
› CUDA 12.4 · 72W TDP
› 4 vCPU · 62 GB RAM
› On-prem Linux server

Benchmark Harness

› Python · asyncio
› nvidia-smi sampling
› pandas / pyarrow
› matplotlib

vLLM Continuous Batching LLaVA NVIDIA L4 GPU Profiling Performance Engineering Throughput Characterization Apples-to-Apples Benchmarking Python · asyncio

← Back to Portfolio

Archival Image PipelinevLLM vs Ollama& Backend Selection

The Problem

Methodology

The Diagnosis — Request Serialization

The Solution — vLLM Continuous Batching

Production Validation

Engineering Takeaways

Technology Stack

Archival Image Pipeline
vLLM vs Ollama
& Backend Selection