Two Technologies That Reduce AI Model Deployment Costs: Quantization and Prefix Caching

  • May 7
  • 9 min read

Hi, I'm Jinbeom Kim, a Software Developer on the AIEEV Dev Team.

I studied computer science through both undergrad and graduate school, and I've been with AIEEV since the early days of the company, working on how to operate distributed GPU resources more efficiently within Air Cloud 😊


In this post, I want to walk through two techniques we regularly evaluate when thinking about how to deploy AI models more efficiently.


The first is Quantization — a method for reducing memory usage and memory bandwidth pressure by representing model weights (and, depending on the approach, activations and KV cache values) in lower precision.


The second is Prefix Caching — a method that avoids recomputing the same input segments across requests by reusing previously computed results for shared prompt prefixes.





Where Resources Are Spent During LLM Inference


Before diving into the two techniques, it helps to understand where resources actually go when an LLM processes a request. There are three main phases:


  1. Model loading — keeping model weights resident in GPU memory

  2. Prefill — processing the input prompt and constructing the internal state (KV cache)

  3. Decode — generating output tokens one at a time


Quantization and Prefix Caching target different phases here.


Quantization primarily reduces the memory footprint of model weights. In decode-heavy, memory-bandwidth-constrained environments, it can also improve throughput — since each token generation requires repeated access to weights and KV cache, using lower-precision representations means less data to read and hold in GPU memory.


Prefix Caching reduces redundant computation during prefill. When multiple requests share a common prefix, it reuses the already-computed KV cache for that segment instead of reprocessing it from scratch.



What Happens Without Optimization


Without these optimizations, model weights sit in GPU memory at full precision, and every request — regardless of how much it overlaps with previous ones — recomputes the entire input from scratch.


As models get larger, memory requirements scale with them. More GPU memory is needed just to load the model, which constrains how many requests can be handled concurrently on the same hardware. On the input side, even when system prompts, tool definitions, output format instructions, and reference documents are identical across requests, the absence of caching means running the prefill computation every single time.


From the user's perspective, each request looks unique. From the server's perspective, the same system prompt and tool definitions are being processed over and over again.

At low traffic, this redundancy is easy to overlook. But as load increases, maintaining consistent throughput and latency on fixed GPU resources becomes much harder — and repeated input computation only adds to the prefill time, directly affecting time-to-first-token (TTFT).


Running AI services reliably isn't just a question of scaling up to bigger GPUs. It also means reducing how much memory the model consumes and eliminating redundant computation wherever possible.



Quantization: Representing the Same Model with Smaller Numbers


LLMs are, at their core, enormous collections of matrix operations. Model weights and the intermediate values produced during inference are all numbers — and those numbers are continuously read and computed on GPU memory. Higher precision is better for model quality, but it comes at a cost: more memory and more bandwidth.

Quantization addresses this by approximating those numbers in lower precision, allowing the same model to be operated with less memory and lower bandwidth pressure.


The figure below shows this concept in simplified form.


Quantization concept: approximating FP32 values into INT4 representations (Source: Maarten Grootendorst, "A Visual Guide to Quantization")

The diagram above shows what this looks like — high-precision floating-point values (like FP32) being mapped to low-precision integers (like INT4). Fewer bits means less data to store and read, but multiple original values may map to the same approximated value, which is why quality validation before deployment is essential.
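To make the mapping concrete, here is a minimal sketch of symmetric quantization to a signed 4-bit range. The function names and the sample values are made up for illustration, and real quantization methods are considerably more elaborate; the point is only to show how two nearby floating-point values can collapse into the same integer.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit values
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float step per integer unit
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen so that two of them land on the same integer.
weights = [0.70, 0.31, 0.29, -0.12, -0.70]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

for original, integer, approx in zip(weights, q, restored):
    print(f"{original:+.2f} -> {integer:+d} -> {approx:+.2f}")
```

In this sample, 0.31 and 0.29 both become the integer 3, so after dequantization they are indistinguishable. That lost distinction is the source of the quality risk discussed above.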


In practice, models commonly use FP16 or BF16 weights, and quantization may bring some or all of those weights down to INT8, INT4, or FP8.


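| Data Type   | Bits | Theoretical Size vs. FP16/BF16 |
|-------------|------|--------------------------------|
| FP32        | 32   | 2.0x                           |
| FP16 / BF16 | 16   | 1.0x                           |
| FP8         | 8    | 0.5x                           |
| INT8        | 8    | 0.5x                           |
| INT4        | 4    | 0.25x                          |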


These numbers look clean — INT8 halves the size, INT4 reduces it to a quarter. But it's worth treating these as theoretical storage estimates, not guaranteed end-to-end savings. In real deployments, GPU memory is shared across weights, KV cache, activations, and temporary buffers. Scaling metadata, packing formats, and alignment padding all contribute too. The actual reduction in total GPU memory usage won't always match the weight compression ratio.
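The theoretical figures in the table follow directly from bits per weight. The sketch below reproduces that arithmetic for a hypothetical 7-billion-parameter model; the parameter count is an assumption for illustration, and the result covers weight storage only, not KV cache, activations, or buffers.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_bytes(num_params, dtype):
    """Theoretical storage for the weights alone, ignoring metadata and padding."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8


def ratio_vs_fp16(dtype):
    """Size relative to a 16-bit baseline, as in the table above."""
    return BITS_PER_WEIGHT[dtype] / BITS_PER_WEIGHT["FP16/BF16"]


NUM_PARAMS = 7_000_000_000  # hypothetical model size

for dtype in BITS_PER_WEIGHT:
    gib = weight_bytes(NUM_PARAMS, dtype) / 2**30
    print(f"{dtype:10s} {gib:6.1f} GiB  ({ratio_vs_fp16(dtype):.2f}x)")
```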


It's also worth noting that quantization doesn't change a model's architecture — the number of layers, attention structure, and overall shape all stay the same. What changes is how the numbers representing weights and some intermediate values are stored and processed.


And the implementation is more involved than a simple type cast. Applying lower-precision values in computation requires consistent scaling strategies, group- or block-level quantization schemes, packing formats, and specialized kernels. Most inference engines abstract away these details, but in practice you'll still need to verify that the quantized model format is actually supported by your inference engine and GPU environment.
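As a rough sketch of why this goes beyond a type cast, the example below quantizes weights in small groups, each with its own scale, and packs two 4-bit values into every byte. The group size, the nibble order, and all function names are assumptions for illustration; actual formats and kernels differ by method and engine.

```python
def quantize_group(values, bits=4):
    """Symmetric quantization of one group; returns (integers, scale)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(data, count):
    """Inverse of pack_int4: restore signed values from two's-complement nibbles."""
    values = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


def quantize_groupwise(weights, group_size=4):
    """Quantize each group with its own scale; returns (packed groups, scales)."""
    packed, scales = [], []
    for i in range(0, len(weights), group_size):
        q, scale = quantize_group(weights[i:i + group_size])
        packed.append(pack_int4(q))
        scales.append(scale)
    return packed, scales


# Hypothetical weights: a group of small values followed by a group of large ones.
weights = [0.02, -0.01, 0.03, 0.01, 1.4, -0.7, 0.35, 0.0]
per_tensor_q, _ = quantize_group(weights)   # one scale for everything
packed, scales = quantize_groupwise(weights)
print(per_tensor_q[:4], unpack_int4(packed[0], 4))
```

With a single scale for the whole tensor, the four small values all round to zero, while per-group scales keep them distinct. The cost is extra scale metadata and a packed layout that a kernel has to understand, which is why format and kernel support need checking.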



Why Resource Usage Decreases


Two effects can be expected:


First, reduced GPU memory usage. When the memory footprint of model weights shrinks, there's more room on the same GPU — which may open up larger batch sizes or higher concurrent request capacity.


Second, reduced memory bandwidth pressure. During decode, every token generation requires repeated reads of model weights and KV cache. In bandwidth-constrained environments, reducing weight representation size means less data to fetch from GPU memory per step — which can translate to throughput improvements when conditions are right.
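A back-of-the-envelope way to reason about the bandwidth effect: if a decode step is limited purely by how fast weights and KV cache can be read, the step rate is bounded by bandwidth divided by bytes read per step. The sketch below applies that simplified model to a single sequence. The bandwidth, weight sizes, and KV cache size are hypothetical, and the model ignores compute time, dequantization overhead, and batching.

```python
def decode_steps_per_second_upper_bound(weight_bytes, kv_bytes_per_step,
                                        bandwidth_bytes_per_s):
    """Upper bound on decode steps per second when memory reads are the only cost."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes_per_step)


BANDWIDTH = 1e12        # hypothetical 1 TB/s of GPU memory bandwidth
KV_PER_STEP = 1e9       # hypothetical 1 GB of KV cache read per step

fp16_bound = decode_steps_per_second_upper_bound(14e9, KV_PER_STEP, BANDWIDTH)
int4_bound = decode_steps_per_second_upper_bound(3.5e9, KV_PER_STEP, BANDWIDTH)

print(f"16-bit weights: up to {fp16_bound:.0f} steps/s")
print(f" 4-bit weights: up to {int4_bound:.0f} steps/s")
print(f"ratio: {int4_bound / fp16_bound:.2f}x")
```

Even in this idealized model, shrinking the weights to a quarter does not yield a fourfold bound, because the KV cache reads stay the same size.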


That said, the improvement won't be proportional to the weight compression ratio alone. KV cache, activations, and runtime buffers are still in the picture, and additional metadata or dequantization overhead may arise depending on the approach. Actual memory savings and throughput gains vary based on model architecture and inference configuration.
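Two small calculations illustrate why the gain is not proportional. The first adds per-group scale metadata to the nominal bit width; the second compares total memory before and after when only the weights shrink. The group size, scale width, and memory figures are hypothetical placeholders, not measurements.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits per weight once per-group metadata is included."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def total_memory_ratio(weights_gib, other_gib, weight_ratio):
    """Total memory after quantization divided by total memory before it."""
    before = weights_gib + other_gib
    after = weights_gib * weight_ratio + other_gib
    return after / before


# Hypothetical setup: 4-bit weights, one 16-bit scale per 128 weights.
bits = effective_bits(4, 128)
# Hypothetical footprint: 14 GiB of 16-bit weights plus 6 GiB of KV cache and buffers.
ratio = total_memory_ratio(14.0, 6.0, bits / 16)

print(f"effective bits per weight: {bits}")
print(f"total memory after / before: {ratio:.2f}")
```

Under these assumptions the weights shrink to roughly a quarter, but total GPU memory falls by only about half, since everything other than the weights is unchanged.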



What to Watch Out For


Quantization can reduce GPU memory usage, but quality validation before deployment is non-negotiable. Lower precision means approximating original values with fewer bits, and model quality can degrade in the process.

It's also critical to confirm that your specific quantization method is well-supported by your GPU hardware and inference engine. The same INT4 or FP8 scheme can behave very differently across hardware generations, inference frameworks, and kernel implementations.


Hardware Support Matrix by vLLM Quantization Implementation (Source: Official vLLM Documentation)

The chart above shows hardware support ranges for different quantization implementations in vLLM. You don't need to internalize every entry — the key takeaway is that supported hardware and kernels vary across quantization methods.


Rather than choosing based on bit-width alone, verify the combination of model format, quantization method, GPU generation, and inference engine support. And when memory is tight, stepping down gradually — FP16 → INT8 → INT4 — with quality and performance checks at each step is safer than jumping straight to the lowest precision available.
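One way to sketch the step-down procedure is as a walk from the highest precision to the lowest that stops at the first option meeting both the memory budget and the quality bar. Everything here is hypothetical: the `Candidate` type, the thresholds, and especially the memory and quality numbers, which are placeholders standing in for measurements you would collect on your own hardware and evaluation set.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    memory_gib: float  # measured footprint on your own hardware
    quality: float     # score on your own evaluation set, higher is better


def first_acceptable(candidates, memory_budget_gib, min_quality):
    """Return the highest-precision candidate that fits memory and quality, or None."""
    for candidate in candidates:  # ordered from highest to lowest precision
        fits = candidate.memory_gib <= memory_budget_gib
        good_enough = candidate.quality >= min_quality
        if fits and good_enough:
            return candidate
    return None


# Placeholder values for illustration only, not benchmark results.
CANDIDATES = [
    Candidate("FP16", memory_gib=14.0, quality=0.80),
    Candidate("INT8", memory_gib=7.5, quality=0.79),
    Candidate("INT4", memory_gib=4.2, quality=0.76),
]

choice = first_acceptable(CANDIDATES, memory_budget_gib=10.0, min_quality=0.78)
print(choice.name if choice else "no candidate meets both constraints")
```

Returning `None` is a deliberate part of the design: when no precision satisfies both constraints, the answer is a different model or different hardware rather than a silent drop in quality.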


The goal of quantization is to reduce GPU memory consumption and per-request resource usage while staying within an acceptable range for model quality.



Prefix Caching: Reusing the Repeated Parts of a Prompt


If quantization reduces the weight of the model itself, Prefix Caching reduces the work of processing repetitive inputs. LLM prompts generally split into two regions: a fixed region that's shared across many requests, and a variable region that changes with each request. Consider these two:



Request 1:
[You are an expert in systems software. Always respond in English.]
+
[Explain the architecture and operating principles of Linux OS from a systems perspective.]

Request 2:
[You are an expert in systems software. Always respond in English.]
+
[Explain what blockchain is in simple terms.]

The user questions are different, but the system prompt is identical. Without caching, the model recomputes that system prompt from scratch for both requests. At low volume, this is negligible. But with longer system prompts, extensive tool definitions, and increasing traffic, the redundant computation compounds quickly.


The shared leading portion of a prompt is the prefix. Prefix Caching stores the computed results (KV cache) for that common prefix, and reuses them for any subsequent request that begins with the same token sequence — skipping the prefill computation for that segment entirely. vLLM's Automatic Prefix Caching (APC) works on exactly this principle: preserving KV cache from previous requests and reusing it when a new request shares the same prefix.
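As a rough illustration of the store-and-reuse idea, the toy cache below splits each request's tokens into fixed-size blocks and identifies every block by a hash of its own tokens chained with the previous block's hash. It stores only hashes rather than real key/value tensors, and the block size, class name, and hashing scheme are assumptions chosen for illustration, not a description of vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; hypothetical value


class ToyPrefixCache:
    """Tracks which token blocks have been seen, keyed by a chained hash."""

    def __init__(self):
        self.blocks = set()

    @staticmethod
    def _block_hash(parent_hash, block_tokens):
        # Including the parent hash ties a block's identity to everything before it.
        payload = parent_hash + "|" + ",".join(map(str, block_tokens))
        return hashlib.sha256(payload.encode()).hexdigest()

    def prefill(self, tokens):
        """Return (reused_tokens, computed_tokens) for one request."""
        parent, reused, matching = "", 0, True
        full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # only whole blocks are cached
        for start in range(0, full_len, BLOCK_SIZE):
            h = self._block_hash(parent, tokens[start:start + BLOCK_SIZE])
            if matching and h in self.blocks:
                reused += BLOCK_SIZE
            else:
                matching = False      # reuse must be contiguous from the start
                self.blocks.add(h)
            parent = h
        return reused, len(tokens) - reused


shared_prefix = list(range(100, 112))      # 12 hypothetical token IDs
request_1 = shared_prefix + [1, 2, 3, 4, 5]
request_2 = shared_prefix + [9, 8, 7]

cache = ToyPrefixCache()
first = cache.prefill(request_1)
second = cache.prefill(request_2)
print(first, second)
```

The first request computes everything and populates the cache; the second reuses the three blocks covering the shared prefix and computes only its own three tokens. A request that differs in its very first token would reuse nothing, because every later block hash depends on the blocks before it.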



How Prefix Caching Works


KV cache generation and reuse flow during LLM inference (Source: HuggingFace Docs)

As shown above, LLMs generate key/value tensors for attention computation at each layer as they process input tokens. These tensors — collectively called the KV cache — are referenced again during subsequent token generation. Inference breaks down into two stages. During prefill, the model processes the full input prompt and builds the KV cache for each token. During decode, new tokens are generated one at a time, each referencing the accumulated KV cache.


Prefix Caching is an optimization for the prefill stage: once the KV cache for a prefix is computed during the first request, the inference engine stores it. When a new request arrives with the same prefix, that segment is reused, and only the new, unseen portion of the input is processed from scratch.


This optimization is especially impactful in "long shared input + short unique query" workloads. AI agents are a natural fit: tools like Claude Code reference a CLAUDE.md file, and Codex-style agents reference AGENTS.md, both of which hold project-level instructions covering build steps, code style, review criteria, and more. The user's actual task changes with every request, but the project instructions often stay the same across an entire session.


Request 1: [Project instructions: CLAUDE.md or AGENTS.md] + [Fix the login API bug]

Request 2: [Project instructions: CLAUDE.md or AGENTS.md] + [Add tests to the payment module]

Request 3: [Project instructions: CLAUDE.md or AGENTS.md] + [Review this diff from a PR perspective]

All three have different user requests, but the same project instructions. Without caching, the model prefills those instructions three times. With Prefix Caching in place, it prefills them once and reuses the result for requests 2 and 3.
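The difference can be counted in prefill tokens. The sketch below assumes a best case where the shared prefix is cached in full after the first request and never evicted; the token counts are hypothetical.

```python
def prefill_tokens(prefix_len, unique_lens, prefix_cached):
    """Total tokens that must be prefilled across a series of requests."""
    if not unique_lens:
        return 0
    if prefix_cached:
        # The prefix is computed once, then only each request's unique part.
        return prefix_len + sum(unique_lens)
    # Every request recomputes the prefix along with its unique part.
    return prefix_len * len(unique_lens) + sum(unique_lens)


PREFIX_LEN = 3000            # hypothetical length of the project instructions
UNIQUE_LENS = [12, 15, 14]   # hypothetical lengths of the three user requests

without_cache = prefill_tokens(PREFIX_LEN, UNIQUE_LENS, prefix_cached=False)
with_cache = prefill_tokens(PREFIX_LEN, UNIQUE_LENS, prefix_cached=True)
print(without_cache, with_cache)
```

With these numbers the cached case prefills roughly a third of the tokens, and the gap widens as more requests share the same instructions.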


This is becoming increasingly relevant as LLM services move toward longer input contexts. Simple Q&A prompts are relatively short, but agents routinely pass in system instructions, project rules, tool definitions, output format specs, reference documents, and conversation history, all before a single output token is generated. When many requests share the same long prefix, the cumulative cost of recomputing it grows with both the number of requests and the length of the prefix.


Prefix Caching is less a "prompt storage feature" and more an optimization strategy for managing prefill costs and TTFT in AI agent workloads and long-context LLM services — particularly where long, stable prefixes repeat across a consistent set of requests.


What to Watch Out For


Prefix Caching only delivers value when requests actually share prefixes frequently. If system prompts change per request, or if variable values like user IDs or timestamps are embedded in the prompt prefix, cache hit rates will drop. Even prompts that look nearly identical to a human reader won't match as a shared prefix if the underlying token sequences differ.


To get the most out of Prefix Caching, keep prompt structure stable. In general, static content — system instructions, tool definitions, output formatting — belongs at the front, and per-request variable content belongs at the back. RAG contexts can also benefit if the same document set recurs across multiple queries; a long policy document or manual queried repeatedly is a good candidate for prefix reuse.
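The effect of ordering can be seen by measuring how many leading tokens two prompts share. In the sketch below, whitespace splitting stands in for a real tokenizer, and the timestamp field and both template functions are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading items that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are an expert in systems software. Always respond in English."


def variable_first(timestamp, question):
    """Per-request value placed before the static instructions."""
    return f"[{timestamp}] {SYSTEM} {question}".split()


def static_first(timestamp, question):
    """Static instructions placed before the per-request value."""
    return f"{SYSTEM} [{timestamp}] {question}".split()


t1, t2 = "2024-05-07T10:00:00", "2024-05-07T10:00:05"
q1, q2 = "Explain the Linux scheduler.", "Explain what blockchain is."

shared_variable_first = common_prefix_len(variable_first(t1, q1), variable_first(t2, q2))
shared_static_first = common_prefix_len(static_first(t1, q1), static_first(t2, q2))
print(shared_variable_first, shared_static_first)
```

Putting the timestamp first leaves nothing to reuse, because the prompts diverge at the first token. Moving it behind the system prompt keeps the entire instruction block shareable.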


Memory is another consideration. The KV cache preserved for prefix reuse still occupies GPU memory. More cache means more reuse potential, but also less headroom for concurrent requests and longer context windows. Finding the right balance is part of the tuning process.
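For a sense of scale, KV cache size can be estimated from the model's shape: one key and one value vector per token, per layer, per KV head. The configuration below is hypothetical and the formula assumes a standard attention layout with 16-bit cache values; architectures that compress or share KV state will differ.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimated KV cache size for a sequence of num_tokens tokens."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    per_token = keys_and_values * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
prefix = kv_cache_bytes(4000, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"4000-token prefix: {prefix / 2**20:.0f} MiB")
```

Under these assumptions a single 4000-token prefix holds about 500 MiB, so keeping many distinct prefixes resident competes directly with memory for active requests.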


One last point: prefix matching is based on token sequences, not semantic similarity. Prompts that appear nearly identical in plain text may produce different token sequences depending on templating and serialization choices. Managing prompt templates consistently can have a significant impact on cache efficiency.
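Serialization is a common place for such differences to creep in. The sketch below shows two tool definitions with identical content but different key order producing different strings, and one possible way to canonicalize them before they go into a prompt. The tool definition and the helper name are hypothetical.

```python
import json


def canonical_json(obj):
    """Serialize with sorted keys and fixed separators so equal content yields equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Same content, different key order (for example, built by two different code paths).
tool_a = {"name": "search", "parameters": {"query": "string", "limit": "integer"}}
tool_b = {"parameters": {"limit": "integer", "query": "string"}, "name": "search"}

default_match = json.dumps(tool_a) == json.dumps(tool_b)
canonical_match = canonical_json(tool_a) == canonical_json(tool_b)
print(default_match, canonical_match)
```

Default serialization preserves insertion order, so the two strings differ and would tokenize differently. Canonicalizing once, in one place, keeps the prefix byte-for-byte stable across requests.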


The goal of Prefix Caching is not to cache everything — it's to reuse stable, high-repetition input segments reliably enough to reduce prefill costs and improve time-to-first-token.





Closing


This post covered two techniques we regularly evaluate when thinking about LLM deployment efficiency: Quantization and Prefix Caching.


To summarize: Quantization represents model weights (and optionally activations and KV cache) in lower precision to reduce memory usage and bandwidth pressure. Prefix Caching reuses computed KV cache for shared prompt prefixes to eliminate redundant prefill computation.


Neither technique delivers guaranteed improvements by default. Quantization requires validating the quality-efficiency tradeoff at each precision level. Prefix Caching requires confirming that your actual workload consistently produces cache-hit-worthy prefixes. How far to push either approach depends on your model, prompt structure, traffic patterns, GPU environment, and quality requirements.


At AIEEV, we're continuing to explore how these techniques apply to AirAPI workloads. If the opportunity comes up, I'd like to share more in a follow-up post — including what the application process actually looks like and what we observe along the way.


Thanks for reading.








| Dev Team

| Author: JB Kim

| Site: Linkedin
