Cache-Aware Skill Design

How prompt caching, KV cache, and stable instruction modules can change the cost of agent workflows

Jun 08, 2026

Prompt caching is often described as a cost optimization.

If a model provider sees the same input tokens again, those repeated tokens may be processed at a lower cost. That description is accurate, but incomplete.

OpenAI’s prompt caching docs describe cache hits as exact prefix reuse and recommend putting static content at the beginning of the prompt, with variable content near the end. A cache hit means the model server has recently processed the same prompt prefix during inference, so it can reuse stored model state for that matching portion of the input.

That detail matters for agent design.

Agents routinely send repeated context across turns and tasks: tool definitions, system prompts, Skill instructions, output contracts, examples, source-handling rules, conversation history, retrieved documents, and tool results.

Some of that context is stable. Some of it changes on every run.

Prompt caching can reward systems that separate those two categories cleanly.

A Skill with stable instructions, examples, and output rules can become a reusable prompt prefix. A Skill that places timestamps, run IDs, retrieved documents, or task-specific state before its stable instructions may reduce the opportunity for cache reuse.

The practical implication is straightforward:

A well-designed Skill does more than tell the model what to do. It also gives the model server a stable structure it can reuse.

This is why prompt caching should not be treated only as a pricing feature. For agents and Skills, prompt structure can become part of system architecture.

Cached Tokens Mean Reused Computation

The phrase “cached tokens” can make prompt caching sound like text storage.

That framing misses the mechanism.

The model server is not caching a response. It is checking whether the new request begins with a prefix it has already processed. When the prefix matches, the server can reuse stored model state for that matching portion of the input.

As noted above, the OpenAI docs recommend placing static content at the beginning of the prompt and variable content near the end.

That recommendation is the first design rule.

Stable material belongs early:

[system instructions]
[tool definitions]
[Skill instructions]
[output contract]
[examples]

Variable material belongs later:

[current task]
[retrieved documents]
[tool results]
[timestamps]
[run IDs]

Prompt caching visualization: prompt caching starts with prefix alignment. If the new request begins with the same token pattern, the serving layer can reuse cached state. If the beginning changes, the reusable prefix can collapse, even when later parts of the prompt look familiar.

The important word is prefix.

Prompt caching does not usually search the whole prompt for similar meaning. It does not see that two prompts both mention the same document, paragraph, or phrase and automatically reuse that work wherever it appears. The cached state depends on the exact token sequence, its order, and where that sequence appears in the prompt.

That makes small layout choices matter.

A timestamp at the top of the prompt can change the prefix.

A random run ID can change the prefix.

A retrieval system that inserts source chunks before the stable Skill body can change the prefix.

A tool description that includes dynamic runtime state can change the prefix.

Each of those choices may be reasonable in isolation. Together, they make the beginning of the prompt less stable. That reduces the amount of work the serving layer can reuse.

For agent systems, this is the practical consequence:

Prompt caching rewards stable beginnings.

The stable part of the agent should be early. The variable part should be later.
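As a minimal sketch of that rule, a prompt builder can keep the two categories in separate lists and always emit the stable one first. The `PromptParts` class, the `build_prompt` function, and the section strings are hypothetical names made up for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass, field


@dataclass
class PromptParts:
    """Hypothetical container that keeps stable and variable sections apart."""
    stable: list = field(default_factory=list)   # identical text on every run
    dynamic: list = field(default_factory=list)  # changes from run to run


def build_prompt(parts: PromptParts) -> str:
    # Stable sections are always emitted first so the shared beginning never moves.
    return "\n\n".join(parts.stable + parts.dynamic)


stable = [
    "SYSTEM: You are a research agent.",
    "SKILL: Summarize sources and cite them.",
]
run_a = PromptParts(stable, ["TASK: summarize doc 1", "RUN_ID: a1"])
run_b = PromptParts(stable, ["TASK: summarize doc 2", "RUN_ID: b2"])

prompt_a = build_prompt(run_a)
prompt_b = build_prompt(run_b)
stable_text = "\n\n".join(stable)

# Both prompts begin with the same stable text even though the runs differ.
print(prompt_a.startswith(stable_text), prompt_b.startswith(stable_text))
```

The run ID and task still reach the model. They just arrive after the part that repeats.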

What Is Actually Being Cached?

Prompt caching is easier to understand if we separate four layers:

tokens
attention
KV cache
prompt cache

Tokens are the units the model processes. The prompt is not handled as raw prose; it is first broken into tokens.

Attention is the mechanism the model uses to relate those tokens to one another.

KV cache is the stored attention state created while the model processes tokens.

Prompt cache is the serving-layer feature that can reuse that stored state when a later request starts with the same prefix.

The confusing part is the word “key.”

In normal software, a cache usually has a key and a value:

cache[key] = value

Prompt caching has something like that too:

prompt_cache[hash(exact_token_prefix)] = stored_model_state

But the “K” in KV cache is not the hash used to look up a cached prompt prefix.

The original Transformer paper defines attention over queries, keys, and values. That is where the terminology comes from. In the KV cache, the K is an attention key and the V is an attention value. They are internal tensors created by the model during inference, not the lookup key and value of a normal software cache.

That distinction matters.

A simplified version looks like this:

Cache lookup key: 
hash(exact token prefix)  

Cached value: 
attention key tensors + attention value tensors

When we say prompt caching reuses KV cache, we are not saying the model is doing a database lookup where prompt text maps to an answer.

We are saying the serving layer can find a matching prompt prefix and reuse the key/value attention state the model already computed for that prefix.
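A toy version of the lookup side makes the distinction concrete. This is only a sketch: the `prompt_cache` dict, the `prefix_key` helper, and the string standing in for attention tensors are all illustrative assumptions, and real serving systems store GPU tensors rather than Python strings.

```python
import hashlib
from typing import Optional

# Toy stand-in for a serving-layer cache: lookup key -> stored model state.
prompt_cache = {}


def prefix_key(token_ids: list) -> str:
    # The lookup key is derived from the exact token sequence, in order.
    raw = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def store(token_ids: list, state: str) -> None:
    prompt_cache[prefix_key(token_ids)] = state


def lookup(token_ids: list) -> Optional[str]:
    return prompt_cache.get(prefix_key(token_ids))


# The stored value is a placeholder for attention key/value tensors.
store([101, 7, 42], "K/V tensors for 3 tokens")

hit = lookup([101, 7, 42])        # same tokens, same order
reordered = lookup([7, 101, 42])  # same tokens, different order
```

Here `hit` returns the stored state and `reordered` returns `None`: the hash is the software cache key, while the attention keys and values live inside the cached value.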

Sebastian Raschka’s KV cache walkthrough gives a concrete inference example: as a model generates one token at a time, it can reuse previously computed key and value vectors instead of recomputing them at each step.

vLLM’s prefix-caching docs describe the cross-request version: processed requests leave behind KV-cache blocks, and later requests with the same prefix can reuse those blocks instead of recomputing them.

That is the bridge between the API feature and the model internals.

The API exposes the result as cached tokens. The serving system manages the cache. The model state being reused is tied to attention.

Why KV Cache Exists

KV cache exists because generation is sequential.

A model does not write an entire response at once. It generates one token, appends that token to the context, then generates the next token.

A simplified sequence looks like this:

Prompt:
Time 

Step 1:
Time → flies  

Step 2:
Time flies → fast  

Step 3: 
Time flies fast → .

At each step, the model needs access to the tokens that came before.

Without a KV cache, the model would repeatedly recompute the attention keys and values for tokens it had already processed.

Step 1: 
compute K/V for “Time”  

Step 2: 
compute K/V for “Time” again 
compute K/V for “flies”  

Step 3: 
compute K/V for “Time” again 
compute K/V for “flies” again 
compute K/V for “fast”

That is wasted work.

With KV cache, the earlier tokens do not need to be recomputed every time.

Step 1: 
compute K/V for “Time”
store it  

Step 2: 
reuse K/V for “Time”
compute K/V for “flies” 
store it  

Step 3: 
reuse K/V for “Time” 
reuse K/V for “flies” 
compute K/V for “fast” 
store it

This is the basic inference-time benefit of KV cache.

It does not make the model smarter. It does not change the answer. It reduces repeated computation.
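The saving can be counted with a toy decoding loop. This sketch only tallies how many per-token K/V computations happen; it does not run a model, and the `kv_computations` function is a made-up helper for illustration.

```python
def kv_computations(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V computations in a toy decoding loop."""
    total = 0
    cached = 0            # tokens whose K/V is already stored
    context = prompt_len  # tokens currently in the context
    for _ in range(new_tokens):
        if use_cache:
            total += context - cached  # only tokens not processed before
            cached = context
        else:
            total += context           # recompute every token in the context
        context += 1                   # the generated token joins the context
    return total


# The "Time flies fast" walk-through: a 1-token prompt and 3 generation steps.
without_cache = kv_computations(prompt_len=1, new_tokens=3, use_cache=False)
with_cache = kv_computations(prompt_len=1, new_tokens=3, use_cache=True)
print(without_cache, with_cache)  # 6 3
```

Without the cache the count grows quadratically with the length of the generation. With it, each token's K/V is computed once.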

Sebastian Raschka’s KV cache walkthrough gives the clean version of this example: during autoregressive generation, the model would otherwise recompute key and value vectors for earlier tokens at each step.

Prompt caching extends this idea across requests.

Within one response, KV cache lets the model reuse state from earlier tokens in the same generation.

Across requests, prompt caching lets the serving layer reuse state from a previous request when a new request starts with the same prefix.

That is the bridge we need for agents.

Prompt Caching Extends KV Reuse Across Requests

KV cache usually starts inside a single generation.

The model processes a prompt, creates key/value attention state, and reuses that state as it generates the next token, then the next token, then the next.

Prompt caching moves the reuse boundary.

Instead of reusing prior token state only inside one response, the serving layer can reuse state from a previous request when a new request starts with the same prefix.

Request 1:

[stable Skill instructions] 
[stable output contract] 
[stable examples] 
[current task A]

Request 2:

[stable Skill instructions] 
[stable output contract] 
[stable examples] 
[current task B]

The beginning is the same.

The model server does not need to process that shared prefix as if it were new every time. It can reuse the state computed when it processed the earlier request, then continue from the new suffix.

That is the practical bridge between KV cache and prompt caching.

Within one response:
reuse prior token state from the same generation

Across requests:
reuse prior prefix state from an earlier request

The API usually hides the details. You see the result as cached input tokens, lower cached-token pricing, or lower latency when the cache hit affects the prefill path.

The implementation underneath is still about model state.

vLLM’s prefix-caching docs describe this directly: the system caches KV-cache blocks from processed requests and reuses those blocks when a new request has the same prefix.

That is why prompt caching is not only a billing abstraction. It is an inference-serving optimization exposed through the API.
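One way to see the result from the application side is to read the usage block the API returns. The sketch below assumes a usage payload shaped like the one OpenAI's Chat Completions API reports, with a `prompt_tokens_details.cached_tokens` field; other providers name these fields differently, and the sample numbers are invented for illustration.

```python
def cached_ratio(usage: dict) -> float:
    """Fraction of prompt tokens reported as cached; 0.0 if nothing is reported."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Invented sample payload, not a real API response.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 100,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
ratio = cached_ratio(sample_usage)
print(ratio)  # 0.5
```

Logging this ratio per request is a cheap way to notice when a layout change has broken the prefix.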

Why Prefix Order Matters

Prompt caching is strict about order.

The cache does not look for familiar words scattered throughout the prompt. It looks for a matching beginning.

Layout decides whether a shared prefix survives. Take these two requests:

Request 1: 
[stable Skill instructions]
[dynamic source documents] 
[current task]  

Request 2: 
[stable Skill instructions] 
[different source documents] 
[different task]

In both requests, the stable Skill instructions come first. The shared prefix is intact.

Now compare that with this layout:

Request 1: 
[timestamp A] 
[dynamic source documents A] 
[current task A] 
[stable Skill instructions]  

Request 2: 
[timestamp B] 
[dynamic source documents B] 
[current task B] 
[stable Skill instructions]

The stable Skill instructions are still present, but they are no longer the beginning of the prompt.

The prefix changed before the reusable material appeared.

That is the failure mode.

This is why a small amount of dynamic text at the top of the prompt can matter. A timestamp, run ID, tool result, or changing retrieval block can move the entire request out of alignment.

The model server may still receive the same Skill body later in the prompt. But for prefix caching, later is often too late.
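The two layouts can be compared with a small shared-prefix counter. This is a sketch over made-up string tokens; `shared_prefix_len` is a hypothetical helper, and real caches compare model token IDs rather than words.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


skill = ["<skill>", "read", "cite", "</skill>"]

# Stable-first: the Skill opens both requests.
stable_first_1 = skill + ["docs-A", "task-A"]
stable_first_2 = skill + ["docs-B", "task-B"]

# Dynamic-first: the same Skill is present, but only at the end.
dynamic_first_1 = ["ts-A", "docs-A", "task-A"] + skill
dynamic_first_2 = ["ts-B", "docs-B", "task-B"] + skill

stable_shared = shared_prefix_len(stable_first_1, stable_first_2)
dynamic_shared = shared_prefix_len(dynamic_first_1, dynamic_first_2)
print(stable_shared, dynamic_shared)  # 4 0
```

Both pairs contain the same four Skill tokens. Only the first pair shares them as a prefix.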

OpenAI’s prompt caching docs make the design rule explicit: static content should go near the beginning of the prompt, and variable content should go near the end.

For agents, that becomes a concrete layout rule:

Stable first. 
Dynamic second.

That rule is simple, but it changes how an agent harness should be written.

Tool definitions, system instructions, Skill bodies, output contracts, examples, and validation rules should be stable and early.

Retrieved documents, timestamps, run IDs, tool outputs, and task-specific state should be later.

The serving layer can only reuse the prefix you actually give it.

Methods for Prefix Caching

The simple version of prompt caching is easy to say:

same prefix → reuse cached state

The implementation is more complicated.

A serving system has to answer several questions before reuse can happen:

How do we identify a matching prefix?
How do we store the KV state?
How do we route future requests back to the right cache? 
What happens when memory fills? 
Can we reuse anything beyond the prefix?

There is a family of methods to solve this.

Exact-prefix reuse

This is the basic case.

Two requests start with the same token sequence. The serving layer identifies the shared beginning and reuses cached state for that prefix.

Request 1:
[stable system prompt]
[stable Skill]
[task A]

Request 2:
[stable system prompt]
[stable Skill]
[task B]

The shared prefix is:

[stable system prompt][stable Skill]

That is the part the system can reuse.

This is the model most API users need to understand first. If the beginning changes, the cacheable prefix shrinks or disappears.

Block-hash prefix caching

A serving system does not need to treat the prompt as one giant cache entry.

It can split the prompt into blocks.

Block 1: tokens 1-128
Block 2: tokens 129-256
Block 3: tokens 257-384

Each block can be associated with a hash. The hash can include both the block itself and the prefix that came before it.

That lets the system find the longest matching chain.

Block 1 matches
Block 2 matches
Block 3 changes

In that case, the server can reuse blocks 1 and 2, then recompute from block 3 onward.

This is why order matters. A block is not only “these tokens.” It is “these tokens after this prior prefix.”
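A rough sketch of chained block hashing, under the assumption that each block's hash covers its own tokens plus the hash of the block before it. The block size, the `block_hashes` and `reusable_blocks` helpers, and the use of SHA-256 are illustrative choices, not the actual implementation of vLLM or any other server.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use larger blocks


def block_hashes(token_ids: list, block_size: int = BLOCK_SIZE) -> list:
    """Hash each full block together with the hash of everything before it."""
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached: list, incoming: list) -> int:
    """Length of the longest matching hash chain, counted from the start."""
    n = 0
    for c, i in zip(cached, incoming):
        if c != i:
            break
        n += 1
    return n


request_1 = list(range(12))                   # three full blocks
request_2 = list(range(8)) + [99, 98, 97, 96]  # third block differs
matched = reusable_blocks(block_hashes(request_1), block_hashes(request_2))

# The same four tokens hash differently when a different prefix precedes them.
repeated = block_hashes([1, 2, 3, 4, 1, 2, 3, 4])
```

Here `matched` is 2, so the first two blocks could be reused and the third recomputed. The two entries in `repeated` differ even though the block contents are identical, which is the "these tokens after this prior prefix" property.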

vLLM’s prefix-caching docs describe this kind of design: processed requests leave behind KV-cache blocks, and later requests with the same prefix can reuse those blocks instead of recomputing them.

Paged KV cache

KV cache can get large.

Long prompts create large key/value state. Long-running agents create even more. Multiple concurrent users make the problem worse.

Paged KV cache treats cached state more like memory pages than one continuous allocation.

That matters because the serving system needs to allocate, reuse, share, and evict KV state efficiently. Without that, memory fragmentation and wasted GPU memory can become bottlenecks.

For a builder, the main point is simple:

Prompt caching is not only a matching problem. 
It is also a memory-management problem.
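To make the memory side tangible, here is a toy pool of fixed-size KV blocks with reference counts, so that two requests sharing a prefix can point at the same block. The `BlockPool` class is a hypothetical sketch of the bookkeeping only; it does not model GPU memory or any specific server's allocator.

```python
class BlockPool:
    """Toy pool of fixed-size KV blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))  # block IDs available for use
        self.refs = {}                       # block ID -> number of users

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; something must be evicted")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def share(self, block: int) -> None:
        # A second request reusing the same prefix block bumps the count.
        self.refs[block] += 1

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            # Last user is gone, so the block returns to the free list.
            del self.refs[block]
            self.free.append(block)


pool = BlockPool(num_blocks=2)
shared = pool.allocate()  # request 1 computes the prefix block
pool.share(shared)        # request 2 reuses it
pool.release(shared)      # request 1 finishes
still_held = shared in pool.refs
pool.release(shared)      # request 2 finishes
```

Because blocks are fixed-size and tracked individually, a shared prefix is stored once and freed only when nobody is using it.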

Prefix trees and radix caching

Some workloads share a common root and then branch.

Agents do this constantly.

shared agent harness
├── research Skill
│   ├── task A
│   └── task B
└── coding Skill
    ├── task C
    └── task D

A prefix tree stores shared beginnings once, then branches when the prompts diverge.

SGLang’s RadixAttention uses this kind of idea. It organizes reusable prompt state in a radix tree so shared prefixes can be found, reused, inserted, and evicted more efficiently.

This maps well to agent systems because agents are not random one-off prompts. They often reuse the same harness, then branch by Skill, task, tool, or phase.
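The branching structure can be sketched with a plain token-level prefix tree. This is a simplified illustration, not SGLang's implementation: a radix tree would additionally compress chains of single-child nodes, and the `PrefixTree` class and its string "tokens" are made up for the example.

```python
class PrefixNode:
    def __init__(self) -> None:
        self.children = {}  # token -> PrefixNode


class PrefixTree:
    """Toy token-level prefix tree for finding the longest cached prefix."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: list) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def longest_prefix(self, tokens: list) -> int:
        node, depth = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            depth += 1
        return depth


tree = PrefixTree()
tree.insert(["harness", "research", "task A"])
tree.insert(["harness", "coding", "task C"])

same_skill = tree.longest_prefix(["harness", "research", "task B"])
new_skill = tree.longest_prefix(["harness", "data", "task E"])
print(same_skill, new_skill)  # 2 1
```

A new task under a known Skill reuses the harness and the Skill. A new Skill still reuses the harness.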

Cache-aware routing

A cache hit only helps if the request reaches the place where the cached state lives.

In a distributed serving system, there may be many workers. One worker may have the cached prefix. Another may not.

If the next request lands on the wrong worker, the system may have to recompute the prefix or move cache state across machines.

That is why routing matters.

Application design gives the serving layer stable prefixes. Routing decides whether later requests reach the cache that already holds them.
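One simple way to get that behavior, sketched under assumptions, is to route on a hash of the first few tokens so that requests sharing a stable prefix land on the same worker. The `ROUTING_PREFIX_LEN` constant and the `route` function are hypothetical; production routers also weigh load, cache occupancy, and prefix length.

```python
import hashlib

ROUTING_PREFIX_LEN = 8  # assumed number of leading tokens used as the routing key


def route(token_ids: list, num_workers: int) -> int:
    """Pick a worker from a hash of the request's leading tokens."""
    head = token_ids[:ROUTING_PREFIX_LEN]
    raw = ",".join(map(str, head)).encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


stable_prefix = list(range(8))
worker_a = route(stable_prefix + [100, 101], num_workers=4)
worker_b = route(stable_prefix + [200], num_workers=4)

# Same stable prefix, different suffixes: both land on the same worker.
print(worker_a == worker_b)  # True
```

The same layout rule applies here: if dynamic content leads the prompt, the routing key changes on every request too.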

Cache eviction

Caches cannot keep everything forever.

KV cache consumes memory, and GPU memory is expensive. The serving layer has to decide what to keep and what to evict.

Simple eviction policies may keep recent cache entries and discard older ones. More advanced policies may consider which prefixes are likely to be reused, how large they are, and how expensive they are to recompute.

This matters for agents because not all prompt sections have equal reuse value.

A stable Skill body may be reused thousands of times.

A one-off tool result may never be reused.

A cache-aware system should prefer keeping the first kind.
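The simple end of that spectrum is least-recently-used eviction, sketched below. The `LRUPrefixCache` class and its string values are illustrative stand-ins; it only considers recency, not the size or recompute cost that more advanced policies would weigh.

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Toy LRU cache: the least recently used prefix is evicted first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # prefix key -> stored state

    def get(self, key: str):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key: str, state: str) -> None:
        self.entries[key] = state
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry


cache = LRUPrefixCache(capacity=2)
cache.put("skill_body", "state for the Skill")
cache.put("tool_result_1", "state for a one-off result")
cache.get("skill_body")  # the Skill is reused
cache.put("tool_result_2", "state for another one-off result")

kept = list(cache.entries)
print(kept)  # ['skill_body', 'tool_result_2']
```

Even this plain policy keeps the Skill body around as long as it keeps getting hit, which is another reason stable prefixes pay off.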

Beyond-prefix reuse

Most production prompt caching is built around exact prefixes.

But agent workloads are messier than that.

The same document chunk may appear in different positions. The same source may be reused across turns. The same tool result may show up in another branch of the workflow.

Classic prefix caching will not always catch that.

Newer work is exploring whether reusable KV state can be recovered from repeated segments, not just repeated beginnings. That is a harder problem because the model state for a segment depends on what came before it.

For now, the practical rule remains:

Design for prefix caching first.

Put stable content at the beginning. Keep it stable. Move dynamic context later.

The serving systems will keep getting better. But the builder can already do the most important thing:

Give the cache a stable prefix to reuse.

Skills as Cacheable Instruction Modules

In this post, a Skill means a reusable instruction module.

That could be a Claude Skill. It could be a Markdown file in an agent repo. It could be a prompt module loaded by Codex, a tool-specific operating procedure, or a workflow template inside an internal agent platform.

In most agent systems, a Skill eventually becomes text in the request:

[Skill purpose] 
[when to use it] 
[workflow] 
[output contract] 
[examples] 
[validation rules]

That text is usually stable.

The user task changes. The retrieved documents change. Tool results change. Run state changes.

But the Skill body often stays the same.

That makes Skills natural cache candidates.

A Skill is already meant to be reused at the instruction level. Prompt caching adds a second kind of reuse: the serving layer may be able to reuse the model state created from those same instructions.

That only works if the Skill is placed where the cache can use it.

A Skill loaded after dynamic context is still useful to the model, but it may not be useful to the prompt cache.

Cache-hostile layout: 
[dynamic docs] 
[current task] 
[Skill body]  

Cache-aware layout: 
[Skill body] 
[dynamic docs] 
[current task]

The content is the same. The cache behavior can be very different.

This is the design implication:

A Skill should not only be reusable as an instruction. 
It should be positioned as a reusable prefix.

That does not mean every Skill should be loaded all the time. Large unused Skills create their own cost and context problems. Anthropic’s Skill system uses progressive disclosure: lightweight metadata helps the model decide whether a Skill is relevant, then the full Skill and supporting resources load only when needed.

That pattern still fits the caching argument.

Once a Skill is selected, its stable body should remain stable. Its dynamic inputs should come later.

Cache-Aware Skill Design

The design pattern is simple:

Cache the workflow. Vary the inputs.

A Skill usually contains the stable task frame:

purpose 
workflow 
output contract 
examples 
citation rules 
validation checklist 
source-handling rules

The current run supplies the changing inputs:

user request 
retrieved documents 
current files 
tool outputs 
timestamps 
run IDs 
temporary constraints

Those two categories should not be mixed casually.

A cache-aware Skill keeps the stable task frame intact and places dynamic material after it.

[stable Skill body] 
[dynamic sources] 
[current task input]

A cache-hostile Skill puts changing material first.

[timestamp] 
[run ID] 
[dynamic sources] 
[current task input] 
[stable Skill body]

This difference fundamentally changes what the model server sees as the reusable beginning of the request.

This does not mean every Skill should be loaded eagerly. Loading a large unused Skill just to make it cacheable can waste tokens. The better pattern is staged loading.

First, keep a small, stable routing layer:

available Skills 
when to use each Skill 
short descriptions 
selection rules

Then, once a Skill is selected, load the full stable Skill body before the dynamic task context.

[stable router] 
[selected stable Skill] 
[dynamic task context]

That gives the system two possible layers of reuse:

the router can be stable across many calls 
the selected Skill can be stable across repeated uses
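The two layers can be sketched as a request builder that always emits router, then selected Skill, then dynamic context. The `ROUTER` and `SKILLS` strings and the `build_request` function are placeholders invented for this example, not a real Skill loader.

```python
import os

ROUTER = "ROUTER: available Skills are research and coding. Pick one by task type."
SKILLS = {
    "research": "SKILL research: read the sources and cite every claim.",
    "coding": "SKILL coding: read the files and propose a minimal patch.",
}


def build_request(selected: str, dynamic_context: str) -> str:
    # Order is fixed: stable router, then the stable Skill, then per-run context.
    return "\n\n".join([ROUTER, SKILLS[selected], dynamic_context])


research_1 = build_request("research", "TASK: summarize article 1")
research_2 = build_request("research", "TASK: summarize article 2")
coding_1 = build_request("coding", "TASK: fix the failing test")

# os.path.commonprefix compares strings character by character.
same_skill_prefix = os.path.commonprefix([research_1, research_2])
cross_skill_prefix = os.path.commonprefix([research_1, coding_1])
```

Two research calls share the router and the research Skill. A research call and a coding call still share the router.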

This also helps with source diversity.

A research Skill may receive different articles every run. A repo Skill may receive different files. A data Skill may receive different schemas, queries, or results.

That variety belongs in the dynamic suffix.

The Skill should define how to use sources. The sources themselves should come later.

[Skill: how to read and cite sources] 
[Sources: the actual documents for this run]

The same applies to tools.

Tool definitions and tool-use rules should be stable. Tool results should be dynamic.

[stable tool definitions] 
[stable tool-use rules] 
[dynamic tool results]

The goal is not to optimize the prompt for caching at the expense of the task. The goal is to avoid wasting cacheability by accident.

If two prompt layouts are equally good for the model, choose the one that gives the serving layer more stable structure to reuse.

Benchmark Results

The benchmark showed that stable instruction modules placed at the front of the prompt became reusable prefixes, producing far more cache hits and materially reducing estimated warm-request cost.

This result is not only about Skills as a product concept. It applies to any stable instruction module: a Skill, workflow template, tool procedure, rubric, output contract, or source-handling guide.

I used a synthetic Skill body rather than a platform-native Skill object so the test could isolate layout: stable instruction module first versus dynamic context first.

The benchmark compared four layouts:

dynamic_first_cache_hostile
[timestamp][run ID][dynamic docs][task][stable Skill]  

stable_skill_first_cache_aware 
[stable Skill][dynamic docs][task][timestamp]  

stable_skill_first_deterministic_sources 
[stable Skill][dynamic docs ordered deterministically][task]  

dynamic_prefix_control 
[random run ID][stable Skill][dynamic docs][task]

Before interpreting the results, I checked that the prompts were constructed correctly. The stable-first prompts started with the Skill body. The dynamic-first prompts started with changing content. The stable Skill body stayed byte-for-byte identical. No cold request was contaminated by a prior cache hit.

The cache-hit split was clean:

Stable Skill-first layouts: 
19 / 20 warm cache hits  

Dynamic-first layouts: 
0 / 20 warm cache hits

The token mix showed the practical difference.

dynamic_first_cache_hostile 
warm mean prompt tokens: 9,476.5 
warm mean cached tokens: 0 
warm mean fresh input tokens: 9,476.5  

stable_skill_first_cache_aware 
warm mean prompt tokens: 9,455 
warm mean cached tokens: 8,960 
warm mean fresh input tokens: 495

Roughly the same amount of prompt context produced a different input profile. In the dynamic-first layout, every input token was processed fresh. In the stable Skill-first layout, most of the repeated instruction body became cached input.

Using OpenAI’s published GPT-4.1 mini prices at the time of writing, the estimated warm-request cost changed materially. The exact dollars are model- and date-specific, but the token economics are the point.

dynamic_first_cache_hostile 
about $0.00385 per warm request  

stable_skill_first_cache_aware 
about $0.00116 per warm request

That is roughly a 70% reduction in estimated warm-request cost for this synthetic benchmark.
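The input side of that arithmetic can be sketched from the token counts above. The per-million-token prices in this block are assumptions for illustration, not figures taken from the benchmark, and the sketch counts input tokens only, so it will not reproduce the dollar estimates above exactly. Check current provider pricing before relying on it.

```python
# Assumed prices in dollars per million tokens; verify against current pricing.
INPUT_PER_M = 0.40   # fresh input tokens
CACHED_PER_M = 0.10  # cached input tokens


def input_cost(fresh_tokens: float, cached_tokens: float) -> float:
    """Estimated input-only cost of one request in dollars."""
    return (fresh_tokens * INPUT_PER_M + cached_tokens * CACHED_PER_M) / 1_000_000


# Warm mean token counts reported in the benchmark above.
hostile = input_cost(fresh_tokens=9476.5, cached_tokens=0)
aware = input_cost(fresh_tokens=495, cached_tokens=8960)

reduction = 1 - aware / hostile
cached_share = 8960 / 9455  # cached fraction of the stable-first prompt
```

Under these assumed prices the input-only reduction comes out close to the roughly 70% figure, and about 95% of the stable-first prompt is served from cache.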

The latency result was less clean. TTFT improved in the stable-first variants, but hosted API latency includes routing, queueing, server load, streaming behavior, network timing, and output generation. I would treat the latency numbers as directional, not guaranteed.

The stronger result is about cache eligibility and token economics:

Stable Skill-first layout:
high cache-hit rate 
high cached-token ratio 
low fresh-input-token count  

Dynamic-first layout: 
zero cache hits 
all input tokens processed fresh

That is the design point. The Skill body was not only instruction text. In the stable-first layout, it became a reusable prefix the serving layer could cache.

Closing

Prompt caching started as a pricing detail for me.

It is not just that.

For agent systems, it changes the design question.

Not only:

What context should the model have?

Also:

Where does that context live? How often does it change? Can the serving layer reuse it?

Skills make that question concrete.

A Skill is reusable guidance for the model. If it is written and positioned carefully, it can also become reusable work for the inference system.

That does not make Skills magic. It makes them a useful design boundary.

The stable part of the workflow can become the prefix.

The changing inputs can become the suffix.

That will not be the right layout for every task. Some systems need dynamic routing, safety state, permissions, or retrieved evidence earlier in the prompt. Some frameworks will reorder or compress context before the provider sees it.

So the point is not to worship stable prefixes.

The point is to know when you are breaking one.

Prompt caching gives agent builders a new thing to measure: not just answer quality, not just total tokens, but whether repeated work is actually being reused.

Another Coding Blog
