BetterDB

The Glue Layer Most People Don't Design Intentionally: Caching, State, and Session Management in AI Apps

Kristiyan Ivanov

Your AI app works great in dev. Then LLM costs spike, latency doubles, and your agent forgets mid-task. The problem is the layer between your app and your model.


Your AI app works great in development. The LLM responds quickly, context carries across turns, costs are manageable. You ship it. Three weeks later, your LLM bill looks like a ransom note, response times have quietly doubled, and your "stateful" agent forgot what it was doing halfway through a multi-step task.

The problem isn't your model choice. It's not your prompt engineering. It's the layer between your application and your model - the caching, the state management, the session persistence. You built it as an afterthought, and now it's load-bearing.

It's the layer most people don't design intentionally. It just kind of happens, usually as a series of "we'll fix this later" decisions that calcify into architecture. This article walks through each primitive in that layer, what it's actually doing under the hood, and how to implement it before production pressure forces your hand.

A quick note before we get into it: if you want the "why does this matter architecturally" context, I covered that in a companion piece - The Database Decision Your AI Stack Gets Wrong Before You Write a Line of Code. This article is the practical follow-up. Less argument, more code.


1. Semantic caching: stop paying for the same answer twice

What string-match caching gets wrong

Most teams start with exact-match caching: hash the prompt, store the response, return on match. It works for identical inputs. It's useless for anything else.

"What's the capital of France?" and "Which city is France's capital?" are the same question. A string-match cache fires a full LLM call for both. At scale, you're paying twice - in latency and in cost - for identical work.

Semantic caching stores the embedding of each query alongside the response. On every new request, you embed the incoming query and run a similarity search against your cached embeddings. If the match exceeds your threshold, you return the cached response. No LLM call needed.

The economics are hard to ignore. AWS benchmarked an 86% reduction in inference costs using this approach with Claude 3 Haiku. Cache hits are 9 to 65 times faster than a full LLM call. The embedding cost per query (~$0.00002 for text-embedding-3-small) is noise compared to the LLM call cost it replaces.

Implementation

The fastest path is LangChain's RedisSemanticCache, which hooks into any LangChain LLM call automatically:

from langchain_redis import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings
from langchain_core.globals import set_llm_cache

semantic_cache = RedisSemanticCache(
    embeddings=OpenAIEmbeddings(),
    redis_url="redis://localhost:6379",  # works with Valkey too
    distance_threshold=0.1,  # cosine distance - ~0.9 similarity
    ttl=3600
)
set_llm_cache(semantic_cache)

# All subsequent llm.invoke() calls now go through the cache automatically

If you want more control - custom embedding pipelines, per-query thresholds, inspection tooling - RedisVL gives you that. It works with Valkey out of the box via protocol compatibility:

from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llmcache",
    redis_url="redis://localhost:6379",  # pointed at a Valkey instance - protocol-compatible
    distance_threshold=0.1,
    ttl=3600
)

prompt = "What is the capital of France?"
result = cache.check(prompt=prompt)
if result:
    response = result[0]["response"]  # cache hit - no LLM call
else:
    response = call_llm(prompt)       # call_llm is your own LLM wrapper
    cache.store(prompt=prompt, response=response)

The threshold problem - and why you'll get it wrong the first time

The similarity threshold is the most important parameter nobody warns you about. In a production banking deployment, the default threshold of 0.7 produced a 99% false positive rate - virtually every cache hit was incorrect. And the uncomfortable finding: threshold tuning alone won't fix it. Achieving a sub-5% false positive rate requires domain-specific embedding models, a deliberate cache content strategy, and validation logic on top of threshold configuration.

Start at 0.90. Test against 100 representative queries from your actual domain. Measure false positive rate explicitly - not just hit rate. The goal isn't a high cache hit rate. It's a high correct cache hit rate.
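A minimal harness for that test might look like the following. Here `pairs` holds `(similarity, should_match)` tuples you build by embedding ~100 representative query pairs from your own domain and labeling them by hand; the function name and output shape are illustrative, not from any library:

```python
def cache_hit_metrics(pairs, threshold):
    """Evaluate a candidate similarity threshold against labeled pairs.

    pairs: list of (similarity, should_match) where should_match is the
    human judgment of whether the cached answer is actually correct.
    """
    tp = fp = fn = tn = 0
    for sim, should_match in pairs:
        hit = sim >= threshold
        if hit and should_match:
            tp += 1
        elif hit and not should_match:
            fp += 1  # the dangerous case: confidently wrong answer served
        elif not hit and should_match:
            fn += 1  # missed savings, but correctness is preserved
        else:
            tn += 1
    hits = tp + fp
    return {
        "hit_rate": hits / len(pairs),
        "false_positive_rate": fp / hits if hits else 0.0,
    }
```

Sweep the threshold over your labeled set and pick the highest value whose false positive rate is acceptable - optimizing hit rate alone will happily serve wrong answers.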

A few other gotchas worth knowing:

Pin your embedding model. Switching from text-embedding-ada-002 to text-embedding-3-small invalidates your entire cache - vectors from different models live in incompatible spaces. Treat the embedding model like a database schema: version it, migrate deliberately.
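One low-tech way to enforce that versioning - the naming scheme here is my own convention, not a RedisVL one - is to bake the embedding model into the cache namespace, so switching models starts a clean cache instead of searching incompatible vectors:

```python
EMBED_MODEL = "text-embedding-3-small"  # pin this like a schema version

def cache_name(model: str) -> str:
    # one cache namespace per embedding model; a model switch gets a
    # fresh namespace rather than mixing vectors from different spaces
    return f"llmcache:{model.replace('-', '_')}"

# cache = SemanticCache(name=cache_name(EMBED_MODEL), ...)
```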

TTL is a product decision. "Capital of France" can be cached for days. "Current promotions" needs 30 seconds. Time-sensitive queries need short TTLs or you'll serve stale answers confidently.
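In practice this tends to end up as a small per-category policy table. A sketch with illustrative numbers (the categories and values are assumptions - the actual table is the product decision), assuming your cache layer supports a per-entry TTL:

```python
# Hypothetical TTL policy by query category, in seconds
CACHE_TTL = {
    "static_fact": 7 * 24 * 3600,  # "capital of France" - safe for days
    "account_info": 300,           # user-specific, refresh within minutes
    "promotions": 30,              # time-sensitive - stale fast
}

def ttl_for(category: str, default: int = 3600) -> int:
    return CACHE_TTL.get(category, default)

# cache.store(prompt=prompt, response=response, ttl=ttl_for(category))
```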

Cache warming with TopK. An empty cache means 0% hit rate at cold start. Track your most frequent queries with Valkey's TopK data structure and pre-compute responses for the top 100:

# Track query frequency (TOPK.RESERVE takes k, width, depth, decay -
# 8/7/0.9 are the conventional defaults)
r.topk().reserve("popular_queries", 100, 8, 7, 0.9)
r.topk().add("popular_queries", incoming_query)

# Nightly warming job
popular = r.topk().list("popular_queries")
for query in popular:
    if not cache.check(prompt=query):
        cache.store(prompt=query, response=call_llm(query))

2. Session and context state: context is infrastructure

Why Postgres is the wrong default here

Every multi-turn conversation is a state management problem. The naive answer - a Postgres table with a session_id foreign key - works fine until it doesn't.

At 1,000 concurrent users each making 5-10 turns in an active session, you've created a synchronous read dependency on Postgres in your critical path, on every single message. The reads aren't complex. They don't need joins or transactions. They just need to be fast - and that's exactly what Postgres wasn't optimized for when your data is hot, ephemeral, and high-concurrency.

The right data structures

Different session patterns call for different primitives:

Lists - simple append-only chat history. Fast, minimal overhead.

r.rpush(f"chat:{session_id}", json.dumps(message))
messages = r.lrange(f"chat:{session_id}", 0, -1)

JSON (via Valkey-JSON) - messages with tool call results, metadata, arbitrary structure. The right default for anything non-trivial.

r.json().arrappend(f"session:{session_id}:messages", "$", message)

Hashes - session metadata alongside the conversation.

r.hset(f"session:{session_id}:meta", mapping={
    "model": "gpt-4o",
    "user_id": user_id,
    "token_count": token_count
})

Streams - event sourcing and audit logs. If you need a durable record of every tool call, every agent action, every state transition, Streams give you that without the overhead of a full database write on every event.
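As a sketch of the event-sourcing shape - the key pattern and field names here are illustrative, not a fixed schema; XADD just wants a flat string-to-string map:

```python
import json
import time

def audit_event(agent_id: str, action: str, payload: dict) -> dict:
    # serialize structured payloads; keep top-level fields flat for XADD
    return {
        "agent": agent_id,
        "action": action,
        "payload": json.dumps(payload),
        "ts": str(int(time.time() * 1000)),
    }

# Append on every tool call / state transition:
# r.xadd(f"audit:{agent_id}", audit_event(agent_id, "tool_call", {"tool": "search"}))
#
# Replay the full history later, oldest to newest:
# events = r.xrange(f"audit:{agent_id}", "-", "+")
```

Because entry IDs are monotonic timestamps, XRANGE gives you time-windowed replay for free.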

Sliding window with auto-trim

The pattern you'll need on every project - keep the last N messages, reset TTL on activity, expire dead sessions automatically:

pipe = r.pipeline()
pipe.rpush(f"chat:{session_id}", json.dumps(message))
pipe.ltrim(f"chat:{session_id}", -20, -1)   # keep last 20 messages
pipe.expire(f"chat:{session_id}", 7200)      # 2-hour sliding TTL
pipe.execute()

The pipeline() call is important - it batches all three operations into a single round trip. Don't skip it.

LangGraph + Valkey for agent workflows

For anything more complex than simple Q&A - multi-step agents, tool use, parallel branches, resumable workflows - LangGraph's Redis checkpointer is the right abstraction. It handles serialization, TTL, and crash recovery out of the box, and it works with Valkey given the protocol compatibility:

from langgraph.checkpoint.redis import RedisSaver

saver = RedisSaver.from_conn_string(
    "redis://localhost:6379",
    ttl={"default_ttl": 60}  # minutes
)
saver.setup()

graph = workflow.compile(checkpointer=saver)
config = {"configurable": {"thread_id": "user-123"}}

# State is automatically persisted and resumable across calls
result = graph.invoke(
    {"messages": [HumanMessage(content="Start the analysis")]},
    config=config
)

If the agent crashes mid-task, the next invocation picks up exactly where it left off. Without a checkpointer, that's a full restart. LlamaIndex has a RedisChatStore with similar capabilities, and AutoGen 0.4+ provides RedisMemory with vector search for semantic agent memory.

TTL strategy as a product decision

This is the one nobody thinks about until sessions are mysteriously expiring mid-task:

  • Interactive chat sessions: 30 min - 2 hours, sliding TTL (reset on every message)
  • Active agent workflows: hours to days, pin while running - use a keepalive pattern
  • Long-term user preferences: persistent, evict by LRU policy

# Keepalive pattern for long-running agent tasks
def keepalive(session_id: str, ttl: int = 86400):
    r.expire(f"session:{session_id}", ttl)

# Call this on every agent action to prevent mid-task expiry

3. KV cache offloading: the one only vLLM users need - but really need

Skip this section if you're calling OpenAI or Anthropic APIs. If you're running vLLM or SGLang in production, don't.

What the KV cache is and why it fills up

During autoregressive generation, each transformer layer generates Key and Value tensors for every token in the prompt. These tensors - the KV cache - are what allow the model to "remember" previous tokens without reprocessing them.

They're enormous. Llama 3.1 70B at 128K context generates roughly 40GB of KV tensors per request. Serve just 4 concurrent users at that context length and you need 160GB just for the cache - more than the model weights themselves.
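That 40GB figure is easy to sanity-check. Assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 KV values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, dtype_bytes=2):
    # 2x for the separate Key and Value tensors at every layer
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

per_token = kv_cache_bytes(80, 8, 128, 1)              # 327,680 bytes - ~320 KB/token
full_context = kv_cache_bytes(80, 8, 128, 128 * 1024)  # 40 GiB per request
```

~320 KB per token times 128K tokens is 40 GiB for a single request - which is exactly why the cache outgrows the model weights at even modest concurrency.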

When GPU memory fills up, vLLM evicts cached KV blocks. The next request that needs those blocks has to recompute them from scratch - paying the full forward pass cost again. NVIDIA reports this can be up to 14x slower than cache retrieval.

LMCache + Valkey: one config file

LMCache is an open-source KV cache management layer that stores evicted blocks externally - first in CPU DRAM, then in Valkey for persistence across instances. On cache hit, blocks are injected directly into the model, skipping the forward pass entirely for those chunks.

The configuration is almost insultingly simple:

# lmcache_config.yaml
chunk_size: 256
remote_url: "valkey://localhost:6379"
remote_serde: "naive"

from vllm import LLM
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",
)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_transfer_config=ktc)

That's it. Valkey now acts as the persistent KV cache layer across your inference fleet.

The numbers

Google Cloud benchmarked LMCache + Valkey on Llama-3.3-70B-Instruct across 8x H100 GPUs. At 100K token prompts: -79% time-to-first-token, +264% throughput. VAST Data measured TTFT dropping from over 11 seconds to 1.5 seconds at 128K context.

This isn't a marginal optimization. At long context lengths, it's the difference between a usable product and an unusable one.


4. Valkey Search: when your vector lookups need to be fast

The pgvector cliff

pgvector is the default answer for vector search, and it's a reasonable one - until you hit the cliff. When your HNSW index exceeds your Postgres working memory and starts paging to disk, latency doesn't degrade gracefully: benchmarks on vanilla pgvector show P95 latency spiking to multiple seconds under concurrent load. Modern pgvector with careful tuning improves this significantly, but the fundamental constraint remains - the index has to fit in memory, or you pay in disk I/O.

For inference hot-path vector lookups - semantic cache checks, real-time RAG retrieval, recommendation queries - "sometimes seconds" isn't a performance problem. It's a product problem.

Valkey Search (v1.0.0, released July 2025) runs entirely in memory. Typical query latency: 1-5ms. P99 stays under 10ms under concurrent load. No disk to hit, no buffer pool to miss. And with Valkey Search 1.2 (March 2026), the scope has expanded significantly beyond vectors - full-text search, tag filtering, numeric range queries, and aggregations are now all first-class citizens, combinable in a single query.

Creating an index and querying it

from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

schema = {
    "index": {"name": "docs", "prefix": "doc"},
    "fields": [
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "algorithm": "hnsw",
                "dims": 1536,
                "distance_metric": "cosine",
                "m": 16,
                "ef_construction": 200
            }
        },
        {"name": "category", "type": "tag"},
        {"name": "price", "type": "numeric"}
    ]
}

index = SearchIndex.from_dict(schema, redis_url="redis://localhost:6379")
index.create(overwrite=True)

# Basic vector query
query = VectorQuery(
    vector=embedding,
    vector_field_name="embedding",
    num_results=5
)
results = index.query(query)

Hybrid search - the real differentiator

The feature that matters most for production inference workloads is hybrid search - combining vector similarity with full-text, tag, and numeric filters in a single query. Valkey Search 1.2 makes this a first-class primitive. The query planner handles selectivity automatically, choosing between pre-filtering and inline-filtering without you having to think about it:

FT.SEARCH productIndex
  "noise cancelling earphones @category:{electronics} @price:[50 150] =>[KNN 5 @vector $query_vector]"
  PARAMS 2 query_vector "$encoded_vector"
  SORTBY __vector_score
  DIALECT 2

That single query combines full-text search over descriptions, a tag filter on category, a numeric range on price, and vector similarity on embeddings. No round-trips, no stitching results together in application code.

When to use which

Use Valkey Search for inference hot-path lookups: semantic cache vector checks, real-time RAG retrieval where every added millisecond degrades user experience, fraud detection, real-time recommendations.

Use pgvector when you have fewer than 1M vectors with relaxed latency requirements, when you're already in Postgres and want to avoid new infrastructure, or when you need to JOIN vector results with relational data. Both are legitimate tools. The mistake is using pgvector on the inference hot path because it was already there.


5. The primitives that complete the picture

Rate limiting: tracking tokens, not just requests

LLM APIs charge by tokens, not requests. A single user can send one request that costs as much as 500 normal ones. Application-layer rate limiting or a Postgres counter table under concurrent load adds latency and creates race conditions.

Valkey sliding window counters with Lua scripts give you atomic multi-dimensional rate limiting - prompt TPM, output TPM, per-user, per-org - in a single round trip:

sliding_window_script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local tokens = tonumber(ARGV[4])

-- drop entries that have fallen out of the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- sum the token counts encoded in the remaining members
local used = 0
for _, member in ipairs(redis.call('ZRANGE', key, 0, -1)) do
    used = used + (tonumber(string.match(member, ':(%d+)$')) or 0)
end

if used + tokens > limit then
    return 0  -- rate limited
end

-- member = unique id plus this request's token count
redis.call('ZADD', key, now, now .. '-' .. math.random() .. ':' .. tokens)
redis.call('EXPIRE', key, window)
return 1  -- allowed
"""

allowed = r.eval(
    sliding_window_script, 1,
    f"ratelimit:{user_id}",
    int(time.time()), 60, 100000, token_count
)

Bloom filters for deduplication

Check whether a prompt has already been processed without storing every prompt - Bloom filters give you membership testing with up to 98% memory savings versus a SET:

# Create filter: 0.1% false positive rate, 1M capacity
r.bf().reserve("seen_prompts", 0.001, 1_000_000)

prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()

# BF.ADD returns 1 only if the item was newly added - one atomic call
# instead of a racy exists-then-add pair
if r.bf().add("seen_prompts", prompt_hash):
    process_prompt(prompt)
# else: skip - already handled (with a 0.1% chance of a false skip)

Useful for deduplicating webhook-triggered inference jobs, filtering known-bad prompts before they hit the model, and avoiding redundant processing in event-driven pipelines.

Streams for token delivery and worker coordination

Distributing inference jobs across GPU workers without message loss or duplication:

# Producer: enqueue inference request
await r.xadd("inference_queue", {
    "prompt": prompt,
    "session_id": session_id,
    "model": model
})

# Worker: consume with consumer group - crash-safe
# (create the group once: XGROUP CREATE inference_queue gpu_workers 0 MKSTREAM)
msgs = await r.xreadgroup(
    "gpu_workers", "worker1",
    {"inference_queue": ">"}, count=1, block=5000
)

# Process, then acknowledge - message won't be redelivered
if msgs:
    stream, entries = msgs[0]
    entry_id, data = entries[0]
    await process_inference(data)
    await r.xack("inference_queue", "gpu_workers", entry_id)

The consumer group model means if a GPU worker crashes mid-inference, the message gets reassigned to another worker automatically via XAUTOCLAIM. No duplicate processing, no lost jobs.

Streams also solve the resumable token streaming problem elegantly - LLM generation writes tokens to a Stream as they're produced, clients read via XREAD with their last-seen ID. If the client disconnects and reconnects, they resume from exactly where they left off.
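A sketch of that resume pattern - `client` is any redis-py-compatible client (the async version is identical with awaits), and the key and field names are illustrative:

```python
def publish_token(client, stream_key, token):
    # producer side: each generated token becomes one stream entry
    return client.xadd(stream_key, {"token": token})

def read_tokens(client, stream_key, last_id="0", block_ms=5000):
    # consumer side: last_id="0" replays from the start; pass the last
    # entry ID the client saw to resume after a disconnect, or "$" to
    # tail only new tokens
    resp = client.xread({stream_key: last_id}, block=block_ms)
    tokens, new_last = [], last_id
    for _stream, entries in resp or []:
        for entry_id, fields in entries:
            tokens.append(fields["token"])
            new_last = entry_id
    return tokens, new_last
```

The client persists `new_last` (even just in browser state) and hands it back on reconnect - no server-side bookkeeping of who has seen what.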


How it all connects

Here's what the full inference pipeline looks like when you've wired these primitives together properly:

Request arrives → Bloom filter checks for duplicate → Rate limiter checks token budget → Semantic cache checks for similar prior response → On miss: Stream queues to GPU worker → LMCache + Valkey fetches KV blocks for fast TTFT → Tokens stream back via Stream + Pub/Sub → Session state persists conversation → TopK tracks query for cache warming

And with Valkey Search 1.2's new aggregations support, you can now compute cache hit rate distributions, query frequency by category, and latency percentiles directly on your Valkey Search indexes - without exporting data to a separate analytics layer. The monitoring layer is increasingly collapsing into the same infrastructure as the inference layer itself.

Each hop is handled by the primitive that's actually good at that job. None of it is Postgres's job. Not because Postgres is bad at what it does - it's exceptional at what it does - but because none of these operations benefit from transactions, joins, or durable disk storage. They all need memory-speed access under concurrent load.

The engineers who build this layer intentionally ship AI products that feel fast and cost-efficient from day one. The ones who bolt it on later spend weeks firefighting in production while their LLM bill climbs.


A note on tooling

At BetterDB, we're building Valkey-first monitoring and tooling for exactly this kind of infrastructure. We recently shipped support for working with vector search directly - including Valkey Search 1.2's expanded hybrid search and aggregation capabilities - with more AI-focused primitives on the roadmap. If you're running Valkey for AI workloads and want visibility into what's actually happening - cache hit rates, latency distributions, index health - that's what we're building.

For the architectural context behind why Valkey wins the inference infrastructure debate, the companion piece is here: Valkey's Moment: Why AI Inference Needs What Postgres Can't Deliver.