The cache that tunes itself: three autonomous optimization runs and what they found
A cache optimizer on a 6-hour Vercel Cron that issues proposals and either auto-approves them or queues them for human review in BetterDB Monitor: three runs against chat.betterdb.com, 15 tool calls down to 8, and what the repetition between runs 1 and 2 revealed about a problem no amount of TTL tuning can fix.

The optimizer reads cache health and similarity data from BetterDB Monitor, issues durable proposals with reasoning attached, and either auto-approves them or queues them for human review through BetterDB Monitor's cache optimization feature - the Monitor applies each approved change to Valkey directly, and a 30-second configRefresh loop in each cache library picks it up in the running app. No deploy. It runs every 6 hours via Vercel Cron. Three runs in, it converged: run 1 made fifteen tool calls and three configuration changes, run 2 made thirteen and re-proposed the same changes, run 3 made eight and stopped. The interesting thing is not that it converged. It is what the repetition between runs 1 and 2 told me about a problem the optimizer could not fix on its own.
I built chat.betterdb.com with two caches underneath - a semantic cache for full LLM responses and an agent cache wrapping every model call and every tool call in the RAG pipeline. I wrote about the architecture and the similarity measurements here. That post was about building the stack and measuring it. This post is about the tuning.
What is running
Two caches, both backed by the same Valkey instance.
playground_scache is a semantic cache built on @betterdb/semantic-cache@0.4.0. Every incoming chat query gets embedded with text-embedding-3-small, compared against stored embeddings via Valkey Search, and if the cosine distance is below the threshold the cached LLM response comes back immediately. No model call. The cache stores the full response alongside input and output token counts - on a hit it knows exactly how much the original call would have cost and reports that as savings.
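For orientation, the lookup path from the app side looks roughly like the sketch below. This is not the documented @betterdb/semantic-cache API - the constructor options and the lookup/store method names are my assumptions; the package name and embedding model are from above, and the 0.08 threshold shows up in the run data later.

// Sketch only - option and method names are assumed, not the library's documented API.
import { SemanticCache } from "@betterdb/semantic-cache";

const scache = new SemanticCache({
  name: "playground_scache",
  valkeyUrl: process.env.VALKEY_URL!,
  embeddingModel: "text-embedding-3-small",
  threshold: 0.08, // cosine distance; re-read from Valkey by configRefresh (described below)
});

// callModel is whatever produces the real LLM response on a miss.
export async function cachedAnswer(
  query: string,
  callModel: (q: string) => Promise<{ text: string; inputTokens: number; outputTokens: number }>
): Promise<string> {
  const hit = await scache.lookup(query);
  if (hit) return hit.response; // served from cache, no model call

  const fresh = await callModel(query);
  await scache.store(query, fresh.text, {
    inputTokens: fresh.inputTokens, // stored so a later hit can report exact savings
    outputTokens: fresh.outputTokens,
  });
  return fresh.text;
}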
playground is an agent cache built on @betterdb/agent-cache@0.6.0. This one is more layered than it might seem. It has two tiers.
The LLM tier wraps every model call via createAgentCacheMiddleware. On a cache miss, it stores the full LLM response with token counts. On a hit, it returns the cached text and reports zero tokens to the caller - the model was never invoked. Cost savings are computed automatically from the bundled LiteLLM price table against the stored token counts. Same mechanism as the semantic cache, but at the exact-match level: it hits when the same exact conversation - messages, model, temperature, tools - is seen again.
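createAgentCacheMiddleware is the real export; how it composes is my sketch. Assuming it follows the AI SDK's language-model middleware shape, wiring it in looks something like this - the option names are guesses:

// Sketch - createAgentCacheMiddleware is named above; composing it via the AI SDK's
// wrapLanguageModel, and the option names, are assumptions.
import { openai } from "@ai-sdk/openai";
import { generateText, wrapLanguageModel } from "ai";
import { createAgentCacheMiddleware } from "@betterdb/agent-cache";

const cachedModel = wrapLanguageModel({
  model: openai("gpt-4o-mini"), // illustrative model choice
  middleware: createAgentCacheMiddleware({
    name: "playground",
    valkeyUrl: process.env.VALKEY_URL!,
  }),
});

// Exact-match tier: a hit requires the identical messages, model, temperature, and tools.
// On a hit the cached text comes back and zero tokens are reported to the caller.
const { text } = await generateText({
  model: cachedModel,
  prompt: "What modules does Valkey support?",
});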
The tool tier caches the results of individual tool calls: documentation searches, command references, module lookups. Each call is hashed by its arguments and stored as an exact-match entry with a TTL. Cost per call is passed in at store time - embedding cost for search-based tools, zero for pure Valkey text lookups. The hit rate here depends entirely on how repetitive the tool arguments are across users.
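The exact-match keying is the detail that matters for the rest of this post. Roughly - and this is illustrative, not the library's internals:

// Illustrative only - not @betterdb/agent-cache internals. The point: the tool tier
// keys on a hash of the exact arguments, so only literally repeated calls can hit.
import { createHash } from "node:crypto";

function toolCacheKey(cache: string, tool: string, args: unknown): string {
  const hash = createHash("sha256").update(JSON.stringify(args)).digest("hex");
  return `${cache}:tool:${tool}:${hash}`;
}

// Deterministic lookups repeat their arguments exactly, so they hit:
toolCacheKey("playground", "get_module_info", { module: "search" });
toolCacheKey("playground", "get_module_info", { module: "search" }); // same key

// Natural-language queries rarely repeat verbatim, so they miss:
toolCacheKey("playground", "search_docs", { query: "XADD performance" });
toolCacheKey("playground", "search_docs", { query: "how fast is stream appending" }); // different key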
Both libraries shipped new versions last week. The important thing in both: configRefresh. Every 30 seconds, each cache re-reads its own configuration from Valkey. Threshold for the semantic cache. TTL policies per tool for the agent cache. When something writes a new value into Valkey, the running app picks it up within 30 seconds. No restart. No deploy. That is the primitive that makes autonomous tuning meaningful - there is a write path from the optimizer directly to the application.
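The primitive is simple enough to sketch. This is not the libraries' code - the key name playground:__tool_policies appears later in this post, but the storage shape and the client choice are assumed:

// Sketch of the configRefresh primitive - a 30-second poll of a config key in Valkey.
// Not the libraries' implementation; storage shape and client choice are assumed.
import Redis from "ioredis"; // ioredis speaks RESP, so it works against Valkey

const valkey = new Redis(process.env.VALKEY_URL!);
let toolPolicies: Record<string, { ttl: number }> = {};

setInterval(async () => {
  const raw = await valkey.get("playground:__tool_policies");
  if (raw) toolPolicies = JSON.parse(raw); // a new TTL takes effect within 30 seconds of the write
}, 30_000);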
The optimizer
It runs every 6 hours as a Vercel Cron job at /api/optimize. The route is a generateText call with GPT-4o-mini and a set of tools wrapping the BetterDB Monitor REST API.
The model can read cache health, threshold recommendations, similarity score distributions, and per-tool effectiveness data. It can write through a proposal workflow - propose a change, get a proposal ID, immediately call approve. The Monitor applies the change to Valkey. The cache picks it up on the next configRefresh tick.
The system prompt encodes the policy: only touch the semantic threshold if you have 100+ samples and the recommendation is actionable. Only touch tool TTLs if a tool has 20+ ops and is clearly underperforming. Check recent changes before proposing to avoid duplicates. The sample floors matter - a recommendation based on 30 samples is noise. The agent knows this and skips.
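Put together, the route looks roughly like the sketch below. The tool names mirror the ones in the runs that follow; the Monitor endpoints, the schemas, and the exact wording of the system prompt are reconstructions, not the shipped code.

// Sketch of /api/optimize - endpoints, schemas, and prompt wording are assumptions;
// the tool names and the policy they encode are the ones described above.
import { openai } from "@ai-sdk/openai";
import { generateText, tool } from "ai";
import { z } from "zod";

const monitor = (path: string, init?: RequestInit) =>
  fetch(`${process.env.MONITOR_URL}${path}`, init).then((r) => r.json());

export async function GET() {
  const { text, steps } = await generateText({
    model: openai("gpt-4o-mini"),
    system: [
      "You tune cache configuration through proposals.",
      "Only adjust the semantic threshold with 100+ samples and an actionable recommendation.",
      "Only adjust a tool TTL if the tool has 20+ ops and is clearly underperforming. Minimum TTL: 60 seconds.",
      "Check recent changes first and skip anything with a pending proposal.",
    ].join("\n"),
    prompt: "Review cache health and propose any configuration changes the data justifies.",
    tools: {
      list_caches: tool({
        description: "List caches with hit rates and op counts",
        parameters: z.object({}),
        execute: () => monitor("/caches"),
      }),
      cache_health: tool({
        description: "Health, threshold recommendation, and tool effectiveness for one cache",
        parameters: z.object({ cache: z.string() }),
        execute: ({ cache }) => monitor(`/caches/${cache}/health`),
      }),
      recent_changes: tool({
        description: "Recent proposals and their status for one cache",
        parameters: z.object({ cache: z.string() }),
        execute: ({ cache }) => monitor(`/caches/${cache}/changes`),
      }),
      propose_tool_ttl_adjust: tool({
        description: "Propose a new TTL for a tool; returns a proposal id",
        parameters: z.object({ cache: z.string(), tool: z.string(), ttl: z.number().min(60) }),
        execute: (args) =>
          monitor("/proposals", {
            method: "POST",
            headers: { "content-type": "application/json" },
            body: JSON.stringify(args),
          }),
      }),
      approve_proposal: tool({
        description: "Approve a proposal; the Monitor applies it to Valkey",
        parameters: z.object({ id: z.string() }),
        execute: ({ id }) => monitor(`/proposals/${id}/approve`, { method: "POST" }),
      }),
    },
    maxSteps: 20,
  });
  return Response.json({ summary: text, toolCalls: steps.flatMap((s) => s.toolCalls).length });
}

The 6-hour cadence itself is a one-line cron entry in vercel.json pointing at this path - something like {"crons": [{"path": "/api/optimize", "schedule": "0 */6 * * *"}]}.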
Run 1: what the first pass found
26 seconds. 15 tool calls.
list_caches →
playground (agent_cache, 26.4% hit rate, 398 ops)
playground_scache (semantic_cache, 72.8% hit rate, 534 ops)
Two caches live. Very different hit rates.
Semantic cache: optimal, leave it alone
cache_health (playground_scache):
hit_rate: 72.8%
uncertain_hit_rate: 12.1%
cost_saved: $0.031
threshold_recommendation:
sample_count: 306
current_threshold: 0.08 cosine distance
recommendation: "optimal"
reasoning: "Hit rate is 75.5% with 12.1% uncertain hits -
threshold appears well-calibrated."
72.8% hit rate. $0.031 saved in avoided LLM calls across 534 operations - small in absolute terms, but those operations came from roughly 10-20 chat sessions. Per active session the savings scale linearly, and at any real traffic level the cache pays for the model calls underneath the optimizer itself several times over. 12.1% of hits in the uncertainty band - close to the threshold, served from cache, but borderline. The algorithm triggers a tighten recommendation above 20% uncertain hits. We are at 12.1%. The model left it alone.
Correct. I would have made the same call.
The 12.1% is the number to watch. It is not alarming today. If it climbs toward 20% as traffic patterns shift, the threshold is probably too loose. The agent will catch it before I do.
Agent cache: the tool tier problem
cache_health (playground):
hit_rate: 26.4%
cost_saved: tracked in microdollars - rounds to $0.00 at current volume
tool_effectiveness:
get_module_info: 50% hit rate → optimal
get_command_reference: 36% hit rate → optimal
search_docs: 5% hit rate → decrease_ttl_or_disable
get_betterdb_info: 4% hit rate → decrease_ttl_or_disable
compare_commands: 3% hit rate → decrease_ttl_or_disable
26.4% aggregate hit rate looks bad. The per-tool breakdown shows that it is not uniformly bad - it is specifically bad in three tools.
get_module_info at 50% and get_command_reference at 36% are working. These are deterministic lookups. "What modules does Valkey support?" has a stable answer. Users and agents hitting the same module or command twice get a cache hit. The exact-match tier is the right place for these.
The three at 5% and below are a different shape of problem. search_docs takes a natural-language query. Two users asking about "XADD performance" and "how fast is stream appending" produce different exact strings. The exact-match tool tier never fires. The semantic cache is already handling query-level deduplication at 72.8%. search_docs is not just underperforming in the tool tier - it is in the wrong tier for that type of query.
The agent read this correctly and proposed reducing the TTL on all three to 60 seconds.
propose_tool_ttl_adjust (search_docs, 60s) → proposal 05c2ab89 → applied ✓
propose_tool_ttl_adjust (get_betterdb_info, 60s) → proposal d036f118 → applied ✓
propose_tool_ttl_adjust (compare_commands, 60s) → proposal 0c0eb23e → applied ✓
The Monitor wrote {"ttl": 60} for each tool into playground:__tool_policies in Valkey. The running app's configRefresh loop picked it up within 30 seconds. No deploy. No restart. No manual step.
Run 2: the scheduled cron fires at midnight
28 seconds. 13 tool calls. This time via method=GET from the Vercel Cron scheduler.
list_caches →
playground (agent_cache, 26.2% hit rate, 404 ops)
playground_scache (semantic_cache, 72.9% hit rate, 537 ops)
Nearly identical numbers 6 hours later. The semantic cache is steady. The agent cache hit rate has barely moved.
Semantic cache: still optimal
threshold_recommendation:
sample_count: 307 (one more turn since run 1)
hit_rate: 75.2%
uncertain_hit_rate: 11.7%
recommendation: "optimal"
Consistent. The model skipped it again. Right call again.
Agent cache: the same three tools
The agent called recent_changes on the playground cache. It found the proposals from run 1 - status applied. The duplicate check only blocks pending proposals, not applied ones. The previous changes had already gone through, so the check passed.
The tool effectiveness numbers were unchanged. search_docs still at 5%. get_betterdb_info still underperforming. compare_commands still near zero.
propose_tool_ttl_adjust (search_docs, 60s) → proposal 4176e002 → applied ✓
propose_tool_ttl_adjust (get_betterdb_info, 60s) → proposal 56e4cfc3 → applied ✓
propose_tool_ttl_adjust (compare_commands, 60s) → proposal a84d7d3a → applied ✓
It proposed the same changes again. At 60 seconds. Which was already the TTL. Applied successfully. Nothing actually changed in Valkey - the policies were already 60s from run 1.
This is not a bug. It is an important signal about the shape of the problem.
Run 3: the optimizer stops repeating itself
22 seconds. 8 tool calls. Five hours after run 2.
list_caches →
playground (agent_cache, 26.2% hit rate, 404 ops - identical to run 2)
playground_scache (semantic_cache, 73.1% hit rate, 537 ops)
The semantic cache threshold check came back optimal again. 296 samples in the similarity window - slightly fewer than run 2's 307, because the window trims entries older than 7 days and a handful rolled off. Hit rate in the window: 76.4%. Uncertain hits: 11.5%, down from 11.7%. Moving in the right direction. Skipped.
The agent cache ran the same analysis. Tool effectiveness unchanged. Then it called recent_changes three times in parallel - once per underperforming tool.
This time it stopped.
recent_changes → proposals from run 2 found: a84d7d3a (compare_commands, applied)
56e4cfc3 (get_betterdb_info, applied)
4176e002 (search_docs, applied)
[no new proposals]
run complete - toolCalls=8 duration=22273ms
The model's summary: "Previously proposals adjusted TTL to 60 seconds" - it saw the recent history, recognized the changes had just been made, and decided not to repeat them.
Run 2 had seen run 1's proposals and re-proposed anyway. Run 3 saw run 2's proposals - more recent, clearly just applied - and stopped. There is no code-level mechanism driving the difference: the system prompt only says to skip pending proposals, not applied ones, so runs 2 and 3 received the same kind of data and the same instruction. Run 3's model instance weighed the recency of the just-applied changes differently and held off. The optimizer converged.
Three runs: 15 tool calls → 13 → 8. The system is now stable.
What the repetition tells you
The agent made the same proposals in run 2 because the hit rate had not moved. And the hit rate had not moved because TTL is not the problem.
search_docs has a 5% hit rate because users rephrase queries differently every time. Reducing the TTL from 86400 seconds to 60 seconds made entries expire faster. It did not change the fact that two semantically identical queries produce different exact strings and the exact-match tier misses. That problem does not respond to TTL tuning. It responds to using the right cache tier.
The optimizer is doing what it was told. The constraints say: if a tool has a low hit rate, reduce its TTL. So it reduces the TTL. Then on the next run the hit rate is still low, so it reduces the TTL again - to the same value. This will keep happening on every run until the floor is removed or the structural issue is fixed.
The fix is not a better optimizer. It is routing search_docs through the semantic cache tier, which already handles query-level similarity matching at 72.8%. The tool cache should not be in the picture for natural-language queries.
The agent cannot propose that with the current tool set. It can only touch TTLs, not routing. That is a deliberate constraint - routing changes require code changes, not config changes. But the repeated proposals are a clear signal that the data is telling you something the optimizer cannot act on.
What the agent got right, and where the limit is
Everything the agent touched was defensible. The semantic cache was left alone across all three runs. The deterministic tools with reasonable hit rates were left alone. The high-variance tools had their TTLs reduced to something sensible.
The conservative bias toward "reduce TTL" rather than "disable" reflects the floor in the system prompt - 60 seconds minimum. I would have crossed that line on search_docs and disabled the tool cache entry entirely. The agent cannot. That is on me.
The deeper limit: the optimizer acts on what is measurable and writable. It can see hit rates, similarity scores, cost in microdollars, TTL policies. It can write threshold values and TTL policies. It cannot see that search_docs is architecturally mismatched with its cache tier, and it cannot write a routing change. Autonomous optimization works within the space of tunable parameters. The useful decisions outside that space still require a human who reads the pattern in the data.
The repetition across runs is the optimizer doing its job. It is also the optimizer telling you something it cannot fix itself.
What humans still own
Manual tuning is reactive by construction. You notice a problem after the fact. You check the dashboard. You push a fix. You wait to see if it helped.
That loop runs at human speed. Your cache configuration is always trailing your traffic.
An autonomous loop that runs every 6 hours and acts on current data closes that gap - for the class of problems it can act on. The agent is not smarter than you. It is faster at closing the feedback loop on the decisions that are already well-defined. The semantic threshold drifting as query patterns shift. A tool's TTL needing a bump because usage patterns changed. These are decisions that should not require you to notice and remember to check.
The decisions that require understanding - why a tool is mismatched to its tier, whether a structural routing change is the right fix - those still land on you. The optimizer surfaces the signal. You decide what it means.
What is next
Route search_docs through the semantic cache. The exact-match tool tier will never produce useful hits for natural-language queries. The semantic cache already handles this at ~73%. The fix is a few lines in the tool definition. Three runs of the optimizer pointing at the same tool is the data telling me to do it.
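A hedged sketch of the shape - whether it reuses the existing playground_scache instance or a semantic cache namespaced to the tool is a design choice, and the lookup/store names are still the assumptions from the first sketch:

// Sketch of the routing fix - semantic matching on the tool's query instead of the
// exact-match tool tier. scache is the SemanticCache instance from the first sketch;
// lookup/store and the zero-token store options remain assumptions.
async function searchDocsRouted(
  query: string,
  searchDocs: (q: string) => Promise<string> // the real documentation search
): Promise<string> {
  const hit = await scache.lookup(query); // "XADD performance" and "how fast is stream appending" can now land on the same entry
  if (hit) return hit.response;

  const results = await searchDocs(query);
  await scache.store(query, results, { inputTokens: 0, outputTokens: 0 }); // no LLM tokens to account for
  return results;
}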
How much agency to give it. Right now the optimizer can touch thresholds and TTLs. The next obvious expansion is letting it disable a tool cache entry entirely, not just floor its TTL at 60 seconds - that would have given it the correct call on search_docs instead of the next-best one.
Past that, the question gets less obvious. Should the optimizer be able to change cache routing? Promote a tool from exact-match to semantic? Adjust per-category thresholds independently? Each step expands what it can fix without you, and each step expands what it can break without you. The right answer is not maximum agency or minimum agency. It is a moving line that depends on how much you trust the proposal-and-approve workflow, how reversible each class of change is, and how much of your day you want to spend reading optimizer summaries instead of writing code.
The version of this loop that ships with BetterDB has the boundary set conservatively: tunable parameters, yes; structural changes, no. That is a defensible default, not a permanent one.
Keep watching uncertain_hit_rate on playground_scache. It is at 11.5% and falling - 12.1% in run 1, 11.7% in run 2, 11.5% in run 3 - moving away from the 20% tighten threshold, not toward it. The open question is whether that holds as traffic grows.
The chat app is at chat.betterdb.com. The libraries are @betterdb/agent-cache and @betterdb/semantic-cache on npm, and betterdb-agent-cache and betterdb-semantic-cache on PyPI, all MIT licensed. The optimizer source is in the chat repo on GitHub.