Self-tuning Valkey for AI agents
Exact-match and semantic caching for AI agents, backed by Valkey. Works with the OpenAI and Anthropic SDKs directly, plus adapters for LangChain, LangGraph, LlamaIndex, and the Vercel AI SDK (TypeScript only).
npm install @betterdb/agent-cache iovalkey
TypeScript. LLM, tool, and session tiers.
pip install betterdb-agent-cache
Python. Same three tiers, same adapters.
npm install @betterdb/semantic-cache iovalkey
TypeScript. Similarity-based caching with valkey-search.
pip install betterdb-semantic-cache
Python. Full feature parity.
Three cache tiers behind one connection
LLM Response Cache
Cache LLM responses by exact match on model, messages, temperature, and tools. Handles text, images, audio, and file content natively, and caches tool_use and tool_result blocks the same way. Second call returns from Valkey in under 1ms. Cost tracking per model built in.
{prefix}:llm:{sha256}
Tool Result Cache
Cache tool/function call results by tool name and argument hash. Per-tool TTL policies. Invalidate by tool or by specific arguments.
{prefix}:tool:{name}:{sha256}
Session State
Key-value storage for agent session state with sliding window TTL. Individual field expiry. LangGraph checkpoint support on vanilla Valkey - no RedisJSON, no RediSearch.
{prefix}:session:{thread}:{field}
Pluggable binary normalizer. Images, audio, and file content in multi-modal requests are included in the cache key by default. For image-heavy workloads, swap in a custom BinaryNormalizer to store blobs externally (S3, object storage) and cache by reference instead of by content - so Valkey memory stays bounded even as your multi-modal traffic grows.
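A sketch of what that swap can look like. The binaryNormalizer option name and the normalize() signature below are illustrative assumptions, not the documented interface; the point is that the cache key ends up containing a stable reference rather than the raw bytes:

import Valkey from 'iovalkey';
import { createHash } from 'node:crypto';
import { AgentCache } from '@betterdb/agent-cache';

// Hypothetical normalizer: hash the blob, park the bytes in object storage,
// and hand back a reference string that stands in for the content when the
// cache key is computed. Valkey only ever sees the short reference.
const s3Normalizer = {
  async normalize(blob: { mediaType: string; data: Buffer }): Promise<string> {
    const digest = createHash('sha256').update(blob.data).digest('hex');
    // await s3.send(new PutObjectCommand({ Bucket: 'agent-blobs', Key: digest, Body: blob.data }));
    return `s3://agent-blobs/${digest}`;
  },
};

const cache = new AgentCache({
  client: new Valkey({ host: 'localhost', port: 6379 }),
  binaryNormalizer: s3Normalizer, // option name assumed for illustration
});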
Quick start
Up and running in under five minutes. No modules required.
import Valkey from 'iovalkey';
import { AgentCache } from '@betterdb/agent-cache';
const client = new Valkey({ host: 'localhost', port: 6379 });
const cache = new AgentCache({
  client,
  tierDefaults: {
    llm: { ttl: 3600 },
    tool: { ttl: 300 },
    session: { ttl: 1800 },
  },
});
// LLM response caching
const params = {
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is Valkey?' }],
  temperature: 0,
};

const result = await cache.llm.check(params);
if (!result.hit) {
  const response = await callLlm(params);
  await cache.llm.store(params, response);
}
// Tool result caching
const weather = await cache.tool.check('get_weather', { city: 'Sofia' });
if (!weather.hit) {
  const data = await getWeather({ city: 'Sofia' });
  await cache.tool.store('get_weather', { city: 'Sofia' }, JSON.stringify(data));
}
// Session state
await cache.session.set('thread-1', 'last_intent', 'book_flight');
const intent = await cache.session.get('thread-1', 'last_intent');

Works on vanilla Valkey 7+, ElastiCache, Memorystore, MemoryDB, and any Redis-compatible endpoint.
Drop-in framework adapters
Works with the tools you already use. No framework lock-in.
import OpenAI from 'openai';
import { hashOpenAIRequest } from '@betterdb/agent-cache/openai';
const openai = new OpenAI();
const params = {
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is Valkey?' }],
};
const cached = await cache.llm.check(hashOpenAIRequest(params));
if (cached.hit) return cached.response;
const response = await openai.chat.completions.create(params);
await cache.llm.store(hashOpenAIRequest(params), response.choices[0].message.content);

See what caching saves you
Built-in cost tracking shows exactly how much you're saving per model and per tool.
const stats = await cache.stats();
// {
// llm: { hits: 150, misses: 50, hitRate: 0.75 },
// tool: { hits: 300, misses: 100, hitRate: 0.75 },
// session: { reads: 1000, writes: 500 },
// costSavedMicros: 12500000, // $12.50 saved
// perTool: {
// get_weather: { hits: 200, misses: 50, hitRate: 0.8 },
// }
// }

75%
LLM hit rate in this example
$12.50
Saved from 150 cache hits at gpt-4o pricing
<1ms
Cache hit latency vs seconds for a full LLM call
agent-cache also includes toolEffectiveness() which ranks your cached tools by hit rate and recommends TTL adjustments - increase, optimal, or decrease/disable - so caching stays efficient as your workload evolves.
You can also see the benefits live — we use this caching in our own BetterDB Chat.
TTL policies and self-optimization
Hit rate drives TTL. No manual tuning required.
const effectiveness = await cache.toolEffectiveness();
// [
// { tool: 'get_weather', hitRate: 0.85, costSaved: 5.00,
// recommendation: 'increase_ttl' },
// { tool: 'search', hitRate: 0.6, costSaved: 2.50,
// recommendation: 'optimal' },
// { tool: 'rare_api', hitRate: 0.1, costSaved: 0.10,
// recommendation: 'decrease_ttl_or_disable' },
// ]

| Recommendation | Criteria |
|---|---|
| increase_ttl | Hit rate > 80% and current TTL < 1 hour |
| optimal | Hit rate 40-80% |
| decrease_ttl_or_disable | Hit rate < 40% |
TTL follows a clear precedence: per-call TTL overrides per-tool policy, which overrides tier default, which overrides global default. When toolEffectiveness() recommends increase_ttl, apply it with cache.tool.setPolicy('get_weather', { ttl: 3600 }) - the policy persists to Valkey and takes effect immediately without restarting your application.
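A quick illustration of that precedence chain. setPolicy() is shown exactly as above; the per-call options argument to store() is an assumption, so treat its shape as illustrative:

// Tier default: 300s for every tool (from tierDefaults.tool.ttl in the quick start).

// Per-tool policy: overrides the tier default for get_weather only.
// Persisted to Valkey, picked up without a restart.
await cache.tool.setPolicy('get_weather', { ttl: 3600 });

// Per-call TTL: highest precedence, applies to this single entry.
// NOTE: the trailing options argument is illustrative, not the documented signature.
await cache.tool.store('get_weather', { city: 'Sofia' }, JSON.stringify({ tempC: 21 }), { ttl: 60 });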
Agent-driven cache optimization
The cache is one more system the agent should be able to operate, not just call. An agent reads live cache state via MCP, proposes config changes with reasoning, and a human approves them - changes go live within seconds, no restart.
What the agent sees
- cache_list - List all caches visible to the agent with basic metadata
- cache_health - Hit rate, miss rate, latency, and key counts for a named cache
- cache_threshold_recommendation - Recommended threshold adjustment based on the rolling similarity distribution
- cache_tool_effectiveness - Per-tool hit rates, estimated cost savings, and TTL recommendations
- cache_similarity_distribution - Rolling histogram of similarity scores for semantic caches
- cache_recent_changes - Audit trail of recent config changes and their observed outcomes

What the agent can propose
- cache_propose_threshold_adjust - Propose a new similarity threshold with machine-generated reasoning
- cache_propose_tool_ttl_adjust - Propose a TTL change for a specific tool based on hit rate data
- cache_propose_invalidate - Propose targeted invalidation for a cache namespace or key pattern

What stays in human hands (for now)
The approval gate is the v1 safety mechanism - no proposal takes effect without a human action. Autonomous mode (agent proposes and applies without approval) is the next step on the roadmap.
- cache_list_pending_proposals - View all proposals currently awaiting approval
- cache_get_proposal - Retrieve details and reasoning for a specific proposal
- cache_approve_proposal - Approve a proposal - change is dispatched to Valkey within seconds
- cache_reject_proposal - Reject a proposal with optional feedback for the agent
- cache_edit_and_approve_proposal - Edit a proposal's parameters and approve in a single action

Full working examples for both packages: semantic-cache example and agent-cache example. Cache intelligence requires the Feature.CACHE_INTELLIGENCE entitlement, which is part of the Pro tier.
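A minimal sketch of the approval side using the official MCP TypeScript SDK. The server launch command and the tool argument names are assumptions for illustration; the example repos linked above show the real wiring:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const reviewer = new Client({ name: 'cache-reviewer', version: '1.0.0' });
// Hypothetical launch command - substitute however you run the BetterDB MCP server.
await reviewer.connect(new StdioClientTransport({ command: 'npx', args: ['betterdb-mcp-server'] }));

// See what the agent has queued up.
const pending = await reviewer.callTool({ name: 'cache_list_pending_proposals', arguments: {} });
console.log(pending);

// After reading the proposal's reasoning, approve it; the change reaches Valkey within seconds.
// The `proposalId` argument name is an assumption.
await reviewer.callTool({
  name: 'cache_approve_proposal',
  arguments: { proposalId: 'prop_123' },
});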
Why teams choose BetterDB for agent caching
Three cache tiers behind one Valkey connection. No modules required.
| Capability | @betterdb/agent-cache | LangChain RedisCache | LangGraph checkpoint-redis | AutoGen RedisStore | LiteLLM Redis | Upstash + Vercel AI SDK |
|---|---|---|---|---|---|---|
| Agent-tunable via MCP | ||||||
| Live config updates (no restart) | ||||||
| Multi-tier (LLM + Tool + State) | | LLM only | State only | LLM only | LLM only | LLM only |
| Built-in OTel + Prometheus | Partial | |||||
| No modules required | Redis 8 + modules | Upstash only | ||||
| Base SDK support (OpenAI, Anthropic) | ||||||
| Multi-modal (images, audio, files) | ||||||
| Language support | TypeScript + Python | TS only | TS only | Python only | Python only | TS only |
| Framework adapters | OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, Vercel | LC only | LG only | AutoGen only | LiteLLM only | AI SDK only |
| Zero-config cost tracking | Bundled LiteLLM table, 1,900+ models | | | | | |
Full observability out of the box
Every cache operation emits an OpenTelemetry span and updates Prometheus metrics. Zero additional instrumentation.
OpenTelemetry spans
- agent_cache.llm.check
- agent_cache.llm.store
- agent_cache.tool.check
- agent_cache.tool.store
- agent_cache.session.get
- agent_cache.session.set
- agent_cache.session.destroyThread
Prometheus metrics
- agent_cache_requests_total - Total cache requests (hit/miss by tier)
- agent_cache_operation_duration_seconds - Operation latency histogram
- agent_cache_cost_saved_total - Estimated cost saved in dollars
- agent_cache_stored_bytes_total - Total bytes stored
- agent_cache_active_sessions - Approximate active session count
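The span names above suggest the library emits through the standard OpenTelemetry API, so a regular Node SDK setup should be enough to get them exported. A minimal sketch, assuming an OTLP collector on the default local endpoint:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Register a global tracer provider once at startup; the agent_cache.* spans
// are then exported alongside the rest of your application's traces.
const sdk = new NodeSDK({
  serviceName: 'my-agent',
  traceExporter: new OTLPTraceExporter(), // defaults to http://localhost:4318
});
sdk.start();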
Cost tracking out of the box
Most caching libraries make you maintain your own pricing table. We ship one.
1,900+ models, zero config
Bundled pricing table sourced from LiteLLM's model_prices_and_context_window.json, refreshed on every release. GPT-4o, Claude, Gemini, and everything else LiteLLM tracks.
Override what you need
Pass a costTable to override pricing for specific models. Your entries merge on top of the defaults. Other models keep working.
Turn it off if you want to
Set useDefaultCostTable: false (TypeScript) or use_default_cost_table=False (Python) to bring your own table. Same behaviour as before 0.4.0.
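Putting both knobs together. The costTable and useDefaultCostTable option names are from the docs above; the per-entry field names in the override are an assumption, so check the package reference for the exact schema:

import Valkey from 'iovalkey';
import { AgentCache } from '@betterdb/agent-cache';

const cache = new AgentCache({
  client: new Valkey({ host: 'localhost', port: 6379 }),
  // Merge on top of the bundled table: only 'my-finetune' is overridden here,
  // every other model keeps its LiteLLM-derived pricing.
  // NOTE: the per-entry field names are illustrative, not the documented schema.
  costTable: {
    'my-finetune': { inputCostPerToken: 0.000005, outputCostPerToken: 0.000015 },
  },
  // Or opt out of the bundled table entirely and rely only on your own entries:
  // useDefaultCostTable: false,
});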
See your cache working in the BetterDB monitor
Cache hit rate, similarity latency, and index health show up natively in BetterDB Monitor's Vector / AI tab. @betterdb/semantic-cache uses FT.SEARCH under the hood, so the monitor sees it automatically. One instance, no extra wiring.
Semantic caching for similar queries
“What is the capital of France?” and “Capital city of France?” are the same question. Semantic caching catches what exact-match misses.
Valkey-native
Handles valkey-search API differences explicitly. Works on ElastiCache, Memorystore, or self-hosted. Not a Redis port. Visualized in BetterDB Monitor's Vector / AI tab.
7 framework adapters
OpenAI, OpenAI Responses, Anthropic, LangChain, LlamaIndex, LangGraph, and Vercel AI SDK — no framework lock-in for direct SDK use.
Full observability
Every check() and store() emits OTel spans and Prometheus metrics. Hit rate, similarity scores, latency - zero extra instrumentation.
Cost tracking, zero config
Bundled LiteLLM price table, 1,900+ models. Store token counts at cache time and get exact dollars saved on every hit — including cumulative stats via cache.stats().
TypeScript + Python
Full parity. Same adapters, same API shape, same features in both languages. Install with npm or pip.
Auto-tuning thresholds
thresholdEffectiveness() analyzes the rolling similarity score window and returns a tighten/loosen/optimal recommendation. With MCP-driven cache intelligence, an agent reads this recommendation and proposes a threshold adjustment - a human approves in BetterDB Monitor, and the library picks it up within seconds. See the closed-loop example.
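For the non-MCP path you can call it directly on a SemanticCache instance (like the one in the example further down). Only the tighten/loosen/optimal recommendation is described above; the other fields in the comment are illustrative assumptions:

const report = await cache.thresholdEffectiveness();
// Illustrative shape:
// { recommendation: 'tighten', currentThreshold: 0.15, suggestedThreshold: 0.12, samples: 500 }

if (report.recommendation !== 'optimal') {
  // Apply the suggestion yourself, or leave it to the MCP proposal flow
  // so a human signs off in BetterDB Monitor first.
}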
No other semantic cache library checks every box.
| Capability | @betterdb/semantic-cache | RedisVL SemanticCache | LangChain RedisSemanticCache | LiteLLM redis-semantic | Upstash semantic-cache | Redis LangCache |
|---|---|---|---|---|---|---|
| Agent-tunable via MCP | ||||||
| Live config updates (no restart) | ||||||
| Valkey-native | | Redis only | Redis only | Redis only | Upstash only | Redis Cloud only |
| Standalone | | | Requires LangChain | Requires LiteLLM | | Managed only |
| Built-in OTel + Prometheus | Partial | Dashboard only | ||||
| TypeScript + Python | | Python only | Requires LangChain | Python only | JS/TS only | Managed only |
| Cost tracking (bundled) | | | | Via LiteLLM only | | |
import Valkey from 'iovalkey';
import { SemanticCache } from '@betterdb/semantic-cache';
import { createOpenAIEmbed } from '@betterdb/semantic-cache/embed/openai';
const cache = new SemanticCache({
  client: new Valkey({ host: 'localhost', port: 6399 }),
  embedFn: createOpenAIEmbed(), // or Voyage, Cohere, Bedrock, Ollama
  defaultThreshold: 0.15, // catch paraphrases with high confidence
});
await cache.initialize();

await cache.store('What is the capital of France?', 'Paris', {
  model: 'gpt-4o', inputTokens: 20, outputTokens: 5,
});
const result = await cache.check('Capital city of France?');
// result.hit === true
// result.confidence === 'high'
// result.costSaved === 0.000105

Five embedding helpers included: createOpenAIEmbed, createVoyageEmbed, createCohereEmbed, createBedrockEmbed, createOllamaEmbed. Requires valkey-search (Valkey 8+ or via modules). For environments without search modules, use @betterdb/agent-cache for exact-match caching.
Known limitations
These apply to @betterdb/agent-cache and betterdb-agent-cache.
Streaming responses are not cached by the Vercel AI SDK adapter. Accumulate the full response before caching.
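The same advice applies when streaming from the OpenAI SDK directly: buffer the deltas, then store the finished text. A sketch reusing the cache and hashOpenAIRequest helpers from the adapter example above (error handling omitted):

const key = hashOpenAIRequest(params);
const cached = await cache.llm.check(key);

if (!cached.hit) {
  const stream = await openai.chat.completions.create({ ...params, stream: true });
  let full = '';
  for await (const chunk of stream) {
    full += chunk.choices[0]?.delta?.content ?? ''; // accumulate streamed deltas
  }
  await cache.llm.store(key, full); // cache only the complete response
}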
LangChain streaming is similarly not cached. The BetterDBLlmCache adapter caches complete generation results including token counts. If your LangChain model uses .stream() instead of .invoke(), responses bypass the cache. Use .invoke() for cacheable calls.
LangGraph list() loads all checkpoint data for a thread into memory before filtering. Fine for typical agent deployments. For threads with thousands of large checkpoints, consider langgraph-checkpoint-redis with Redis 8+.
Ready to get started?
Start monitoring in minutes - no infrastructure to maintain. Team collaboration, agent-based monitoring for private databases, and more. Or self-host - open source core, zero lock-in.