A week benchmarking LLM caches against RedisVL and Upstash. Here is what held up.

Over the past week we benchmarked our semantic cache against the two obvious peers, RedisVL (Redis Inc.'s open-source Python library) and Upstash's semantic cache, and we stress-tested our own threshold auto-tuning loop. Four public datasets, three writeups, and a fair amount of being wrong before we were right. This post is the summary. Each finding links to the deep-dive with the full tables and methodology.

We are publishing the parts where we are only at parity, not just the parts where we win. There was no honest cross-library comparison of semantic caches anywhere we could find, so we made one.

The short version

Quality is a tie, and that is the point. Within roughly one percentage point of F1 against both RedisVL and Upstash across every dataset. Cache quality is bounded by the embedding model, not the library, so parity is the ceiling. Reaching it is the price of admission.
The similarity threshold is not portable. The same embedding model produced different score distributions on different runtimes, so the optimal threshold was different on each. A number you copy from a tutorial or a vendor default is very likely wrong for your setup.
Auto-tuning earns its keep, but "do no harm" is the hard part. Up to +2.8% F1 where the starting threshold was bad. The real engineering was the safety logic that stops it from tuning itself off a cliff.
Latency is mostly about where you run the cache. We are about 7x faster than RedisVL on repeated queries on the same engine, thanks to a built-in embedding cache. The 48 to 136x against Upstash is local Valkey vs a cloud REST API, a deployment difference, and we say so.
Adversarial paraphrases are a wall for everyone. Every cosine-distance cache plateaus around 61% F1 on PAWS. No threshold recovers signal the embedding does not capture.

1. Quality is at parity, and that is the result to want

Fix the embedding model and every honest semantic cache is doing the same thing: embed the prompt, measure cosine distance against stored prompts, and return a hit when the nearest distance falls below a threshold. So peak F1 converges. We landed within about one percentage point of both RedisVL and Upstash at each library's optimal threshold, across STSb, SICK, PAWS-Wiki, and a real chatbot-prompt dataset from the vCache paper (ICLR 2026).
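A minimal sketch of that shared lookup, in plain Python. The class name, the hand-written vectors, and the 0.20 threshold are all assumptions for illustration; this is not the API of BetterDB, RedisVL, or Upstash, and a real cache would use a vector index rather than a linear scan.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical directions, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory cache: embed, find the nearest stored prompt, compare to a threshold."""

    def __init__(self, embed, threshold):
        self.embed = embed          # any callable mapping prompt -> vector
        self.threshold = threshold  # maximum distance that still counts as a hit
        self.entries = []           # list of (vector, response)

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        if not self.entries:
            return None
        query = self.embed(prompt)
        distance, response = min(
            ((cosine_distance(query, vec), resp) for vec, resp in self.entries),
            key=lambda pair: pair[0],
        )
        # Hit only when the nearest neighbour is at or below the threshold.
        return response if distance <= self.threshold else None


# Hand-written stand-in vectors; a real deployment would call an embedding model.
TOY_VECTORS = {
    "how do I reset my password": [1.0, 0.0, 0.1],
    "how can I reset my password": [0.98, 0.05, 0.12],
    "what is the weather today": [0.0, 1.0, 0.2],
}


def toy_embed(prompt):
    return TOY_VECTORS[prompt]


cache = ToySemanticCache(toy_embed, threshold=0.20)
cache.store("how do I reset my password", "Use the reset link on the login page.")
```

Everything in the sketch except the embedding function and the threshold is mechanical, which is why libraries sharing a model and an engine end up at the same peak F1.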

We are a newer and smaller library than either peer, and matching them on the core lookup is exactly the signal we wanted: the lookup is solved. Where we pull ahead is everything the library actually gets to decide, which is most of the rest of this post. On the dataset that looks most like real chatbot traffic, our self-tuning edged ahead by 1.3 points; everywhere else it is a dead heat, which on a fixed embedding model is correct.

The full per-dataset tables are in the RedisVL comparison and the Upstash comparison.

2. The threshold you copy is probably wrong for your runtime

This was the finding that surprised me most, and we only saw it because we ran against two different engines.

Against RedisVL, same engine and same runtime, the similarity distributions were identical and so were the optimal thresholds. Against Upstash they were not. Both adapters used bge-small-en-v1.5 by name, but our local ONNX runtime and Upstash's server-side runtime produced different score distributions: ours spread across [0, 0.50], theirs compressed into [0, 0.26]. The optimal threshold differed accordingly: 0.20 for us and 0.10 for them.

The takeaway: a similarity threshold is not a constant you can look up. It is a property of your embedding runtime, your data, and your traffic. Copy the number from a blog post or a vendor default and you are very likely running at the wrong cutoff. That is the entire argument for tuning the threshold to your own deployment instead of guessing it, which is what we built next. Details in the Upstash comparison.
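To make the point concrete, here is a small sketch of a threshold sweep over labeled pairs. The distances below are invented for illustration (they are not benchmark data); they only mimic one runtime spreading scores wide and another compressing them. The function names are hypothetical.

```python
def f1_at_threshold(distances, labels, threshold):
    """F1 of 'predict duplicate when distance <= threshold' against true labels."""
    pairs = list(zip(distances, labels))
    tp = sum(1 for d, is_dup in pairs if d <= threshold and is_dup)
    fp = sum(1 for d, is_dup in pairs if d <= threshold and not is_dup)
    fn = sum(1 for d, is_dup in pairs if d > threshold and is_dup)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(distances, labels, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(distances, labels, t))


# Same six labeled pairs scored by two hypothetical runtimes (made-up numbers).
labels = [True, True, True, False, False, False]
runtime_a = [0.05, 0.12, 0.18, 0.24, 0.35, 0.48]   # wide spread
runtime_b = [0.025, 0.06, 0.09, 0.12, 0.175, 0.24]  # compressed spread
candidates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]

best_a = best_threshold(runtime_a, labels, candidates)
best_b = best_threshold(runtime_b, labels, candidates)
```

On this toy data the two runtimes rank the pairs identically, yet the best cutoff differs by a factor of two. Reusing runtime A's threshold on runtime B would let a non-duplicate through as a hit.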

3. Self-tuning works, and the hard part is not the tuning

Our auto-tuning loop watches the live similarity-score distribution and adjusts the threshold up or down, applying high-confidence changes automatically and routing the rest to a human. On the datasets where the starting threshold was bad it gained real F1, up to +2.8% on STSb by loosening, +2.1% on SICK, +2.9% on the chatbot set by tightening. Same code path, no per-dataset configuration, adapting in both directions.

The honest part: the naive version of this loop destroys performance. Our first attempt tightened the threshold five times in a row chasing a signal that was actually noise, dropping F1 from 0.57 to 0.49. The engineering that matters is the safety logic (signal-quality guards, outcome tracking, velocity dampening, and a recall-cost check), which makes the loop either improve on a static threshold or match it exactly, and never make things worse. We walk through all four mechanisms, including the failure that motivated each, in the self-tuning deep-dive.
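As a rough illustration of what such guards can look like, here is a hypothetical sketch covering three of the four ideas: a sample-count guard, a capped step size, and a revert when the measured outcome gets worse. The class, parameter names, and default values are assumptions made up for this example, not the library's implementation. The recall-cost check is left out, and how F1 is measured in production is outside the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedTuner:
    """Hypothetical guarded threshold tuner (illustrative defaults, not tuned values)."""

    threshold: float
    max_step: float = 0.02        # velocity dampening: cap on any single move
    min_samples: int = 200        # signal-quality guard: ignore thin evidence
    min_confidence: float = 0.8   # below this, hand the change to a human
    history: list = field(default_factory=list)

    def propose(self, suggested, samples, confidence):
        if samples < self.min_samples:
            return "ignored"
        if confidence < self.min_confidence:
            return "needs_review"
        # Clamp the move so one noisy estimate cannot swing the threshold far.
        step = max(-self.max_step, min(self.max_step, suggested - self.threshold))
        self.history.append(self.threshold)
        self.threshold = round(self.threshold + step, 4)
        return "applied"

    def record_outcome(self, f1_before, f1_after):
        # Outcome tracking: undo the last applied change if quality dropped.
        if f1_after < f1_before and self.history:
            self.threshold = self.history.pop()
            return "reverted"
        return "kept"


tuner = GuardedTuner(threshold=0.20)
thin = tuner.propose(0.10, samples=50, confidence=0.95)      # too few samples
unsure = tuner.propose(0.10, samples=500, confidence=0.50)   # routed to a human
applied = tuner.propose(0.10, samples=500, confidence=0.95)  # moves by max_step only
after_apply = tuner.threshold
outcome = tuner.record_outcome(f1_before=0.57, f1_after=0.49)
```

The design choice worth noting is that every path either leaves the threshold where it was or moves it by a bounded amount that can be undone, which is one way to get the "match or improve, never worsen" property described above.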

4. Latency is a deployment story, not an algorithm story

Two different latency results, and they mean different things.

Against RedisVL on the same engine, we are about 7x faster on repeated queries (around 0.57ms vs 3.46ms p50), entirely because we cache prompt embeddings and RedisVL recomputes them every call. That is a real library difference and it shows up in any workload with prompt repetition.
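The idea behind that difference can be sketched in a few lines. This is an assumption-laden illustration: the class name is hypothetical, the store is an in-process dict (where the library actually keeps cached embeddings is not covered here), and the embedding function is a stand-in for a real model call.

```python
import hashlib


class CachedEmbedder:
    """Hypothetical wrapper that skips the model call for prompts seen before."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the expensive call, e.g. a local ONNX model
        self.store = {}           # prompt hash -> vector
        self.model_calls = 0      # counts how often the model actually ran

    def embed(self, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.model_calls += 1
            self.store[key] = self.embed_fn(prompt)
        return self.store[key]


def fake_model(prompt):
    # Stand-in for a real embedding model; returns a trivial one-dimensional vector.
    return [float(len(prompt))]


embedder = CachedEmbedder(fake_model)
first = embedder.embed("what is our refund policy")
second = embedder.embed("what is our refund policy")  # served from the dict
third = embedder.embed("how do I cancel my plan")
```

Only exact repeats benefit, since the key is a hash of the prompt text, so the gain scales with how much literal repetition the workload has.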

Against Upstash the numbers look enormous, 48 to 136x, but that is local Valkey against a cloud REST API. We are racing a process on localhost against a network round trip to another region. It is a deployment difference, not an algorithmic one, and we present it that way rather than pretending it is a fair race. The useful version of the claim: run the cache next to your app and you get sub-millisecond lookups a managed cloud vector API structurally cannot match.

5. Adversarial paraphrases beat every cosine cache

On PAWS-Wiki, every adapter we tested plateaus around 61% F1. "Flights from NY to FL" and "flights from FL to NY" have nearly identical embeddings, so no threshold separates them, and no amount of tuning or LLM-judging at default settings recovers signal the embedding never captured. This is not a BetterDB limit or a competitor limit, it is a property of cosine-distance caching. If your workload looks like that, you need a different architecture, not a better cache, and we would rather tell you that than hide it behind friendly datasets.
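A deliberately exaggerated toy shows why no threshold can help once the embedding collapses two prompts together. The word-count "embedding" below is an assumption chosen because it ignores word order entirely. Real models such as bge-small-en-v1.5 are not bags of words and do encode some order, so treat this as the limiting case of "nearly identical", not as a model of their behavior.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Order-blind toy embedding: word -> count."""
    return Counter(text.lower().split())


def sparse_cosine_distance(a, b):
    """Cosine distance between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


outbound = bag_of_words("Flights from NY to FL")
inbound = bag_of_words("flights from FL to NY")
unrelated = bag_of_words("hotels near the beach")

swap_distance = sparse_cosine_distance(outbound, inbound)
unrelated_distance = sparse_cosine_distance(outbound, unrelated)
```

Here the swapped-route pair sits at a distance of effectively zero, so any threshold loose enough to catch genuine paraphrases also returns the wrong flight.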

What the week actually argues

The benchmarks make a single point in five different ways: once everyone is at the quality ceiling, the cache you pick should be decided by what it does around the lookup, not by a fractional F1 delta. That is where we built deliberately.

Our libraries are MIT-licensed and run on Valkey, the open-source Redis fork, and on any RESP-compatible endpoint. They ship adapters for seven frameworks (including OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, and the Vercel AI SDK) with TypeScript and Python parity, and five embedding providers out of the box. Every operation emits OpenTelemetry spans and Prometheus metrics with no extra wiring, and the library tracks the dollars each cache hit saved against a bundled price table. None of the peers ship that combination. One honest caveat: the semantic cache needs valkey-search (Valkey 8+), so on stock ElastiCache or MemoryDB you would use our exact-match cache instead.
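The savings arithmetic is simple enough to sketch. The model name and per-token prices below are placeholders invented for this example, not real vendor pricing and not the contents of the bundled price table; the function name is likewise hypothetical.

```python
# Placeholder prices in dollars per million tokens (made up for illustration).
PRICE_PER_MILLION_TOKENS = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def dollars_saved(model, input_tokens, output_tokens):
    """Cost of the LLM call that a cache hit avoided."""
    price = PRICE_PER_MILLION_TOKENS[model]
    cost = input_tokens * price["input"] + output_tokens * price["output"]
    return cost / 1_000_000


# A hit that avoided a 1,000-token prompt and a 500-token completion.
saved = dollars_saved("example-model", input_tokens=1_000, output_tokens=500)
```

Summed over every hit, that figure is what turns a hit rate into a number a budget owner can read.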

What is next

More datasets (MultiNLI for per-category tuning, larger sample sizes), additional embedding models including Redis's domain-specific one, and a batch of self-optimization features landing across both cache packages over the next couple of weeks. We will benchmark each the same way, including the parts where we only tie.

All numbers are reproducible. The harness, dataset loaders, and raw output are open source: Python harness and TypeScript harness. The three deep-dives: vs RedisVL, self-tuning, vs Upstash. The libraries are MIT: @betterdb/semantic-cache (npm), betterdb-semantic-cache (PyPI).