We gave our semantic cache a self-tuning loop and benchmarked it on two academic datasets. Here is what happened.

Last week we published a benchmark comparing BetterDB's semantic cache (npm, PyPI) against RedisVL's SemanticCache. The conclusion was that quality is at parity when both caches run at the same static threshold. This post is about what happens when one cache can change its own threshold.

BetterDB ships a threshold autotuning loop: the cache collects similarity scores in a rolling window, BetterDB Monitor analyzes the distribution, proposes a threshold adjustment with reasoning attached, and applies it to Valkey. The running cache picks up the change on its next configRefresh poll. No restart. No deploy. No human in the loop unless you want one.

We wanted to know: does this actually help? Or does it just fiddle with the knob and make things worse?

So we ran it on two academic benchmarks that the self-tuning system had never seen, and measured everything.

The datasets

We deliberately picked datasets that were not designed for semantic caching. If the autotuner only works on cache-friendly data, it is not useful in production.

STS Benchmark (STSb): 8,628 sentence pairs from news headlines, image captions, and user forums. Each pair has a human-annotated similarity score from 0 to 5. We used 5,000 pairs. The score distribution is continuous and spread out - lots of pairs in the ambiguous middle where "is this a match?" has no clean answer. This is the dataset that stress-tests threshold decisions.

SICK (Sentences Involving Compositional Knowledge): 9,927 sentence pairs designed to test compositional semantics. "A man is playing a flute" vs "A man is playing a bamboo flute." "A group of kids is playing in a yard" vs "A group of boys in a yard is playing." Short sentences with subtle compositional differences. We used all 9,927 pairs. The similarity distribution has a dense band between scores 3 and 4 - the ambiguous middle that forces the autotuner to make decisions.

Both datasets have continuous similarity scores, not binary labels. We converted to binary using a match threshold of 0.6 (normalized): pairs scoring 3.0/5.0 or higher are considered semantic matches for ground-truth evaluation.
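As a minimal sketch of that conversion (the function name and the example scores are ours for illustration, not taken from the benchmark harness):

```python
def is_ground_truth_match(score: float, max_score: float = 5.0,
                          match_threshold: float = 0.6) -> bool:
    """Normalize a human similarity score to [0, 1] and apply the match cutoff."""
    return (score / max_score) >= match_threshold


# Illustrative scores only, not rows from STSb or SICK.
example_scores = [0.0, 2.8, 3.0, 4.6]
labels = [is_ground_truth_match(s) for s in example_scores]
print(labels)  # [False, False, True, True]
```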

What we tested

Five modes of BetterDB, plus RedisVL as the static baseline:

RedisVL bare: cosine distance threshold only. No quality features. This is the control.
BetterDB bare: same cosine distance threshold. Same embedding model. Should match RedisVL.
BetterDB local: adds top-3 candidate retrieval and keyword-overlap reranking. No external APIs.
BetterDB full: adds an LLM-as-judge gate (gpt-4o-mini) on uncertain hits within the uncertainty band.
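BetterDB autotune: bare cosine + Monitor-driven threshold autotuning. The cache observes its own similarity score distribution and adjusts the threshold via the Monitor's propose-approve API.
BetterDB autotune-full: threshold autotuning combined with the LLM-as-judge gate. It appears as "autotune-full" in the results below.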

Five cosine distance thresholds: 0.10 (strict), 0.15, 0.20, 0.30, 0.40 (loose). Same embedding model everywhere: sentence-transformers/all-MiniLM-L6-v2.
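To make the scoring concrete, here is a rough sketch of how F1 at a given cosine distance threshold can be computed. It assumes a lookup counts as a hit when the best candidate's distance is at or below the threshold, and that hits and misses are compared against the binary ground-truth label. The Lookup type and the toy data are hypothetical, not the harness's own code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Lookup:
    distance: float  # cosine distance to the best cached candidate
    is_match: bool   # ground-truth label for the pair


def score_at_threshold(lookups: List[Lookup], threshold: float) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) when hits are distances <= threshold."""
    tp = fp = fn = 0
    for item in lookups:
        hit = item.distance <= threshold  # assumption: boundary counts as a hit
        if hit and item.is_match:
            tp += 1
        elif hit:
            fp += 1
        elif item.is_match:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data for illustration only.
toy = [Lookup(0.05, True), Lookup(0.12, True), Lookup(0.18, False),
       Lookup(0.25, True), Lookup(0.35, False)]
for t in (0.10, 0.40):
    print(t, score_at_threshold(toy, t))
```

On the toy data, the strict threshold is precise but misses two of three true matches, while the loose one recalls everything at the cost of two false hits. That is the trade-off the thresholds above are sweeping.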

STSb results: the autotuner adapts in both directions

Threshold	RedisVL (static)	BetterDB autotune	Final threshold	vs RedisVL
0.10	0.4218	0.4214	0.10 (held)	-0.1%
0.15	0.5679	0.5672	0.15 (held)	-0.1%
0.20	0.6727	0.6918	0.22 (loosened)	+2.8%
0.30	0.7366	0.7349	0.30 (held)	-0.2%
0.40	0.7324	0.7347	0.30 (tightened)	+0.3%

At 0.20 the autotuner loosened to 0.22, picking up 4.4 percentage points of recall without losing precision. At 0.40 it tightened to 0.30 - the natural F1 optimum for this dataset - improving precision from 59.3% to 64.4%. At every other threshold it held steady and stayed within 0.2% of the RedisVL baseline.
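The "vs RedisVL" column reads as a relative change in F1, not a difference in percentage points. The published figures are consistent with that reading, as this small check shows (the helper name is ours):

```python
def relative_delta_pct(baseline_f1: float, candidate_f1: float) -> float:
    """Relative F1 change of the candidate over the baseline, in percent."""
    return (candidate_f1 - baseline_f1) / baseline_f1 * 100.0


# F1 values copied from the STSb table above.
print(round(relative_delta_pct(0.6727, 0.6918), 1))  # 2.8
print(round(relative_delta_pct(0.7324, 0.7347), 1))  # 0.3
print(round(relative_delta_pct(0.4218, 0.4214), 1))  # -0.1
```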

The story with all modes on STSb:

Threshold	Winner	F1	vs RedisVL
0.10	RedisVL bare	0.4218	-
0.15	RedisVL bare	0.5679	-
0.20	BetterDB autotune	0.6918	+2.8%
0.30	BetterDB autotune-full	0.7420	+0.7%
0.40	BetterDB full	0.7488	+2.2%

At tight thresholds everyone is equivalent - the cache barely hits, so there is nothing to tune or judge. The differentiation appears at 0.20 and above, where the autotuner and the judge have enough data to work with.

SICK results: loosening when the threshold is too strict

Threshold	RedisVL (static)	BetterDB autotune	Final threshold	vs RedisVL
0.10	0.8205	0.8376	0.145 (loosened)	+2.1%
0.15	0.8421	0.8493	0.186 (loosened)	+0.8%
0.20	0.8556	0.8553	0.223 (loosened)	-0.0%
0.30	0.8664	0.8627	0.30 (held)	-0.4%
0.40	0.8682	0.8629	0.30 (tightened)	-0.6%

SICK is the opposite pattern from STSb. The autotuner loosened at every tight threshold - 0.10 to 0.145, 0.15 to 0.186, 0.20 to 0.223 - and improved F1 at 0.10 and 0.15 while staying flat at 0.20. The gains come entirely from recall: precision held stable at ~0.77 across every mode and every threshold, meaning the score distribution cleanly separates true positives from false positives. The autotuner's job was to find the right cutoff, and it found it by loosening.

At 0.40 it correctly tightened to 0.30, converging to the same point from the other direction.

The full SICK breakdown across all modes:

Mode	Avg F1 delta vs bare
local (rerank)	+0.00%
full (judge)	-0.82%
autotune	+0.57%

The rerank adds nothing measurable on SICK - pairs are short sentences where keyword overlap provides no signal beyond cosine distance. The judge hurts slightly - it rejects valid matches on compositionally similar short sentences. The autotuner is the only mode with a positive average.

SemBenchmarkLmArena: real chatbot prompts

We also ran the autotuner on the two datasets from our previous benchmark. SemBenchmarkLmArena is a dataset of real chatbot prompts from the vCache paper (ICLR 2026) - 5,000 pairs grouped into equivalence classes.

Threshold	Bare F1	Autotune F1	Final threshold	Delta
0.10	0.6586	0.6586	0.10 (held)	+0.0%
0.15	0.7054	0.7024	0.177 (loosened)	-0.4%
0.20	0.7194	0.7152	0.20 (held)	-0.6%
0.30	0.7271	0.7172	0.30 (held)	-1.4%
0.40	0.6998	0.7202	0.30 (tightened)	+2.9%

The autotuner's best result: tightening from 0.40 to 0.30, gaining +2.9% F1 with precision improving from 54.1% to 56.9%. This is the same pattern as STSb at 0.40 - a too-loose cache converging to its natural optimum.

At 0.30 and below the autotuner held steady or made marginal moves. The slight regression at 0.15 (-0.4%) came from a loosening that traded precision for recall without a net F1 gain - the outcome tracking stopped it from going further.

PAWS-Wiki: the wall holds

PAWS-Wiki is the adversarial paraphrase dataset from our previous benchmark - sentence pairs that share most words but mean different things. We ran 8,000 pairs.

Threshold	Bare F1	Autotune F1	Final threshold	Delta
0.10	0.6120	0.6124	0.144 (loosened)	+0.1%
0.15	0.6126	0.6124	0.172 (loosened)	-0.0%
0.20	0.6129	0.6124	0.206 (loosened)	-0.1%
0.30	0.6124	0.6124	0.30 (held)	+0.0%
0.40	0.6125	0.6125	0.40 (held)	+0.0%

F1 is ~0.61 everywhere regardless of threshold or autotuning. The autotuner made small loosening moves at tight thresholds but they changed nothing - the TP/FP distributions are completely overlapped at every threshold. At 0.30 and 0.40 the autotuner correctly held steady.

This confirms what we said in the previous article: PAWS is a wall for every cosine-distance cache. No amount of threshold tuning can separate "flights from NY to FL" from "flights from FL to NY" when the embeddings are nearly identical. The autotuner does not make it worse, but it cannot help either. If your workload looks like PAWS, the answer is a different architecture (cross-encoder rerank, structural parsing, or domain-specific embeddings), not a better threshold.

What the autotuner actually does

The autotuning loop runs inside BetterDB Monitor, not inside the cache library. The flow:

Observe: the cache SDK writes every similarity score to a __similarity_window sorted set in Valkey, tagged with hit/miss and timestamp.
Analyze: the Monitor reads the window and computes signal metrics - uncertain hit rate (hits near the threshold boundary), distant hit rate (weak matches in the upper half of the acceptance range), near-miss rate (misses just above the threshold).
Decide: if a signal exceeds its trigger level, the Monitor generates a recommendation with reasoning. If not, it reports "optimal" and does nothing.
Propose: the recommendation becomes a durable proposal stored in the Monitor with full audit trail.
Apply: on approval, the Monitor writes the new threshold to {cache}:__config in Valkey. The SDK's configRefresh picks it up on the next poll.
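The Analyze step can be sketched roughly as below. The exact definitions are assumptions on our part for illustration: the width of the uncertainty band, the choice of denominators (hits for the two hit signals, misses for the near-miss signal), and all names are hypothetical and not the Monitor's actual implementation. The window is modeled as an in-memory list of (distance, hit) entries instead of a Valkey sorted set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Signals:
    hit_rate: float
    uncertain_hit_rate: float  # hits close to the threshold boundary
    distant_hit_rate: float    # hits in the upper half of the acceptance range
    near_miss_rate: float      # misses just above the threshold


def _rate(count: int, total: int) -> float:
    return count / total if total else 0.0


def analyze_window(window: List[Tuple[float, bool]], threshold: float,
                   band: float = 0.02) -> Signals:
    """Compute signal metrics from (cosine_distance, was_hit) entries.

    `band` is an assumed width for "near the threshold".
    """
    hits = [d for d, was_hit in window if was_hit]
    misses = [d for d, was_hit in window if not was_hit]
    uncertain = sum(1 for d in hits if d >= threshold - band)
    distant = sum(1 for d in hits if d > threshold / 2)
    near = sum(1 for d in misses if d <= threshold + band)
    return Signals(
        hit_rate=_rate(len(hits), len(window)),
        uncertain_hit_rate=_rate(uncertain, len(hits)),
        distant_hit_rate=_rate(distant, len(hits)),
        near_miss_rate=_rate(near, len(misses)),
    )


# Toy window for illustration only.
window = [(0.05, True), (0.08, True), (0.12, True), (0.19, True), (0.20, True),
          (0.21, False), (0.30, False), (0.50, False)]
print(analyze_window(window, threshold=0.20))
```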

The benchmark adapter auto-approves proposals immediately. In production you can require human review, set confidence thresholds for auto-approval, or both.

The safety mechanisms that made this work

The naive version of this loop - "see signal, adjust threshold" - destroys performance. We know because we tested it. The first version of the autotuner on STSb tightened the threshold five consecutive times from 0.15 down to the 0.02 floor clamp, reducing F1 from 0.57 to 0.49. It saw "distant hits" in the similarity window, said "tighten," and kept saying it because the signal never went away.

Four mechanisms prevent this in the version we benchmarked:

Signal quality guards. Before recommending tighten, check whether the triggering signal is actually strong enough to justify action. A 20% uncertain-hit rate with 69% overall hit rate means only 14% of all operations are uncertain - that is noise, not a signal. We require the uncertain fraction of all operations (not just hits) to exceed 15% before tightening, and require the hit rate to exceed 80% before treating distant hits as a tighten signal. This prevented a bad tighten on STSb that would have dropped F1 by 1.5%.
Depending on how future tests go, we will expose these as configurable options, with the best-performing settings from our testing as the defaults.
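The arithmetic behind the guard is simple enough to sketch. The 15% and 80% cutoffs are the ones stated above; the function names are hypothetical, and strict "greater than" comparisons are our reading of "exceed":

```python
def uncertain_fraction_of_all_ops(uncertain_hit_rate: float, hit_rate: float) -> float:
    """Share of all cache operations that are uncertain hits."""
    return uncertain_hit_rate * hit_rate


def may_tighten_on_uncertain_hits(uncertain_hit_rate: float, hit_rate: float,
                                  min_fraction: float = 0.15) -> bool:
    return uncertain_fraction_of_all_ops(uncertain_hit_rate, hit_rate) > min_fraction


def may_tighten_on_distant_hits(hit_rate: float, min_hit_rate: float = 0.80) -> bool:
    return hit_rate > min_hit_rate


# The example from the text: 20% uncertain hits at a 69% hit rate.
print(round(uncertain_fraction_of_all_ops(0.20, 0.69), 2))  # 0.14
print(may_tighten_on_uncertain_hits(0.20, 0.69))            # False
print(may_tighten_on_distant_hits(0.69))                    # False
```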

Outcome tracking. After each adjustment, compare the current signal metrics against the snapshot recorded at the time of the previous adjustment. If the signal that triggered the last tighten did not improve by at least 20%, further tightening is declared ineffective and the engine returns "optimal" instead of doubling down. This caught a second tighten attempt on STSb where the first one had not helped.
We are also experimenting with more sophisticated versions of this approach so that the engine can roll back more effectively when needed.
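A minimal sketch of the outcome check, assuming "improve" means a relative drop in the triggering signal between the snapshot and the current reading (names are ours, not the engine's):

```python
def tighten_was_effective(previous_signal: float, current_signal: float,
                          min_improvement: float = 0.20) -> bool:
    """True if the triggering signal dropped by at least `min_improvement`
    (relative) since the snapshot taken at the previous adjustment."""
    if previous_signal <= 0.0:
        return False  # nothing to improve on, so do not keep tightening
    improvement = (previous_signal - current_signal) / previous_signal
    return improvement >= min_improvement


def next_action(previous_signal: float, current_signal: float) -> str:
    # When the last tighten did not help, report "optimal" instead of doubling down.
    return "tighten" if tighten_was_effective(previous_signal, current_signal) else "optimal"


print(next_action(0.30, 0.28))  # optimal: the signal barely moved
print(next_action(0.30, 0.20))  # tighten: the signal dropped by a third
```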

Velocity dampening. Progressive step-size reduction for consecutive same-direction adjustments: the first step is full-sized, the second is 67%, the third is 50%, and so on. After five consecutive adjustments in the same direction without a reversal, the engine declares "optimal" and stops. This is the backstop - it fires rarely because the outcome check usually stops things earlier.
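One closed form that reproduces the 100%, 67%, 50% sequence is a factor of 2/(n+1) for the n-th consecutive step. Treat that formula as an illustrative assumption about "and so on", not the engine's confirmed schedule; the function and constant names are hypothetical too.

```python
from typing import Optional

MAX_CONSECUTIVE = 5  # after this many same-direction moves, stop adjusting


def dampened_step(base_step: float, consecutive_so_far: int) -> Optional[float]:
    """Step size for the next adjustment in the same direction.

    `consecutive_so_far` is the number of adjustments already made in this
    direction without a reversal. Returns None when the engine should
    declare "optimal" and stop.
    """
    if consecutive_so_far >= MAX_CONSECUTIVE:
        return None
    n = consecutive_so_far + 1          # 1-based index of the upcoming step
    return base_step * 2.0 / (n + 1)    # assumed schedule: 1, 2/3, 1/2, 2/5, 1/3


for done in range(6):
    print(done, dampened_step(0.03, done))
```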

Recall-cost guard. Before committing to a tighten, count how many current hits would be lost at the proposed new threshold. If the estimated hit loss exceeds 15%, the tighten is blocked with a message: "the true-positive and false-positive score distributions overlap too much for threshold adjustment alone - consider enabling reranking or an LLM judge."
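A sketch of the recall-cost check, assuming the estimate is the share of hits in the current window whose cosine distance would fall outside the proposed threshold (the helper names and the toy distances are hypothetical):

```python
from typing import List


def estimated_hit_loss(hit_distances: List[float], proposed_threshold: float) -> float:
    """Fraction of current hits that would become misses at the new threshold."""
    if not hit_distances:
        return 0.0
    lost = sum(1 for d in hit_distances if d > proposed_threshold)
    return lost / len(hit_distances)


def tighten_allowed(hit_distances: List[float], proposed_threshold: float,
                    max_loss: float = 0.15) -> bool:
    return estimated_hit_loss(hit_distances, proposed_threshold) <= max_loss


# Toy distances for hits recorded under a 0.20 threshold.
hits = [0.02, 0.05, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.19, 0.20]
print(tighten_allowed(hits, 0.195))  # True: only 1 of 10 hits would be lost
print(tighten_allowed(hits, 0.185))  # False: 2 of 10 hits would be lost
```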

These mechanisms are what turn a threshold-adjustment loop into a system you can leave running. With them, the autotuner either improves on bare performance or stays close to it: the largest regression in the tables above is 1.4% F1, on SemBenchmarkLmArena at 0.30.

What we learned

The autotuner adapts bidirectionally without configuration. It tightened on STSb and SemBenchmarkLmArena (where the cache was too loose at 0.40) and loosened on SICK (where it was too tight). Same code path, no dataset-specific parameters. This is the property that makes it useful in production where query distributions shift over time.

Static thresholds leave performance on the table. On STSb at 0.20, the autotuner gained +2.8% F1 by loosening to 0.22. On SICK at 0.10, it gained +2.1% by loosening to 0.145. On SemBenchmarkLmArena at 0.40, it gained +2.9% by tightening to 0.30. These are not cherry-picked numbers - they are the thresholds where the starting configuration happened to be suboptimal, which in production is most of the time.

The judge and the autotuner solve different problems. The autotuner adjusts the decision boundary. The judge re-evaluates individual decisions near the boundary. On STSb at 0.30, the combination (autotune-full) won at +0.7% F1. On SICK, the judge consistently hurt because the dataset has clean TP/FP separation that threshold tuning alone can handle. The right answer depends on the data.

"Do no harm" is the hardest requirement. At thresholds where bare performance was already optimal (0.30 on SICK, 0.15 on STSb), the autotuner needed to recognize that and do nothing. Every percentage point of degradation at an already-good threshold erodes trust. The safety mechanisms are what make the difference between a useful feature and a liability. This is also why the autotuner is deliberately more conservative about loosening than tightening - loosening lets in more cached responses that might be wrong, which is the kind of silent failure that erodes user trust. Tightening too aggressively just means more cache misses, which costs latency but does not serve incorrect answers. The asymmetry is intentional.

Raw numbers

The benchmark harness, dataset loaders (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki), and all adapter code are open source at github.com/BetterDB-inc/monitor/packages/cache-benchmark. The datasets are public on Hugging Face.

What is next

MultiNLI: 433,000 sentence pairs across 10 genre categories (fiction, government, telephone, travel, etc.). This is the dataset for testing per-category threshold tuning - a feature the Monitor supports but we have not yet benchmarked.
Larger sample sizes: 50,000+ pairs, multiple embedding models, including Redis's langcache-embed-v3-small.
npm-side benchmark: @betterdb/semantic-cache vs @upstash/semantic-cache on the same datasets.

All numbers in this post are reproducible. The benchmark harness, raw JSON output, and exact commands are at github.com/BetterDB-inc/monitor/packages/cache-benchmark. STSb and SICK are loaded from Hugging Face (mteb/stsbenchmark-sts and mteb/sickr-sts). The autotuning loop runs through BetterDB Monitor's API.