
What BetterDB Exposes to Prometheus (And Why It Matters)

Kristiyan Ivanov

A practical guide to BetterDB's 99 Prometheus metrics - what they cover, how to configure scraping, and PromQL queries worth keeping around.

Most teams running Valkey in production have the same blind spot: great visibility into what's happening right now, and almost none for what happened in the past. Valkey's built-in operational data - slowlogs, command logs, client connections, ACL events - is ephemeral by design. It lives in a ring buffer. When it scrolls out, it's gone.

The result is a familiar pattern: something goes wrong at 3am, on-call gets paged, and by the time anyone looks at it the evidence has already been overwritten. You're left correlating timestamps, guessing at causes, and hoping it doesn't happen again.

This is the problem BetterDB is built to solve. The agent runs inside your VPC, continuously polls Valkey's operational data, and persists it so you can query it later. The Prometheus integration is the other half of that story - it takes that same data and feeds it into the observability stack you already have. If your team lives in Grafana and Alertmanager, you don't need a new dashboard to benefit from what BetterDB collects. You just add a scrape target.

This post covers what we export, how to configure scraping, and a set of PromQL queries worth keeping around.

The Endpoint

GET /prometheus/metrics
Content-Type: text/plain; version=0.0.4; charset=utf-8

All metrics are prefixed with betterdb_. Standard Node.js process metrics from prom-client are included with the same prefix. Metrics are computed on-demand at scrape time.

Add this to your prometheus.yml:

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets: ['localhost:3001']
    metrics_path: '/prometheus/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

For multi-instance setups:

scrape_configs:
  - job_name: 'betterdb'
    static_configs:
      - targets:
        - 'betterdb-prod-1:3001'
        - 'betterdb-prod-2:3001'
        - 'betterdb-staging:3001'
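
When one Prometheus scrapes several environments, attaching an environment label at scrape time keeps every query and alert in this post scopeable. A sketch - the env label name and its values are illustrative conventions, not something BetterDB requires:

```yaml
scrape_configs:
  - job_name: 'betterdb'
    metrics_path: '/prometheus/metrics'
    static_configs:
      # Labels set here are attached to every series from these targets
      - targets: ['betterdb-prod-1:3001', 'betterdb-prod-2:3001']
        labels:
          env: 'prod'
      - targets: ['betterdb-staging:3001']
        labels:
          env: 'staging'
```

Queries can then be narrowed, e.g. betterdb_memory_used_bytes{env="prod"}.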

What Gets Exported

Slowlog Patterns

Raw slowlog length is close to useless as an alert condition - it tells you something is slow, not what. BetterDB aggregates slowlog entries into command patterns before exporting them:

betterdb_slowlog_length
betterdb_slowlog_last_id
betterdb_slowlog_pattern_count{pattern="HGETALL *"}
betterdb_slowlog_pattern_avg_duration_us{pattern="HGETALL *"}
betterdb_slowlog_pattern_percentage{pattern="HGETALL *"}

This means you can alert on a specific pattern spiking rather than the aggregate. HGETALL * accounting for 35% of your slowlog is actionable. slowlog_length > 100 is not.

# Top 5 slow command patterns
topk(5, betterdb_slowlog_pattern_count)

# Slowest by average duration
topk(5, betterdb_slowlog_pattern_avg_duration_us)
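
Because betterdb_slowlog_pattern_percentage already expresses "share of the slowlog", a dominance alert needs no extra math. A sketch of a Prometheus alerting rule - the 30% threshold and 10m hold are illustrative starting points:

```yaml
groups:
  - name: valkey-slowlog
    rules:
      - alert: ValkeySlowlogPatternDominant
        # One command pattern accounts for an outsized share of slow queries
        expr: betterdb_slowlog_pattern_percentage > 30
        for: 10m
        annotations:
          summary: "Pattern {{ $labels.pattern }} accounts for {{ $value }}% of the slowlog"
```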

COMMANDLOG (Valkey 8.1+)

COMMANDLOG is Valkey-specific and tracks commands by payload size rather than execution time. A command that returns 50MB of data in 2ms won't show up in SLOWLOG. It will show up here.

betterdb_commandlog_large_request
betterdb_commandlog_large_reply
betterdb_commandlog_large_request_by_pattern{pattern="MSET *"}
betterdb_commandlog_large_reply_by_pattern{pattern="LRANGE *"}

# Correlate large reply volume with output bandwidth pressure
betterdb_commandlog_large_reply_by_pattern
  * on() group_left()
  betterdb_instantaneous_output_kbps

If you're on Redis or Valkey < 8.1, these metrics simply won't appear - no errors, no zeros, just no data. The same logic applies for cluster slot metrics below.

Memory

betterdb_memory_used_bytes
betterdb_memory_used_rss_bytes
betterdb_memory_used_peak_bytes
betterdb_memory_max_bytes
betterdb_memory_fragmentation_ratio
betterdb_memory_fragmentation_bytes

# How close are you to maxmemory? (only meaningful when maxmemory is set; 0 means unlimited)
betterdb_memory_used_bytes / betterdb_memory_max_bytes * 100

# Fragmentation worth investigating (> 1.5 is a signal)
betterdb_memory_fragmentation_ratio > 1.5

Alert on memory utilization before your eviction policy kicks in, not after. Eviction is a last resort, not a steady state.

Throughput and Hit Rate

betterdb_instantaneous_ops_per_sec
betterdb_instantaneous_input_kbps
betterdb_instantaneous_output_kbps
betterdb_keyspace_hits_total
betterdb_keyspace_misses_total
betterdb_commands_processed_total
betterdb_connections_received_total
betterdb_evicted_keys_total
betterdb_expired_keys_total

# Cache hit rate
(betterdb_keyspace_hits_total /
  (betterdb_keyspace_hits_total + betterdb_keyspace_misses_total)) * 100

# Total network throughput
betterdb_instantaneous_input_kbps + betterdb_instantaneous_output_kbps

# Eviction rate (non-zero is a warning)
rate(betterdb_evicted_keys_total[5m])
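
The hit-rate query above divides raw counters, so it gives the lifetime ratio since the last restart. For dashboards and alerts, a windowed variant set up as a recording rule is usually more useful - a sketch, with a rule name that is our own convention:

```yaml
groups:
  - name: valkey-recording
    rules:
      # 5-minute cache hit rate, precomputed for reuse in panels and alerts
      - record: betterdb:cache_hit_rate_5m:percent
        expr: |
          100 * rate(betterdb_keyspace_hits_total[5m])
            / (rate(betterdb_keyspace_hits_total[5m]) + rate(betterdb_keyspace_misses_total[5m]))
```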

Client Connections

betterdb_client_connections_current
betterdb_client_connections_peak
betterdb_client_connections_by_name{client_name="api-service"}
betterdb_client_connections_by_user{user="app-user"}
betterdb_connected_clients
betterdb_blocked_clients

# Peak vs current - gap closing is a signal
betterdb_client_connections_peak - betterdb_client_connections_current

# New-connection churn (pair with betterdb_client_connections_current - flat traffic plus rising connections points to a leak)
rate(betterdb_connections_received_total[5m])

# Blocked clients (should be near zero)
betterdb_blocked_clients > 0

betterdb_client_connections_by_name is particularly useful for tracking down which service is leaking connections. Cardinality scales with the number of unique client names, so if that's a concern, use metric_relabel_configs to drop or aggregate the label.

ACL Audit

betterdb_acl_denied
betterdb_acl_denied_by_reason{reason="command"}
betterdb_acl_denied_by_user{username="readonly-user"}

Reason labels are auth, command, key, and channel. A spike in auth denials at odd hours is worth an alert.
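
A sketch of that alert - this assumes the denial metrics behave as cumulative counters (so increase() applies), and the threshold and window are illustrative:

```yaml
groups:
  - name: valkey-acl
    rules:
      - alert: ValkeyAuthDenialSpike
        # Burst of failed AUTH attempts - possible credential scanning
        expr: increase(betterdb_acl_denied_by_reason{reason="auth"}[10m]) > 10
        annotations:
          summary: "{{ $value }} auth denials in the last 10 minutes"
```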

Keyspace

betterdb_db_keys{db="db0"}
betterdb_db_keys_expiring{db="db0"}
betterdb_db_avg_ttl_seconds{db="db0"}

# Expiry ratio - low means keys are accumulating without TTLs
betterdb_db_keys_expiring / betterdb_db_keys * 100

Replication

betterdb_connected_slaves
betterdb_replication_offset
betterdb_master_link_up
betterdb_master_last_io_seconds_ago

# Replica lag - alert if > 10s
betterdb_master_last_io_seconds_ago > 10

# Link down
betterdb_master_link_up == 0

Cluster (Valkey 8.0+)

betterdb_cluster_enabled
betterdb_cluster_known_nodes
betterdb_cluster_size
betterdb_cluster_slots_assigned
betterdb_cluster_slots_ok
betterdb_cluster_slots_fail
betterdb_cluster_slots_pfail
betterdb_cluster_slot_keys{slot="1234"}
betterdb_cluster_slot_reads_total{slot="1234"}
betterdb_cluster_slot_writes_total{slot="1234"}

# Cluster health percentage
(betterdb_cluster_slots_ok / betterdb_cluster_slots_assigned) * 100

# Any failing slots
betterdb_cluster_slots_fail + betterdb_cluster_slots_pfail > 0

# Hot slot detection
topk(10, betterdb_cluster_slot_reads_total + betterdb_cluster_slot_writes_total)

Slot metrics are capped at the top 100 slots by key count to keep cardinality manageable.

Anomaly Detection

BetterDB runs statistical anomaly detection on your Valkey metrics (connections, ops/sec, memory, slowlog count, evictions, ACL denials, and more) and surfaces the results directly as Prometheus metrics:

betterdb_anomaly_events_total{severity="critical", metric_type="memory_used", anomaly_type="spike"}
betterdb_anomaly_events_current{severity="warning"}
betterdb_anomaly_by_severity{severity="critical"}
betterdb_anomaly_by_metric{metric_type="slowlog_count"}
betterdb_correlated_groups_by_pattern{pattern="memory_pressure"}
betterdb_anomaly_buffer_ready{metric_type="ops_per_sec"}
betterdb_anomaly_buffer_mean{metric_type="connections"}
betterdb_anomaly_buffer_stddev{metric_type="connections"}

The detection system needs 30 samples to warm up (30 seconds at the default 1s poll rate). betterdb_anomaly_buffer_ready tells you which metric types are live.

# Anomaly rate over 5 minutes
rate(betterdb_anomaly_events_total[5m])

# Current critical anomalies
betterdb_anomaly_by_severity{severity="critical"} > 0

# Memory pressure pattern detected
betterdb_correlated_groups_by_pattern{pattern="memory_pressure"} > 0

Correlated pattern values include traffic_burst, batch_job, memory_pressure, slow_queries, auth_attack, connection_leak, cache_thrashing, node_failover. When multiple metrics spike together in a recognizable pattern, BetterDB surfaces the correlation rather than firing individual alerts for each.
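
That makes the pattern label itself a good alert target. A sketch matching a few security- and stability-relevant patterns - the selection is illustrative; tune it to what you care about:

```yaml
groups:
  - name: valkey-anomaly-patterns
    rules:
      - alert: ValkeyCorrelatedAnomalyPattern
        # Fires once per recognized pattern rather than per individual metric spike
        expr: betterdb_correlated_groups_by_pattern{pattern=~"auth_attack|connection_leak|memory_pressure"} > 0
        for: 2m
        annotations:
          summary: "Correlated anomaly pattern detected: {{ $labels.pattern }}"
```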

A Few Alerts Worth Setting Up

BetterDB ships with ready-to-use Prometheus alerting rules at docs/alertmanager-rules.yml. A starting point:

groups:
  - name: valkey
    rules:
      - alert: ValkeyMemoryHigh
        expr: betterdb_memory_used_bytes / betterdb_memory_max_bytes > 0.85
        for: 5m

      - alert: ValkeyEvictionActive
        expr: rate(betterdb_evicted_keys_total[5m]) > 0
        for: 2m

      - alert: ValkeySlowQueryPatternSpike
        expr: increase(betterdb_slowlog_pattern_count[5m]) > 20

      - alert: ValkeyReplicaLag
        expr: betterdb_master_last_io_seconds_ago > 10

      - alert: ValkeyAnomalyDetected
        expr: betterdb_anomaly_by_severity{severity="critical"} > 0
        for: 1m

      - alert: ValkeyClusterSlotFailing
        expr: betterdb_cluster_slots_fail > 0

Cardinality Notes

A few metrics can grow if you have a lot of unique label values:

  • betterdb_client_connections_by_name - one series per unique client name
  • betterdb_client_connections_by_user - one series per unique username
  • betterdb_acl_denied_by_user - one series per username with auth failures
  • betterdb_cluster_slot_* - capped at 100 slots automatically

If the first three become a problem, use metric_relabel_configs to drop or aggregate labels before they hit your TSDB.
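
Dropping one of those series before it reaches storage might look like this - a sketch; narrow the regex if you only want to shed specific metrics:

```yaml
scrape_configs:
  - job_name: 'betterdb'
    metrics_path: '/prometheus/metrics'
    static_configs:
      - targets: ['localhost:3001']
    metric_relabel_configs:
      # Drop the per-client-name series entirely before ingestion
      - source_labels: [__name__]
        regex: 'betterdb_client_connections_by_name'
        action: drop
```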

Wrapping Up

Valkey's operational data is ephemeral by default. Slowlogs scroll out, command logs reset, client lists change. Shipping it to Prometheus gives you the timeline to reconstruct what actually happened - whether that's a 3am latency spike, a gradual memory trend, or a connection leak that built up over days.

If you're already running Grafana and Alertmanager, this drops straight into what you have. No new dashboards to learn, no new tooling to justify. For teams that want to go deeper - pattern analysis, historical slowlog trends, client analytics, anomaly timelines - all of that is available in the BetterDB platform directly.

The full metric reference is at docs.betterdb.com/prometheus-metrics.html.