Monitoring
Effective monitoring is essential for maintaining ThunderDB in production. This guide covers the built-in metrics endpoint, health checks, structured logging, and how to set up a complete monitoring stack with Prometheus, Grafana, and alerting.
Prometheus Metrics Endpoint
ThunderDB exposes metrics in Prometheus format at the HTTP admin endpoint:
GET http://<host>:8088/admin/metrics
Example Request
curl http://localhost:8088/admin/metrics
Example Response
# HELP thunderdb_query_total Total number of queries executed
# TYPE thunderdb_query_total counter
thunderdb_query_total{protocol="pg",status="success"} 1542893
thunderdb_query_total{protocol="pg",status="error"} 127
thunderdb_query_total{protocol="mysql",status="success"} 89421
thunderdb_query_total{protocol="resp",status="success"} 2345678
# HELP thunderdb_query_duration_seconds Query execution time in seconds
# TYPE thunderdb_query_duration_seconds histogram
thunderdb_query_duration_seconds_bucket{protocol="pg",le="0.001"} 892345
thunderdb_query_duration_seconds_bucket{protocol="pg",le="0.01"} 1234567
thunderdb_query_duration_seconds_bucket{protocol="pg",le="0.1"} 1500000
thunderdb_query_duration_seconds_bucket{protocol="pg",le="1.0"} 1540000
thunderdb_query_duration_seconds_bucket{protocol="pg",le="10.0"} 1542893
thunderdb_query_duration_seconds_bucket{protocol="pg",le="+Inf"} 1542893
thunderdb_query_duration_seconds_sum{protocol="pg"} 4521.34
thunderdb_query_duration_seconds_count{protocol="pg"} 1542893
# HELP thunderdb_buffer_pool_hit_ratio Buffer pool cache hit ratio
# TYPE thunderdb_buffer_pool_hit_ratio gauge
thunderdb_buffer_pool_hit_ratio 0.9847
# HELP thunderdb_buffer_pool_pages_total Total pages in buffer pool
# TYPE thunderdb_buffer_pool_pages_total gauge
thunderdb_buffer_pool_pages_total{state="clean"} 7234
thunderdb_buffer_pool_pages_total{state="dirty"} 512
thunderdb_buffer_pool_pages_total{state="free"} 446
# HELP thunderdb_wal_size_bytes Current WAL size in bytes
# TYPE thunderdb_wal_size_bytes gauge
thunderdb_wal_size_bytes 134217728
# HELP thunderdb_connections_active Number of active client connections
# TYPE thunderdb_connections_active gauge
thunderdb_connections_active{protocol="pg"} 42
thunderdb_connections_active{protocol="mysql"} 15
thunderdb_connections_active{protocol="resp"} 128
thunderdb_connections_active{protocol="http"} 3
thunderdb_connections_active{protocol="grpc"} 8
# HELP thunderdb_replication_lag_seconds Replication lag from leader in seconds
# TYPE thunderdb_replication_lag_seconds gauge
thunderdb_replication_lag_seconds{peer="node-2"} 0.003
thunderdb_replication_lag_seconds{peer="node-3"} 0.005
# HELP thunderdb_transactions_total Total transactions
# TYPE thunderdb_transactions_total counter
thunderdb_transactions_total{status="committed"} 987654
thunderdb_transactions_total{status="aborted"} 1234
# HELP thunderdb_checkpoint_duration_seconds Time taken for last checkpoint
# TYPE thunderdb_checkpoint_duration_seconds gauge
thunderdb_checkpoint_duration_seconds 2.34
# HELP thunderdb_regions_total Number of data regions
# TYPE thunderdb_regions_total gauge
thunderdb_regions_total{node="1"} 128
thunderdb_regions_total{node="2"} 125
thunderdb_regions_total{node="3"} 127
# HELP thunderdb_raft_term Current Raft term
# TYPE thunderdb_raft_term gauge
thunderdb_raft_term 5
# HELP thunderdb_compaction_pending Number of pending compaction tasks
# TYPE thunderdb_compaction_pending gauge
thunderdb_compaction_pending 3
Key Metrics Reference
| Metric | Type | Description |
|---|---|---|
| thunderdb_query_total | counter | Total queries executed, labeled by protocol and status. |
| thunderdb_query_duration_seconds | histogram | Query execution latency distribution. |
| thunderdb_buffer_pool_hit_ratio | gauge | Ratio of page reads served from the buffer pool (target: >0.95). |
| thunderdb_buffer_pool_pages_total | gauge | Buffer pool page counts by state (clean, dirty, free). |
| thunderdb_wal_size_bytes | gauge | Current size of the WAL on disk. |
| thunderdb_connections_active | gauge | Active connections per protocol. |
| thunderdb_replication_lag_seconds | gauge | Replication lag from leader to each follower. |
| thunderdb_transactions_total | counter | Transactions by outcome (committed, aborted). |
| thunderdb_checkpoint_duration_seconds | gauge | Duration of the most recent checkpoint. |
| thunderdb_regions_total | gauge | Number of data regions per node. |
| thunderdb_raft_term | gauge | Current Raft consensus term. |
| thunderdb_compaction_pending | gauge | Pending background compaction tasks. |
| thunderdb_disk_usage_bytes | gauge | Disk usage by category (data, wal, temp). |
| thunderdb_memory_usage_bytes | gauge | Memory usage by component (buffer_pool, wal_buffer, query). |
| thunderdb_slow_queries_total | counter | Count of queries exceeding the slow query threshold. |
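The raw counters and histograms above become most useful once turned into rates and quantiles. The recording rules below are an illustrative sketch, not part of the shipped configuration, showing how the common derived signals can be precomputed in Prometheus; the rule names and time windows are arbitrary choices.
# Illustrative recording rules built from the metrics above (not shipped).
groups:
  - name: thunderdb.derived
    rules:
      # Queries per second, per protocol.
      - record: thunderdb:qps:rate5m
        expr: sum by (protocol) (rate(thunderdb_query_total[5m]))
      # p99 query latency, per protocol.
      - record: thunderdb:query_latency_seconds:p99
        expr: histogram_quantile(0.99, sum by (protocol, le) (rate(thunderdb_query_duration_seconds_bucket[5m])))
      # Cluster-wide fraction of queries that return an error.
      - record: thunderdb:query_error_ratio:rate5m
        expr: sum(rate(thunderdb_query_total{status="error"}[5m])) / sum(rate(thunderdb_query_total[5m]))
Grafana panels and alert expressions can then reference the recorded series instead of repeating the full queries.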
Prometheus Scrape Configuration
ThunderDB ships with a ready-to-use Prometheus configuration in deploy/prometheus/.
prometheus.yml
# deploy/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: "thunderdb"
    metrics_path: /admin/metrics
    scrape_interval: 10s
    scrape_timeout: 5s
    # For static deployments:
    static_configs:
      - targets:
          - "thunderdb-1:8088"
          - "thunderdb-2:8088"
          - "thunderdb-3:8088"
        labels:
          cluster: "production"
    # For Kubernetes deployments, replace static_configs with:
    # kubernetes_sd_configs:
    #   - role: pod
    #     namespaces:
    #       names:
    #         - thunderdb
    # relabel_configs:
    #   - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    #     action: keep
    #     regex: "true"
    #   - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    #     action: replace
    #     regex: ([^:]+)(?::\d+)?;(\d+)
    #     replacement: $1:$2
    #     target_label: __address__
    #   - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    #     action: replace
    #     target_label: __metrics_path__
    #     regex: (.+)
Running Prometheus
docker run -d \
--name prometheus \
-p 9091:9090 \
-v $(pwd)/deploy/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v $(pwd)/deploy/prometheus/rules:/etc/prometheus/rules:ro \
prom/prometheus:latest
Grafana Dashboards
ThunderDB provides pre-built Grafana dashboards in deploy/grafana/. These dashboards give you immediate visibility into cluster health, query performance, storage utilization, and replication status.
Available Dashboards
| Dashboard | File | Description |
|---|---|---|
| Cluster Overview | deploy/grafana/dashboards/cluster-overview.json | High-level cluster health, node status, region distribution. |
| Query Performance | deploy/grafana/dashboards/query-performance.json | Query latency percentiles, throughput, slow queries by protocol. |
| Storage | deploy/grafana/dashboards/storage.json | Buffer pool hit rate, WAL size, disk usage, compaction status. |
| Replication | deploy/grafana/dashboards/replication.json | Replication lag, Raft term changes, leader elections. |
| Connections | deploy/grafana/dashboards/connections.json | Active connections by protocol, connection rate, errors. |
Setting Up Grafana
docker run -d \
--name grafana \
-p 3000:3000 \
-v $(pwd)/deploy/grafana/provisioning:/etc/grafana/provisioning:ro \
-v $(pwd)/deploy/grafana/dashboards:/var/lib/grafana/dashboards:ro \
-e GF_SECURITY_ADMIN_PASSWORD=admin \
grafana/grafana:latest
Grafana is accessible at http://localhost:3000 (default credentials: admin/admin).
Provisioning Configuration
The provisioning directory automatically configures the Prometheus data source and dashboard imports:
# deploy/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
# deploy/grafana/provisioning/dashboards/thunderdb.yml
apiVersion: 1
providers:
  - name: ThunderDB
    orgId: 1
    folder: ThunderDB
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
Key Dashboard Panels
Cluster Overview dashboard includes:
- Cluster status indicator (healthy/degraded/critical)
- Node status table with uptime, role, and region count
- Total queries per second across all protocols
- Average query latency (p50, p95, p99)
- Buffer pool hit ratio gauge
- Active connections count
Query Performance dashboard includes:
- Query throughput by protocol (QPS)
- Query latency histograms (p50, p95, p99, p99.9)
- Slow query count over time
- Query error rate
- Top slow queries table
- Query type distribution (SELECT, INSERT, UPDATE, DELETE)
Health Check Endpoints
ThunderDB exposes three health check endpoints for load balancers, orchestrators, and monitoring systems.
GET /admin/health
Returns the overall health status of the node, including subsystem checks.
curl http://localhost:8088/admin/health
Response (healthy):
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 86400,
  "node_id": 1,
  "cluster_role": "leader",
  "checks": {
    "storage": "ok",
    "wal": "ok",
    "raft": "ok",
    "buffer_pool": "ok"
  }
}
Response (degraded):
{
  "status": "degraded",
  "version": "0.1.0",
  "uptime_seconds": 86400,
  "node_id": 2,
  "cluster_role": "follower",
  "checks": {
    "storage": "ok",
    "wal": "ok",
    "raft": "degraded: replication lag 5.2s",
    "buffer_pool": "ok"
  }
}
HTTP status codes:
- 200 OK – Node is healthy.
- 503 Service Unavailable – Node is unhealthy or degraded.
GET /admin/live
Liveness probe. Returns 200 OK if the process is running and responsive. Used by Kubernetes liveness probes to determine if the pod should be restarted.
curl http://localhost:8088/admin/live
Response:
{
"status": "alive"
}
GET /admin/ready
Readiness probe. Returns 200 OK if the node is ready to serve traffic (storage initialized, WAL recovered, cluster joined). Used by Kubernetes readiness probes and load balancers.
curl http://localhost:8088/admin/ready
Response (ready):
{
  "status": "ready",
  "storage_initialized": true,
  "wal_recovered": true,
  "cluster_joined": true,
  "regions_loaded": 128
}
Response (not ready):
{
  "status": "not_ready",
  "storage_initialized": true,
  "wal_recovered": true,
  "cluster_joined": false,
  "regions_loaded": 0
}
HTTP status codes:
- 200 OK – Node is ready to serve traffic.
- 503 Service Unavailable – Node is not ready.
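On Kubernetes, /admin/live and /admin/ready map directly onto liveness and readiness probes. A minimal sketch of the relevant container spec fragment is shown below, assuming the admin API listens on port 8088 as above; the probe timings are placeholder values to tune for your environment.
# Probe configuration fragment for a ThunderDB container (timings are illustrative).
livenessProbe:
  httpGet:
    path: /admin/live
    port: 8088
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /admin/ready
    port: 8088
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3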
Logging
ThunderDB produces structured logs that can be consumed by log aggregation systems such as the ELK stack, Loki, or Splunk.
Log Levels
| Level | Description | Use Case |
|---|---|---|
| trace | Very detailed internal tracing | Deep debugging of specific subsystems |
| debug | Detailed operational information | Development and troubleshooting |
| info | Normal operational events | Production default |
| warn | Potentially problematic situations | Slow queries, approaching limits |
| error | Error conditions | Failed operations, connectivity issues |
Structured Log Format
When format = "json" is configured, logs are emitted as JSON lines:
{"timestamp":"2026-01-15T10:30:45.123Z","level":"info","target":"thunderdb::server","message":"Server started","node_id":1,"pg_port":5432,"mysql_port":3306,"resp_port":6379,"http_port":8088,"grpc_port":9090}
{"timestamp":"2026-01-15T10:30:45.456Z","level":"info","target":"thunderdb::cluster","message":"Cluster joined","node_id":1,"cluster_name":"production","role":"follower","term":1}
{"timestamp":"2026-01-15T10:30:46.789Z","level":"info","target":"thunderdb::cluster","message":"Leader elected","node_id":1,"leader_id":1,"term":2}
{"timestamp":"2026-01-15T10:31:15.012Z","level":"warn","target":"thunderdb::query","message":"Slow query detected","duration_ms":2345,"protocol":"pg","query":"SELECT * FROM orders JOIN products ON ...","client":"10.0.1.50:54321"}
When format = "text" is configured:
2026-01-15T10:30:45.123Z INFO thunderdb::server: Server started node_id=1 pg_port=5432 mysql_port=3306
2026-01-15T10:30:45.456Z INFO thunderdb::cluster: Cluster joined node_id=1 cluster_name=production role=follower
2026-01-15T10:31:15.012Z WARN thunderdb::query: Slow query detected duration_ms=2345 protocol=pg
Slow Query Log
When slow_query_enabled = true, queries exceeding slow_query_threshold are logged at WARN level with full query text, execution time, client address, and protocol:
{
  "timestamp": "2026-01-15T10:31:15.012Z",
  "level": "warn",
  "target": "thunderdb::query::slow",
  "message": "Slow query detected",
  "duration_ms": 2345,
  "protocol": "pg",
  "query": "SELECT o.id, p.name, SUM(o.quantity) FROM orders o JOIN products p ON o.product_id = p.id GROUP BY o.id, p.name HAVING SUM(o.quantity) > 100",
  "client": "10.0.1.50:54321",
  "rows_examined": 1500000,
  "rows_returned": 42,
  "plan": "HashJoin -> SeqScan(orders) + IndexScan(products)"
}
Runtime Log Level Changes
Change the log level at runtime without restarting:
# Via HTTP API
curl -X PUT http://localhost:8088/admin/config/log_level -d '{"level": "debug"}'
# Via systemd reload
sudo systemctl reload thunderdb
# Via environment variable (requires restart)
THUNDERDB_LOG_LEVEL=debug
Log Rotation
When running under systemd, logs go to the journal and are rotated automatically. For file-based logging, configure logrotate:
# /etc/logrotate.d/thunderdb
/var/log/thunderdb/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 thunder thunder
    postrotate
        systemctl reload thunderdb
    endscript
}
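If you aggregate logs with Loki, Promtail can tail the JSON log files and promote fields such as the log level to labels. The configuration below is an illustrative sketch only; the log path matches the logrotate example above, but none of this ships with ThunderDB.
# promtail-config.yml (illustrative sketch)
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: thunderdb
    static_configs:
      - targets: [localhost]
        labels:
          job: thunderdb
          __path__: /var/log/thunderdb/*.log
    pipeline_stages:
      # Parse each JSON log line and promote the level field to a Loki label.
      - json:
          expressions:
            level: level
      - labels:
          level: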
Alerting Rules
ThunderDB ships with recommended alerting rules based on the SLOs in deploy/slo.yaml. The rules are evaluated by Prometheus (via the rule_files entry in prometheus.yml above), and firing alerts are forwarded to Alertmanager for routing and notification.
Alert Rules Configuration
# deploy/prometheus/rules/thunderdb-alerts.yml
groups:
  - name: thunderdb.availability
    rules:
      - alert: ThunderDBDown
        expr: up{job="thunderdb"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ThunderDB node {{ $labels.instance }} is down"
          description: "The ThunderDB node has been unreachable for more than 1 minute."
      - alert: ThunderDBNotReady
        expr: thunderdb_ready == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ThunderDB node {{ $labels.instance }} is not ready"
          description: "The node has been in a not-ready state for more than 5 minutes."
  - name: thunderdb.performance
    rules:
      - alert: ThunderDBHighQueryLatency
        expr: histogram_quantile(0.99, rate(thunderdb_query_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 query latency on {{ $labels.instance }}"
          description: "The p99 query latency has exceeded 5 seconds for more than 10 minutes."
      - alert: ThunderDBLowBufferPoolHitRate
        expr: thunderdb_buffer_pool_hit_ratio < 0.90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low buffer pool hit rate on {{ $labels.instance }}"
          description: "Buffer pool hit rate is {{ $value }}, below the 0.90 threshold. Consider increasing buffer_pool_size."
      - alert: ThunderDBHighSlowQueryRate
        expr: rate(thunderdb_slow_queries_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High rate of slow queries on {{ $labels.instance }}"
          description: "More than 10 slow queries per second for the last 10 minutes."
  - name: thunderdb.storage
    rules:
      - alert: ThunderDBWALSizeHigh
        expr: thunderdb_wal_size_bytes > 0.8 * thunderdb_wal_max_size_bytes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WAL size approaching limit on {{ $labels.instance }}"
          description: "WAL is at {{ $value | humanize1024 }}, approaching the configured maximum."
      - alert: ThunderDBDiskSpaceLow
        expr: thunderdb_disk_usage_bytes / thunderdb_disk_total_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 85%. Consider expanding storage or archiving old data."
      - alert: ThunderDBDiskSpaceCritical
        expr: thunderdb_disk_usage_bytes / thunderdb_disk_total_bytes > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk usage is above 95%. Immediate action required."
      - alert: ThunderDBCompactionBacklog
        expr: thunderdb_compaction_pending > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Compaction backlog on {{ $labels.instance }}"
          description: "More than 50 pending compaction tasks. Consider increasing compaction_threads."
  - name: thunderdb.cluster
    rules:
      - alert: ThunderDBReplicationLagHigh
        expr: thunderdb_replication_lag_seconds > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag on {{ $labels.instance }}"
          description: "Replication lag to {{ $labels.peer }} is {{ $value }}s."
      - alert: ThunderDBReplicationLagCritical
        expr: thunderdb_replication_lag_seconds > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical replication lag on {{ $labels.instance }}"
          description: "Replication lag to {{ $labels.peer }} has exceeded 60 seconds."
      - alert: ThunderDBLeaderChanged
        expr: changes(thunderdb_raft_term[5m]) > 2
        labels:
          severity: warning
        annotations:
          summary: "Frequent Raft leader elections on {{ $labels.instance }}"
          description: "More than 2 leader elections in the last 5 minutes. Check network stability."
      - alert: ThunderDBClusterDegraded
        expr: count(up{job="thunderdb"} == 1) < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ThunderDB cluster is degraded"
          description: "Fewer than 3 nodes are healthy. Cluster may lose quorum."
  - name: thunderdb.connections
    rules:
      - alert: ThunderDBHighConnectionCount
        expr: sum(thunderdb_connections_active) by (instance) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
          description: "Active connections have exceeded 1000. Consider connection pooling."
      - alert: ThunderDBHighErrorRate
        expr: sum by (instance) (rate(thunderdb_query_total{status="error"}[5m])) / sum by (instance) (rate(thunderdb_query_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High query error rate on {{ $labels.instance }}"
          description: "More than 5% of queries are failing."
SLO Definitions
# deploy/slo.yaml
slos:
  - name: thunderdb-availability
    description: "ThunderDB cluster availability"
    target: 99.95%
    window: 30d
    indicator:
      type: availability
      query: "up{job='thunderdb'}"
  - name: thunderdb-latency
    description: "Query latency SLO"
    target: 99%
    window: 30d
    indicator:
      type: latency
      threshold: 500ms
      query: "histogram_quantile(0.99, rate(thunderdb_query_duration_seconds_bucket[5m]))"
  - name: thunderdb-error-rate
    description: "Query error rate SLO"
    target: 99.9%
    window: 30d
    indicator:
      type: error_rate
      query: "rate(thunderdb_query_total{status='error'}[5m]) / rate(thunderdb_query_total[5m])"
Recommended Monitoring Stack Setup
For a complete production monitoring setup, deploy the following stack alongside ThunderDB:
Architecture
ThunderDB Nodes ──> Prometheus ──> Grafana
      |                 |
      |                 v
      |            Alertmanager ──> PagerDuty/Slack/Email
      |
      v
Log Aggregation (Loki/ELK)
Quick Start with Docker Compose
Use the provided docker-compose.monitoring.yml or add the monitoring services to your existing compose file (see Deployment Guide).
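As an illustration of what such a compose file might contain, the sketch below mirrors the docker run commands earlier on this page; the actual docker-compose.monitoring.yml in the repository may differ, and the Alertmanager config path is an assumption.
# docker-compose.monitoring.yml (illustrative sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9091:9090"]
    volumes:
      - ./deploy/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./deploy/prometheus/rules:/etc/prometheus/rules:ro
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
    volumes:
      # Path is an assumption; point this at your Alertmanager configuration.
      - ./deploy/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./deploy/grafana/provisioning:/etc/grafana/provisioning:ro
      - ./deploy/grafana/dashboards:/var/lib/grafana/dashboards:ro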
Step-by-Step Setup
- Deploy Prometheus with the ThunderDB scrape configuration.
- Deploy Grafana with the provisioned data source and dashboards.
- Deploy Alertmanager with notification channels (Slack, PagerDuty, email).
- Import alert rules from deploy/prometheus/rules/.
- Verify that metrics are flowing by checking the Prometheus targets page.
- Set up log aggregation (Loki for Grafana, or Elasticsearch + Kibana) for centralized log analysis.
- Test alerting by simulating a failure (e.g., stopping a node).
Operational Runbook Integration
Combine monitoring with operational runbooks (see deploy/runbook.md) to ensure alerts are actionable. Each alert should link to a runbook entry with:
- What the alert means
- How to diagnose the root cause
- Step-by-step remediation procedures
- Escalation paths