Backup & Recovery
Data protection is a critical aspect of operating ThunderDB in production. This guide covers full backup procedures, incremental backups through WAL archiving, point-in-time recovery (PITR), and disaster recovery planning.
Backup Overview
ThunderDB supports multiple backup strategies that can be combined for comprehensive data protection:
| Strategy | RPO | RTO | Storage Cost | Complexity |
|---|---|---|---|---|
| Full backup | Time since the last full backup | Minutes to hours | High | Low |
| Incremental (WAL archiving) | Near-zero (last WAL flush) | Minutes | Low (deltas only) | Medium |
| Full + Incremental (PITR) | Near-zero | Minutes | Medium | Medium |
| Cross-region replication | Near-zero | Seconds (failover) | High | High |
RPO = Recovery Point Objective (how much data you can afford to lose). RTO = Recovery Time Objective (how quickly you need to recover).
Full Backup
A full backup captures the complete state of a ThunderDB node at a point in time. It includes the data directory and current WAL files.
Online Backup (Recommended)
ThunderDB supports online (hot) backups that do not require stopping the server:
# Trigger a consistent backup via the admin API
curl -X POST http://localhost:8088/admin/backup \
-H "Content-Type: application/json" \
-d '{
"destination": "/backups/thunderdb/full-2026-01-15",
"include_wal": true,
"compress": true
}'
Response:
{
"status": "started",
"backup_id": "bk-20260115-103045",
"destination": "/backups/thunderdb/full-2026-01-15",
"estimated_size_bytes": 5368709120
}
Monitor backup progress:
curl http://localhost:8088/admin/backup/status/bk-20260115-103045
Manual Full Backup
If you prefer a manual approach, follow these steps:
# 1. Force a checkpoint to flush dirty pages to disk
curl -X POST http://localhost:8088/admin/checkpoint
# 2. Copy the data directory
sudo -u thunder rsync -av --progress \
/var/lib/thunderdb/data/ \
/backups/thunderdb/full-$(date +%Y%m%d)/data/
# 3. Copy the WAL directory
sudo -u thunder rsync -av --progress \
/var/lib/thunderdb/wal/ \
/backups/thunderdb/full-$(date +%Y%m%d)/wal/
# 4. Copy the configuration
sudo cp /etc/thunderdb/thunderdb.toml \
/backups/thunderdb/full-$(date +%Y%m%d)/thunderdb.toml
Backup Contents
A full backup includes:
| Directory / File | Description |
|---|---|
| data/ | All data pages, indexes, and metadata files. |
| wal/ | Write-ahead log files active at backup time. |
| thunderdb.toml | Configuration file (for reference during restore). |
| backup_manifest.json | Metadata about the backup (timestamp, LSN, checksum). |
Verifying a Backup
Always verify backups after creation:
# Verify backup integrity
thunderdb --verify-backup /backups/thunderdb/full-2026-01-15
# Expected output:
# Backup verification: PASSED
# Backup time: 2026-01-15T10:30:45Z
# LSN: 0/1A3B4C5D
# Data pages: 8192 (verified)
# WAL segments: 12 (verified)
# Checksum: OK
Incremental Backup (WAL Archiving)
Incremental backups capture only the changes since the last full or incremental backup by archiving WAL (Write-Ahead Log) segments. This dramatically reduces backup storage and time.
Enabling WAL Archiving
Add the following to your configuration:
[storage]
# Enable WAL archiving
wal_archive_enabled = true
# Directory or command for archiving WAL segments
wal_archive_dir = "/backups/thunderdb/wal-archive"
# Alternative: archive via command (e.g., to S3)
# wal_archive_command = "aws s3 cp %f s3://my-bucket/thunderdb/wal/%n"
# Retain archived WAL segments for this duration
wal_archive_retention = "7d"
How WAL Archiving Works
- ThunderDB writes transactions to WAL segments (files of max_wal_size or less).
- When a WAL segment is full or a checkpoint occurs, the segment is archived (copied to the archive directory or processed by the archive command).
- Archived segments are retained according to wal_archive_retention.
- For recovery, archived WAL segments are replayed on top of a full backup (an illustrative sketch of the archive step follows this list).
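As an illustration only (not ThunderDB's actual implementation), the directory-based archive step for a single segment amounts to a copy followed by an atomic rename, so a partially copied file is never mistaken for a complete archive:
# Illustrative sketch of archiving one segment into wal_archive_dir
SEG=/var/lib/thunderdb/wal/000000010000000000000042   # example segment name from this guide
ARCHIVE=/backups/thunderdb/wal-archive
cp "$SEG" "$ARCHIVE/$(basename "$SEG").tmp"
mv "$ARCHIVE/$(basename "$SEG").tmp" "$ARCHIVE/$(basename "$SEG")"   # rename is atomic on the same filesystem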
Monitoring WAL Archiving
# Check archiving status
curl http://localhost:8088/admin/wal/archive/status
# Response:
# {
# "archiving_enabled": true,
# "last_archived_segment": "000000010000000000000042",
# "last_archive_time": "2026-01-15T10:30:45Z",
# "segments_pending": 0,
# "archive_rate_bytes_per_sec": 1048576,
# "total_archived_bytes": 536870912
# }
WAL Archive to Object Storage
For cloud deployments, archive WAL segments to object storage:
Amazon S3:
[storage]
wal_archive_command = "aws s3 cp %f s3://my-thunderdb-backups/wal/%n --storage-class STANDARD_IA"
wal_restore_command = "aws s3 cp s3://my-thunderdb-backups/wal/%n %f"
Google Cloud Storage:
[storage]
wal_archive_command = "gsutil cp %f gs://my-thunderdb-backups/wal/%n"
wal_restore_command = "gsutil cp gs://my-thunderdb-backups/wal/%n %f"
Azure Blob Storage:
[storage]
wal_archive_command = "az storage blob upload --file %f --container-name thunderdb-wal --name %n"
wal_restore_command = "az storage blob download --container-name thunderdb-wal --name %n --file %f"
In these commands:
- %f is replaced with the full path to the WAL segment file.
- %n is replaced with the WAL segment filename only.
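If a plain copy command is not enough (for example, you want retries or logging), wal_archive_command can point at a small wrapper script instead. The following is a sketch only; the script path, bucket, and retry behavior are assumptions, not part of ThunderDB:
#!/bin/bash
# Hypothetical wrapper, wired up in thunderdb.toml as:
#   wal_archive_command = "/usr/local/bin/archive-wal.sh %f %n"
# $1 receives the full segment path (%f), $2 the segment filename (%n).
set -euo pipefail
SEGMENT_PATH="$1"
SEGMENT_NAME="$2"
BUCKET="s3://my-thunderdb-backups/wal"   # assumed: same bucket as the S3 example above
for attempt in 1 2 3; do
  if aws s3 cp "$SEGMENT_PATH" "${BUCKET}/${SEGMENT_NAME}" --storage-class STANDARD_IA; then
    exit 0
  fi
  sleep $((attempt * 5))   # simple backoff before retrying
done
echo "Failed to archive ${SEGMENT_NAME} after 3 attempts" >&2
exit 1   # a non-zero exit should leave the segment pending so it can be retried later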
Point-in-Time Recovery (PITR)
PITR allows you to restore a ThunderDB instance to any specific point in time, provided you have a full backup and the WAL archive covering that time period.
Prerequisites
- A full backup taken before the target recovery time.
- All WAL archive segments from the backup time to the target recovery time (a quick pre-flight check is sketched below).
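Before starting recovery, a quick pre-flight check that both prerequisites are actually met can save time, for example:
# Verify the base backup, then confirm the archive contains segments written after it
# (paths match the examples in this guide).
thunderdb --verify-backup /backups/thunderdb/full-2026-01-15
ls -lt /backups/thunderdb/wal-archive | head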
Recovery Procedure
# Restore to a specific timestamp
thunderdb --restore-pitr "2026-01-15T14:30:00Z" \
--restore-backup /backups/thunderdb/full-2026-01-15
# Restore to the latest available point
thunderdb --restore-pitr "latest" \
--restore-backup /backups/thunderdb/full-2026-01-15
# Restore to a specific WAL LSN (Log Sequence Number)
thunderdb --restore-pitr "lsn:0/1A3B4C5D" \
--restore-backup /backups/thunderdb/full-2026-01-15
Step-by-Step PITR Process
1. Stop ThunderDB on the target node:
   sudo systemctl stop thunderdb
2. Clear the existing data and WAL directories (or use a fresh node):
   sudo rm -rf /var/lib/thunderdb/data/*
   sudo rm -rf /var/lib/thunderdb/wal/*
3. Restore from the full backup:
   sudo -u thunder cp -a /backups/thunderdb/full-2026-01-15/data/* /var/lib/thunderdb/data/
   sudo -u thunder cp -a /backups/thunderdb/full-2026-01-15/wal/* /var/lib/thunderdb/wal/
4. Run PITR recovery:
   thunderdb --restore-pitr "2026-01-15T14:30:00Z" \
     --restore-backup /backups/thunderdb/full-2026-01-15 \
     --wal-archive-dir /backups/thunderdb/wal-archive \
     --config /etc/thunderdb/thunderdb.toml
5. Start ThunderDB normally after recovery completes:
   sudo systemctl start thunderdb
6. Verify data integrity after recovery:
   curl http://localhost:8088/admin/health
   # Connect via psql and verify data
   psql -h localhost -U admin -c "SELECT COUNT(*) FROM your_table;"
Recovery Output
During PITR, ThunderDB logs the recovery progress:
[INFO] Starting Point-in-Time Recovery
[INFO] Target time: 2026-01-15T14:30:00Z
[INFO] Base backup LSN: 0/1A000000
[INFO] Phase 1: Restoring base backup... done (8192 pages)
[INFO] Phase 2: Replaying WAL segments...
[INFO] Replaying segment 000000010000000000000038... done
[INFO] Replaying segment 000000010000000000000039... done
[INFO] Replaying segment 00000001000000000000003A... done (target reached)
[INFO] Phase 3: Recovery complete
[INFO] Recovered to: 2026-01-15T14:30:00.000Z (LSN: 0/1A3B4C5D)
[INFO] Transactions recovered: 45,231
[INFO] Transactions rolled back: 12 (in-progress at recovery point)
WAL-Based Recovery (ARIES)
ThunderDB uses the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) recovery algorithm, which provides crash recovery through a three-phase process. This happens automatically on startup after an unclean shutdown.
Phase 1: Analysis
The analysis phase scans the WAL from the last checkpoint to determine:
- Which pages were dirty (modified but not flushed) at the time of the crash.
- Which transactions were active (not yet committed or aborted).
- The starting point for the redo phase.
Phase 2: Redo
The redo phase replays all WAL records from the analysis starting point forward, reapplying all changes to bring the database to the exact state it was in at the time of the crash. This includes changes made by transactions that will later be undone.
Phase 3: Undo
The undo phase rolls back all transactions that were active (not committed) at the time of the crash. It processes undo records in reverse order, ensuring the database only contains the effects of committed transactions.
Monitoring Recovery
Recovery progress is logged during startup:
[INFO] Crash recovery initiated
[INFO] Last checkpoint LSN: 0/1A000000
[INFO] Phase 1 (Analysis): Scanning WAL... 42 dirty pages, 3 active transactions
[INFO] Phase 2 (Redo): Replaying from LSN 0/1A000000... 1,234 records replayed
[INFO] Phase 3 (Undo): Rolling back 3 active transactions... done
[INFO] Recovery complete in 2.34s
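If ThunderDB runs under systemd (as in the systemctl examples in this guide), the same phase markers can be pulled out of the journal after the fact. A small sketch, assuming logs go to journald under the thunderdb unit:
# Extract the recovery phases and total time from a recent startup
journalctl -u thunderdb --since "1 hour ago" | grep -E "Phase [123]|Recovery complete"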
Recovery Tuning
For large databases where recovery time is critical:
[storage]
# More frequent checkpoints reduce the amount of WAL to replay
checkpoint_interval = "30s"
# Smaller max WAL size limits recovery scope
max_wal_size = "512MB"
Backup Scheduling Best Practices
Recommended Backup Schedule
| Backup Type | Frequency | Retention | Purpose |
|---|---|---|---|
| Full backup | Daily (off-peak hours) | 7 days | Base for PITR, standalone restore |
| WAL archiving | Continuous | 7 days | Incremental changes for PITR |
| Full backup (weekly) | Weekly (Sunday night) | 30 days | Longer-term recovery |
| Full backup (monthly) | Monthly | 12 months | Compliance and archival |
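One way to drive this schedule is plain cron. This is a sketch only: the daily entry corresponds to the automated script shown below, while the weekly and monthly wrappers are hypothetical variants that would write to their own destinations with longer retention:
# /etc/cron.d/thunderdb-backups (sketch; the user, paths, and wrapper names are assumptions)
# m  h  dom mon dow  user     command
0    2  *   *   *    thunder  /usr/local/bin/thunderdb-backup.sh
0    3  *   *   0    thunder  /usr/local/bin/thunderdb-backup-weekly.sh
0    4  1   *   *    thunder  /usr/local/bin/thunderdb-backup-monthly.sh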
Automated Backup Script
#!/bin/bash
# /usr/local/bin/thunderdb-backup.sh
# Run daily at 2:00 AM, e.g.: 0 2 * * * /usr/local/bin/thunderdb-backup.sh
set -euo pipefail
BACKUP_DIR="/backups/thunderdb"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=7
echo "[$(date)] Starting ThunderDB backup..."
# Trigger online backup
RESPONSE=$(curl -s -X POST http://localhost:8088/admin/backup \
-H "Content-Type: application/json" \
-d "{\"destination\": \"${BACKUP_DIR}/full-${DATE}\", \"include_wal\": true, \"compress\": true}")
BACKUP_ID=$(echo "$RESPONSE" | jq -r '.backup_id')
echo "[$(date)] Backup started: $BACKUP_ID"
# Wait for backup to complete
while true; do
STATUS=$(curl -s "http://localhost:8088/admin/backup/status/${BACKUP_ID}" | jq -r '.status')
if [ "$STATUS" = "completed" ]; then
echo "[$(date)] Backup completed successfully"
break
elif [ "$STATUS" = "failed" ]; then
echo "[$(date)] ERROR: Backup failed!"
exit 1
fi
sleep 10
done
# Verify backup
thunderdb --verify-backup "${BACKUP_DIR}/full-${DATE}"
# Remove old backups
find "${BACKUP_DIR}" -maxdepth 1 -name "full-*" -mtime +${RETENTION_DAYS} -exec rm -rf {} \;
echo "[$(date)] Cleaned up backups older than ${RETENTION_DAYS} days"
echo "[$(date)] Backup process complete"
Backup Monitoring
Add alerts for backup failures:
# deploy/prometheus/rules/thunderdb-backup-alerts.yml
groups:
  - name: thunderdb.backup
    rules:
      - alert: ThunderDBBackupFailed
        expr: thunderdb_last_backup_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ThunderDB backup failed on {{ $labels.instance }}"
      - alert: ThunderDBBackupStale
        expr: time() - thunderdb_last_backup_timestamp > 86400 * 2
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful backup in 48 hours on {{ $labels.instance }}"
      - alert: ThunderDBWALArchiveLag
        expr: thunderdb_wal_archive_pending_segments > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "WAL archiving is falling behind on {{ $labels.instance }}"
Disaster Recovery Planning
Recovery Scenarios
| Scenario | Strategy | Expected RTO |
|---|---|---|
| Single node failure (cluster) | Automatic Raft failover | Seconds |
| Data corruption on one node | Restore from backup or re-replicate from peers | Minutes |
| Full cluster failure (same region) | Restore all nodes from backup + WAL archive | 30-60 minutes |
| Regional disaster | Failover to cross-region replica | Minutes |
| Accidental data deletion | PITR to before deletion timestamp | 15-30 minutes |
Cross-Region Backup Strategy
For maximum disaster resilience, maintain backups in a different geographic region:
Primary Region (us-east-1) Backup Region (us-west-2)
+-------------------+ +-------------------+
| ThunderDB Cluster | -- WAL --> | S3 Bucket |
| (3 nodes) | archiving | (WAL archive) |
| | | |
| Daily full backup | -- sync --> | S3 Bucket |
| (local disk) | | (full backups) |
+-------------------+ +-------------------+
Implementation:
# Sync local backups to a remote region
aws s3 sync /backups/thunderdb/ s3://thunderdb-backups-us-west-2/ \
--storage-class STANDARD_IA \
--region us-west-2
# Configure WAL archiving to the remote region
# In thunderdb.toml:
# wal_archive_command = "aws s3 cp %f s3://thunderdb-backups-us-west-2/wal/%n --region us-west-2"
Disaster Recovery Runbook
- Assess the failure scope: Determine whether it is a single node, full cluster, or regional failure.
- For single node failure: If the cluster has quorum, the node will recover automatically. Otherwise, restore from backup.
- For full cluster failure (see the scripted sketch after this runbook):
  a. Provision new infrastructure (or use standby).
  b. Restore the most recent full backup on each node.
  c. Apply the WAL archive to bring all nodes to the same point.
  d. Start the cluster and verify data integrity.
  e. Update DNS/load balancer to point to the recovered cluster.
- For accidental deletion: Use PITR to recover to a timestamp just before the deletion.
- Post-recovery: Verify data integrity, check replication status, resume backups.
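For the full-cluster case, the per-node restore can be scripted. The following is a sketch under stated assumptions: three nodes reachable over SSH (placeholder hostnames), with the latest full backup and the WAL archive already staged at the paths used elsewhere in this guide:
#!/bin/bash
# Sketch: restore every node of a failed cluster from a full backup plus the WAL archive.
# Hostnames, backup paths, and the staging of backups onto each node are assumptions.
set -euo pipefail
NODES="thunder-1 thunder-2 thunder-3"
for node in $NODES; do
  echo "Restoring $node..."
  ssh "$node" "sudo systemctl stop thunderdb || true"
  ssh "$node" "sudo -u thunder thunderdb --restore-pitr latest \
    --restore-backup /backups/thunderdb/full-latest \
    --wal-archive-dir /backups/thunderdb/wal-archive \
    --config /etc/thunderdb/thunderdb.toml"
  ssh "$node" "sudo systemctl start thunderdb"
done
# Check health on every node before repointing DNS or the load balancer.
for node in $NODES; do
  curl -fsS "http://$node:8088/admin/health" && echo " $node healthy"
done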
Testing Recovery
Regularly test your recovery procedures (at least quarterly):
# 1. Provision a test environment
# 2. Restore from the latest production backup
thunderdb --restore-pitr "latest" \
--restore-backup /backups/thunderdb/full-latest \
--config /etc/thunderdb/thunderdb-test.toml
# 3. Start the test instance
thunderdb --config /etc/thunderdb/thunderdb-test.toml
# 4. Run validation queries to verify data integrity
psql -h localhost -p 15432 -U admin -f /scripts/recovery-validation.sql
# 5. Document the recovery time and any issues
Record the results of each recovery test, including actual RTO, data completeness verification, and any problems encountered.