---
title: Sync Event Log
sidebarTitle: Sync Event Log
---
Spacedrive maintains a persistent event log of high-level sync operations for debugging and observability. Unlike the in-memory metrics system, these events survive app restarts and provide a queryable timeline of what happened during sync.
## Overview
The sync event log captures critical lifecycle events, data flow operations, and errors without overwhelming the database. Events are stored in `sync.db` and auto-pruned after 7 days.
**Key Features:**
- Persistent storage (survives restarts)
- Batch aggregation (10-1000x write reduction)
- Correlation IDs (track entire backfill sessions)
- Flexible query API (filter by type, time, peer, severity)
- Minimal overhead (~10 KB-2 MB on disk for 7 days, depending on sync activity)
## Event Types
### Lifecycle Events
**StateTransition** - State machine changes
```
Uninitialized → Backfilling → CatchingUp → Ready ⇄ Paused
```
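These states could be modeled as a simple enum. The following is only a sketch; the actual state machine type lives in the core sync service and is what `StateTransition` events record:
```rust
/// Sketch of the sync lifecycle shown above (not the real core type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncState {
    Uninitialized,
    Backfilling,
    CatchingUp,
    Ready,
    Paused,
}

impl SyncState {
    /// Happy-path successor from the diagram above.
    fn next(self) -> Option<SyncState> {
        match self {
            SyncState::Uninitialized => Some(SyncState::Backfilling),
            SyncState::Backfilling => Some(SyncState::CatchingUp),
            SyncState::CatchingUp => Some(SyncState::Ready),
            SyncState::Ready | SyncState::Paused => None, // Ready ⇄ Paused toggles
        }
    }
}
```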
**BackfillSessionStarted / Completed** - Initial sync sessions
- Includes correlation ID for tracing entire session
- Logs peer selection details with scoring
- Tracks duration and record counts
### Data Flow Events
**BatchIngestion** - Aggregated record applications
- Batched every 30 seconds or 10,000 records
- Per-model breakdown (entries, tags, locations, etc.)
- Per-peer attribution
- Duration tracking
Example:
```
Ingested batch of 45,000 records (25k entries, 15k tags, 5k locations) in 28s
```
### Critical Error Events
**Buffer Overflow** - When update buffer hits capacity (100k updates)
```json
{
  "event_type": "sync_error",
  "severity": "error",
  "summary": "Buffer overflow: 1,247 updates dropped during backfill",
  "details": {
    "dropped_count": 1247,
    "max_buffer_size": 100000,
    "phase": "backfill_to_catchup_transition"
  }
}
```
**Stuck Dependencies** - Orphaned records waiting for missing FK parents
```json
{
  "event_type": "sync_error",
  "severity": "warning",
  "summary": "Stuck dependencies: 12 records waiting for 3 missing parents",
  "details": {
    "total_dependencies": 3,
    "total_waiting_updates": 12,
    "phase": "backfill_to_ready_transition"
  }
}
```
**Watermark Too Old** - Forces full sync instead of incremental
```json
{
  "event_type": "sync_error",
  "severity": "warning",
  "summary": "Watermark too old: 35 days (threshold: 25 days), forcing full sync",
  "details": {
    "watermark_age_days": 35,
    "threshold_days": 25,
    "action": "force_full_sync"
  }
}
```
## Database Schema
Events are stored in the `sync_event_log` table in `sync.db`:
```sql
CREATE TABLE sync_event_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    device_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    category TEXT NOT NULL,   -- lifecycle, data_flow, network, error
    severity TEXT NOT NULL,   -- debug, info, warning, error
    summary TEXT NOT NULL,
    details TEXT,             -- JSON blob
    correlation_id TEXT,      -- Session tracking
    peer_device_id TEXT,
    model_types TEXT,         -- Comma-separated
    record_count INTEGER,
    duration_ms INTEGER
);
-- Indexes for fast queries
CREATE INDEX idx_sync_event_log_timestamp ON sync_event_log(timestamp);
CREATE INDEX idx_sync_event_log_type ON sync_event_log(event_type);
CREATE INDEX idx_sync_event_log_correlation ON sync_event_log(correlation_id);
```
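For illustration, a row of this table maps naturally onto a struct like the one below. This is a sketch mirroring the schema above, not the actual persistence type (which lives in `core/src/infra/sync/event_log/types.rs`):
```rust
/// Sketch of a sync_event_log row; nullable columns become Options.
struct SyncEventRow {
    id: i64,
    timestamp: String,             // ISO-8601, so string order is chronological
    device_id: String,
    event_type: String,
    category: String,              // lifecycle | data_flow | network | error
    severity: String,              // debug | info | warning | error
    summary: String,
    details: Option<String>,       // JSON blob
    correlation_id: Option<String>,
    peer_device_id: Option<String>,
    model_types: Option<String>,   // comma-separated
    record_count: Option<i64>,
    duration_ms: Option<i64>,
}
```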
## CLI Export
### Export Events from a Device
```bash
# Export as JSON (default)
sd sync events

# Export to file
sd sync events --output sync_events.json

# Export last hour only
sd sync events --since "1 hour ago" -o recent_events.json

# Export as SQL (for database import)
sd sync events --format sql -o events.sql

# Export as Markdown
sd sync events --format markdown -o events.md

# Filter by event type
sd sync events --event-type state_transition

# Filter by severity (errors only)
sd sync events --severity error

# Trace a specific backfill session
sd sync events --correlation-id abc-123-def

# Limit results
sd sync events --limit 500
```
### Cross-Device Timeline
To debug sync issues across multiple devices, export logs from each device and merge:
```bash
# On Device A
sd sync events --since "2 hours ago" --output device_a.json

# On Device B
sd sync events --since "2 hours ago" --output device_b.json

# Merge chronologically (requires jq)
jq -s 'add | sort_by(.timestamp)' device_a.json device_b.json > timeline.json

# View merged timeline (-r prints raw strings instead of JSON-quoted ones)
jq -r '.[] | "\(.timestamp) | \(.device_id[0:8]) | \(.event_type) | \(.summary)"' timeline.json
```
### SQL Import for Analysis
```bash
# Export as SQL from both devices
sd sync events --format sql -o device_a.sql
sd sync events --format sql -o device_b.sql

# Create combined database
sqlite3 timeline.db < device_a.sql
sqlite3 timeline.db < device_b.sql
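# Note: if the export does not include a CREATE TABLE statement,
# create sync_event_log first using the schema above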
# Query cross-device timeline
sqlite3 timeline.db "
  SELECT
    timestamp,
    substr(device_id, 1, 8) AS device,
    event_type,
    summary
  FROM sync_event_log
  ORDER BY timestamp
"

# Find matching backfill sessions across devices
sqlite3 timeline.db "
  SELECT
    correlation_id,
    COUNT(DISTINCT device_id) AS device_count,
    MIN(timestamp) AS started,
    MAX(timestamp) AS completed
  FROM sync_event_log
  WHERE correlation_id IS NOT NULL
  GROUP BY correlation_id
  ORDER BY started DESC
"
```
## Query API
### Rust API
```rust
use sd_core::infra::sync::{SyncEventQuery, SyncEventType, EventSeverity};

// Get all events from last hour
let events = sync_service.event_logger()
    .query(
        SyncEventQuery::new(library_id)
            .with_time_range(one_hour_ago, now),
    )
    .await?;

// Get critical errors only
let errors = sync_service.event_logger()
    .query(
        SyncEventQuery::new(library_id)
            .with_severities(vec![EventSeverity::Error])
            .with_limit(100),
    )
    .await?;

// Trace entire backfill session
let session_events = sync_service.event_logger()
    .query(
        SyncEventQuery::new(library_id)
            .with_correlation_id(session_id),
    )
    .await?;
```
### rspc API
Frontend access via `sync.eventLog` query:
```typescript
const events = await client.query(['sync.eventLog', {
  event_types: ['state_transition', 'backfill_session_started'],
  start_time: oneHourAgo,
  limit: 100
}]);

// Filter by correlation ID
const sessionEvents = await client.query(['sync.eventLog', {
  correlation_id: sessionId
}]);

// Get only errors
const errors = await client.query(['sync.eventLog', {
  severities: ['error', 'warning'],
  limit: 50
}]);
```
### SQL Queries
Direct database queries for advanced debugging:
```sql
-- Find all backfill sessions
SELECT timestamp, summary, duration_ms, peer_device_id
FROM sync_event_log
WHERE event_type = 'backfill_session_completed'
ORDER BY timestamp DESC;

-- Trace session by correlation ID
SELECT timestamp, event_type, summary, record_count, duration_ms
FROM sync_event_log
WHERE correlation_id = 'abc-123-def'
ORDER BY timestamp;

-- Find buffer overflows
SELECT * FROM sync_event_log
WHERE summary LIKE '%Buffer overflow%'
ORDER BY timestamp DESC;

-- Check batch ingestion performance
SELECT timestamp, summary, record_count, duration_ms,
       CAST(record_count AS FLOAT) / (duration_ms / 1000.0) AS records_per_sec
FROM sync_event_log
WHERE event_type = 'batch_ingestion'
  AND duration_ms > 1000
ORDER BY timestamp DESC;

-- Peer selection patterns
SELECT timestamp, peer_device_id, details
FROM sync_event_log
WHERE event_type = 'backfill_session_started'
  AND details IS NOT NULL
ORDER BY timestamp DESC;
```
## Network Protocol
The system includes protocol messages for querying event logs from remote peers over the network:
**EventLogRequest** - Request events from a peer
```rust
EventLogRequest {
    library_id: Uuid,
    requesting_device: Uuid,
    since: Option<DateTime<Utc>>,
    event_types: Option<Vec<String>>,
    correlation_id: Option<Uuid>,
    limit: u32,
}
```
**EventLogResponse** - Peer's events
```rust
EventLogResponse {
    library_id: Uuid,
    responding_device: Uuid,
    events: Vec<serde_json::Value>,
}
```
These messages are routed through `SyncProtocolHandler` and handled by `LogSyncHandler::handle_event_log_request()`.
**Status**: Protocol foundation complete. Automatic cross-device fetching can be implemented when needed by creating a cross-device fetcher that sends requests to all online peers and merges responses.
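A minimal sketch of such a fetcher, assuming a hypothetical transport trait (in the real system, requests would be routed through `SyncProtocolHandler`):
```rust
use serde_json::Value;
use uuid::Uuid;

/// Hypothetical transport abstraction standing in for the real
/// SyncProtocolHandler request path.
trait EventLogTransport {
    fn request_event_log(&self, peer: Uuid, limit: u32) -> Vec<Value>;
}

/// Fan an EventLogRequest out to every online peer and merge the
/// responses into a single chronological timeline.
fn fetch_cross_device_timeline<T: EventLogTransport>(
    transport: &T,
    online_peers: &[Uuid],
    limit: u32,
) -> Vec<Value> {
    let mut events: Vec<Value> = online_peers
        .iter()
        .flat_map(|peer| transport.request_event_log(*peer, limit))
        .collect();
    // Timestamps are ISO-8601 strings, so string order is chronological.
    events.sort_by(|a, b| {
        a["timestamp"]
            .as_str()
            .unwrap_or("")
            .cmp(b["timestamp"].as_str().unwrap_or(""))
    });
    events
}
```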
## Debugging Scenarios
### "Why did sync fail?"
```sql
-- Get recent errors
SELECT timestamp, summary, details
FROM sync_event_log
WHERE severity IN ('error', 'warning')
ORDER BY timestamp DESC
LIMIT 20;
```
Common errors:
- **Buffer overflow**: Device offline during heavy sync, buffer filled and dropped updates
- **Stuck dependencies**: Records waiting for FK parents that haven't synced yet
- **Watermark too old**: Device offline > 25 days, forced full resync
### "Why is backfill slow?"
```sql
-- Find the backfill session
SELECT correlation_id, timestamp, duration_ms
FROM sync_event_log
WHERE event_type = 'backfill_session_completed'
ORDER BY timestamp DESC LIMIT 1;

-- Analyze batch ingestion during that session
SELECT timestamp, summary, record_count, duration_ms
FROM sync_event_log
WHERE correlation_id = '<correlation_id_from_above>'
  AND event_type = 'batch_ingestion'
ORDER BY timestamp;
```
Look for:
- Decreasing records/second over time → FK dependency bottleneck
- Long gaps between batches → Network issues or peer overload
- High duration_ms → Database performance issue
### "Which peer was selected and why?"
```sql
SELECT timestamp, peer_device_id, details
FROM sync_event_log
WHERE event_type = 'backfill_session_started'
AND details IS NOT NULL
ORDER BY timestamp DESC LIMIT 1;
```
Details JSON shows all candidates with scores based on:
- `is_online` - Peer reachability
- `has_complete_state` - Has full library data
- `latency_ms` - Network round-trip time
- `active_syncs` - Current load
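The exact weighting is an implementation detail of the backfill peer selector, but purely as an illustration, a score over these fields might combine them like this:
```rust
/// Illustrative only: the real weights live in the backfill peer selector.
struct PeerCandidate {
    is_online: bool,
    has_complete_state: bool,
    latency_ms: u32,
    active_syncs: u32,
}

fn score(c: &PeerCandidate) -> f64 {
    if !c.is_online {
        return 0.0; // unreachable peers are never selected
    }
    // Prefer peers holding the full library.
    let completeness = if c.has_complete_state { 1.0 } else { 0.5 };
    // Penalize round-trip time and current load.
    let latency_factor = 1.0 / (1.0 + c.latency_ms as f64 / 100.0);
    let load_factor = 1.0 / (1.0 + c.active_syncs as f64);
    completeness * latency_factor * load_factor
}
```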
### "Did we lose data?"
```sql
-- Check for buffer overflows
SELECT timestamp, details->>'$.dropped_count' AS dropped
FROM sync_event_log
WHERE summary LIKE '%Buffer overflow%'
ORDER BY timestamp DESC;

-- Check for stuck dependencies
SELECT timestamp, details->>'$.total_waiting_updates' AS waiting
FROM sync_event_log
WHERE summary LIKE '%Stuck dependencies%'
ORDER BY timestamp DESC;
```
## Batch Aggregation
To prevent database flooding, the system aggregates batch ingestion events:
**Flush Triggers:**
- **Time-based**: Every 30 seconds
- **Size-based**: 10,000 records accumulated
- **State change**: When transitioning between sync states
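A minimal sketch of the flush decision, using hypothetical field names (the real logic lives in `aggregator.rs`):
```rust
use std::time::{Duration, Instant};

const FLUSH_INTERVAL: Duration = Duration::from_secs(30);
const FLUSH_THRESHOLD: usize = 10_000;

/// Hypothetical shape of the aggregator's pending state.
struct PendingBatch {
    record_count: usize,
    opened_at: Instant,
}

impl PendingBatch {
    /// Flush on a state change, on the record threshold, or on the timer.
    fn should_flush(&self, sync_state_changed: bool) -> bool {
        sync_state_changed
            || self.record_count >= FLUSH_THRESHOLD
            || self.opened_at.elapsed() >= FLUSH_INTERVAL
    }
}
```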
**Example:** During backfill, 100,000 entries arrive in 10,000-record batches. Instead of 100,000 event log rows, we write 10 batch events:
```
10:15:00 | BatchIngestion | 10,000 records
10:15:30 | BatchIngestion | 10,000 records
10:16:00 | BatchIngestion | 10,000 records
...
```
Each batch event includes:
- Model type breakdown
- Total record count
- Processing duration
- Source peer
## Correlation IDs
Session correlation IDs link all events from a single backfill or catch-up operation:
```
Session abc-123:
├── BackfillSessionStarted (10:15:00)
├── BatchIngestion (10:15:32)
├── BatchIngestion (10:16:02)
├── BatchIngestion (10:16:32)
└── BackfillSessionCompleted (10:16:45)
```
Query all events for a session:
```sql
SELECT * FROM sync_event_log
WHERE correlation_id = 'abc-123'
ORDER BY timestamp;
```
## Retention & Cleanup
Events are automatically pruned after **7 days** to keep `sync.db` small.
Cleanup runs hourly via the unified pruning task (same schedule as tombstone pruning).
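As a sketch of the cutoff the hourly task enforces (the exact statement lives in the pruning task; the comparison mirrors the retention query in Troubleshooting below):
```rust
/// Illustrative pruning statement; the real task runs on the same
/// hourly schedule as tombstone pruning.
const PRUNE_EVENTS_SQL: &str =
    "DELETE FROM sync_event_log WHERE timestamp < date('now', '-7 days')";
```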
Expected storage:
- Normal operation: ~15-50 events/day = ~100-350 rows/week
- Heavy backfill: ~500-2000 events/day during sync
- 7-day retention: ~100-14,000 rows = ~10KB-2MB
## Performance
**Write Performance:**
- Async writes don't block sync operations
- Batch aggregation reduces writes by 10-1000x
- Errors in event logging are swallowed (won't crash sync)
**Query Performance:**
- Indexes on timestamp, event type, and correlation ID cover the common query patterns
- Typical query: <50ms for 1000 events
- Complex filters: <200ms
**Memory Overhead:**
- BatchAggregator: ~100KB for pending batches
- No in-memory caching of historical events (query on demand)
## Event Categories
Events are categorized for filtering:
- **Lifecycle**: State transitions, session start/complete
- **Data Flow**: Batch ingestion, record processing
- **Network**: Peer connections, request/response flow
- **Error**: Failures, warnings, critical issues
## Event Severity
- **Debug**: Detailed flow information (batch ingestion)
- **Info**: Normal operations (state transitions, sessions)
- **Warning**: Attention needed (stuck dependencies, old watermarks)
- **Error**: Critical failures (buffer overflow, backfill failures)
## Implementation Details
**Core Infrastructure:**
- `core/src/infra/sync/event_log/types.rs` - Event type definitions
- `core/src/infra/sync/event_log/logger.rs` - Database persistence + query
- `core/src/infra/sync/event_log/aggregator.rs` - Batch aggregation
- `core/src/infra/sync/event_log/query.rs` - Query builder
**Emission Points:**
- `core/src/service/sync/metrics/collector.rs` - State transitions
- `core/src/service/sync/backfill.rs` - Backfill sessions, peer selection
- `core/src/service/sync/peer.rs` - Buffer overflows, stuck dependencies
- `core/src/service/sync/state.rs` - Buffer drop tracking
**Integration:**
- `core/src/service/sync/mod.rs` - Logger initialization, cleanup
- `core/src/ops/sync/get_event_log/` - rspc API endpoint
- Table creation in `core/src/infra/sync/peer_log.rs`
## Testing
Run the event log test suite:
```bash
cargo test -p sd-core --test sync_event_log_test
```
Tests verify:
- State transitions are logged
- Correlation IDs work correctly
- Query API filtering
- Retention cleanup
- Persistence across restarts
- Batch aggregation
## Export Formats
### JSON Format
Structured output for programmatic processing:
```json
[
  {
    "timestamp": "2025-12-03T10:15:00Z",
    "event_type": "StateTransition",
    "severity": "Info",
    "summary": "Ready → Backfilling",
    "details": { "reason": "new device joined" },
    "correlation_id": null,
    "duration_ms": null
  },
  {
    "timestamp": "2025-12-03T10:15:32Z",
    "event_type": "BatchIngestion",
    "severity": "Debug",
    "summary": "Ingested batch of 45,000 records (25k entries, 15k tags, 5k locations) in 28s",
    "record_count": 45000,
    "duration_ms": 28000
  }
]
```
### SQL Format
INSERT statements for database import:
```sql
-- Sync Event Log Export
-- Generated: 2025-12-03T10:20:00Z
INSERT INTO sync_event_log (...) VALUES (...);
INSERT INTO sync_event_log (...) VALUES (...);
```
### Markdown Format
Human-readable table format:
```markdown
# Sync Event Log

**Exported**: 2025-12-03T10:20:00Z
**Event Count**: 15

| Timestamp | Event Type | Severity | Summary |
|-----------|------------|----------|---------|
| 2025-12-03 10:15:00 | StateTransition | Info | Ready → Backfilling |
| 2025-12-03 10:15:32 | BatchIngestion | Debug | Ingested 45k records |
```
## Troubleshooting
**Events not appearing:**
- Ensure the sync service has been started: `sync_service.start().await`
- Verify event logger initialized: `sync_service.event_logger()`
- Check sync.db exists: `library.path().join("sync.db")`
**Database growing too large:**
- Verify pruning task is running (hourly by default)
- Check retention period: 7 days default
- Query old events: `SELECT COUNT(*) FROM sync_event_log WHERE timestamp < date('now', '-7 days')`
**Missing batch events:**
- Batch aggregator flushes every 30 seconds
- Manually flush: `sync_service.batch_aggregator().flush_all().await`
- Check aggregator task is running (started with sync service)
## Related Documentation
- [Library Sync](/docs/core/library-sync) - Sync protocol and architecture
- [Testing](/docs/core/testing) - Sync integration tests
- [Events](/docs/core/events) - General event system