18 KiB
Spacedrive Whitepaper Revision Plan
Last Updated: 2025-01-21 Purpose: Track architectural updates needed to align the whitepaper with the V2 implementation.
Editorial Guidelines for Updates
Writing Style
- Architecture-focused: Explain WHAT systems do and WHY design decisions were made
- No code examples: Exception for SdPath enum (core abstraction) and one SDK example if absolutely necessary
- No marketing language: Avoid superlatives like "blazing fast", "revolutionary", "game-changing"
- No fake statistics: Only cite real benchmarks from the indexing section
- No implementation status: Never mention "planned", "in progress", "coming soon" - write as if complete
- Technical precision: Use exact terminology, avoid vague descriptions
- Clarity over cleverness: Straightforward explanations trump eloquent prose
What NOT to Include
- Performance metrics beyond indexing benchmarks (we haven't measured them)
- Code listings (except SdPath and possibly one SDK example)
- Comparisons claiming "X% faster than Y" without data
- Feature timelines or roadmap speculation
- Implementation details (how it's coded vs. how it's architected)
Format Consistency
- Use
\textbf{}for emphasis, not italics in technical sections - Keep Key Takeaways boxes concise (3-4 bullets max)
- Diagrams over lengthy prose where possible
- Section cross-references using
\ref{}consistently
Status Legend
- CRITICAL - Architecturally incorrect, must fix
- MAJOR - Missing significant architectural details
- MINOR - Terminology tweaks or small additions
- REMOVE - Content to delete or minimize
Phase 1: Critical Architectural Corrections
1. Library Sync Architecture (Section 4.5.1)
Lines: 1266-1318 Problem: Describes sync too abstractly, missing the sophisticated watermark system that makes it reliable.
Required Changes:
-
Per-Resource Watermark Architecture
- Explain sync tracks progress independently per resource type (location, entry, volume, tag, etc.)
- Enables surgical recovery: only re-sync resources with detected gaps
- Prevents cross-contamination: advancing location watermark doesn't affect entry sync
-
Dual Watermark Strategy
- Cursor watermark: Advances optimistically with each received record
- Validated watermark: Only advances after count verification passes
- On gap detection, reset cursor to validated watermark for surgical recovery
-
Integrity Validation Mechanisms
- Count-based gap detection: Compare expected vs. actual record counts per resource
- Hash-based update detection: Aggregated hash of resource data catches missed updates
- Both run during watermark exchange between peers
-
Escalation Strategy
- Normal flow: Incremental catch-up using watermarks
- After 5 consecutive catch-up failures: Escalate to full backfill
- Backfill completes → Reset watermarks → Return to incremental mode
-
Watermark Exchange Protocol
- Bidirectional negotiation when devices reconnect
- Each device sends: watermarks + counts + hashes for all resources
- Peer responds with: actual counts/hashes + needs_catchup flags
- Surgical recovery initiated for mismatched resources only
Why This Matters: The watermark system is why Spacedrive can efficiently sync massive libraries without full re-indexing after network interruptions. It's a key architectural innovation over naive "send everything" approaches.
Remove: Vague references to "efficient state-based replication" without explaining the mechanism.
2. WASM Extension System (Section 4.9.2)
Lines: 2601-2655 Problem: Wire registry integration is incorrect - that's not the current plan. Need to focus on the actual WASM sandbox architecture.
Required Changes:
-
Remove: All references to "single host function routing to Wire registry"
-
Emphasize: WASM provides security through complete sandboxing
-
Focus on: Capability-based permission model
- Extensions declare required permissions upfront
- Permissions: ReadEntries, WriteSidecars, UseModel, RegisterModel, DispatchJobs
- Rate limiting per extension (requests/minute)
-
Memory Systems for AI Agents
- TemporalMemory: Time-ordered event stream, supports
since()queries - AssociativeMemory: Semantic similarity search, similarity threshold filtering
- WorkingMemory: Current state and active plans
- Agents maintain persistent knowledge across restarts
- TemporalMemory: Time-ordered event stream, supports
-
Event-Driven Architecture
#[on_startup]: Initialization hook#[on_event(EntryCreated)]: React to filesystem events#[scheduled(cron = "...")]: Time-based triggers#[filter("...")]: Entry filtering expressions
Keep ONE SDK Example: Show Photos extension structure to illustrate event-driven agents:
#[agent]
impl Photos {
#[on_event(EntryCreated)]
#[filter(".extension().is_image()")]
pub async fn on_new_photo(entry: Entry, ctx: &AgentContext<PhotosMind>);
}
Why This Matters: The extension architecture enables domain-specific intelligence (Photos, Finance, Organization agents) while maintaining security through sandboxing.
3. Indexing Engine Resumability (Section 4.3)
Lines: 738-872 Problem: Describes "multi-phase" abstractly without explaining what makes jobs actually resumable.
Required Changes:
-
Phase Separation Rationale
- Each phase has distinct failure modes and I/O characteristics
- Discovery: Filesystem traversal (fails on permissions)
- Processing: Database writes (fails on constraint violations)
- Aggregation: Hierarchical calculations (fails on corrupted references)
- Content ID: File hashing (fails on file locks)
-
Checkpoint Architecture
- Jobs checkpoint after each batch (default: 1000 entries)
- State serialized with MessagePack (compact binary format)
- On crash/restart: Deserialize state → Resume from last checkpoint
- Checkpoint includes: phase, batch cursor, processed entry IDs
-
Resumability Flow
- Job interrupted (crash, user cancel, device offline)
- State persisted to
jobs.dbwith last completed phase - On restart: Load serialized state from database
- Jump to last completed phase, skip processed entries
- Continue from checkpoint cursor
-
Ephemeral Mode Architecture
- In-memory Entry records for non-indexed paths
- Enables browsing external drives without permanent indexing
- Three use cases:
- Exploring removable media before adding as Location
- Remote filesystem browsing (peer device)
- "Lazy refresh" during directory navigation
Why This Matters: Resumability is critical for mobile devices and large libraries where indexing can take hours and may be interrupted multiple times.
Enhance Diagram (Fig 4.4): Add checkpoint persistence arrows and resumability flow.
Phase 2: Major Architectural Expansions
4. Content Identity Two-Tier Hashing (Section 4.2)
Lines: 643-735 Problem: Mentions "integrity hash" but doesn't explain when/why it's generated separately.
Required Changes:
-
Performance vs. Security Trade-off
- Initial indexing: Only sampled hash (first 16 chars of BLAKE3)
- Enables ~100× faster indexing (58KB read vs. full file)
- Full integrity hash generated lazily by background ValidationJobs
-
Validation Architecture
- ValidationJobs run during idle periods
- Generate complete BLAKE3 hash of entire file
- Compare against expected content_id
- Mismatch detection → Corruption alert + restoration from redundant copies
-
When Full Integrity Matters
- Large file transfers (verify no corruption)
- Backup verification (ensure bit-perfect copy)
- Forensic analysis (cryptographic proof of content)
- Security-sensitive files (detect tampering)
Why This Matters: Separating "identity" (for deduplication) from "integrity" (for verification) allows instant indexing while preserving cryptographic guarantees when needed.
5. Action System Simulation Details (Section 4.4)
Lines: 945-1236 Problem: Describes preview/commit but not HOW simulation achieves accuracy.
Required Changes:
-
Index-Based Simulation Architecture
- All predictions via SQL queries against VDFS index
- No filesystem access during preview
- Complete knowledge: Every file's size, location, relationships known
-
Content-Aware Path Resolution
- For
SdPath::Contentoperations, resolver evaluates all instances - Cost function weighs:
- Locality: Local device = 0 cost (instant)
- Network proximity: Iroh provides real-time latency measurements
- Storage tier: SSD prioritized over HDD (from PhysicalClass)
- Device availability: Only online devices considered
- Lowest-cost path selected automatically
- For
-
Conflict Detection Categories
- Storage constraints: Calculate exact space requirements, verify availability
- Permission violations: Check write access before committing
- Path conflicts: Detect naming collisions in target directory
- Circular references: Prevent moving parent into descendant
- Resource limitations: Estimate memory/bandwidth vs. device capabilities
-
Storage Tier Warnings
- Simulation detects PhysicalClass/LogicalClass mismatches
- Example: User marks folder as "Hot" but it's on Cold HDD
- Preview shows: "Warning: Operation targets hot location on slow archive drive"
Why This Matters: The simulation engine prevents data loss and user frustration by catching problems before execution. Its power comes from having a complete index.
6. Networking ALPN Multiplexing (Section 4.5.2)
Lines: 1320-1424 Problem: Mentions Iroh but doesn't explain why protocol consolidation matters.
Required Changes:
-
ALPN Protocol Multiplexing Benefits
- Single QUIC connection per device pair
- Multiple protocols as streams: pairing, sync, file transfer, messaging
- Each protocol identified by ALPN string (e.g., "spacedrive/sync/1")
- Stream-level routing, not connection-level
-
Connection Efficiency Gains
- Single TCP/QUIC handshake instead of N handshakes
- Shared congestion control across all operations
- Connection reuse eliminates re-establishment overhead
- Result: Sub-2-second connection establishment
-
Deterministic Connection Initiation
- Only device with lower NodeId initiates outbound connection
- Prevents race condition: Both devices trying to connect simultaneously
- Simpler state machine: Each device knows its role
-
Pairing Security Model
- BIP39 mnemonic codes (12 words from 256-bit secret)
- Challenge-response handshake (4 messages)
- Ed25519 signatures for authentication
- Prevents MITM during initial pairing
Why This Matters: Treating all protocols as streams on one connection eliminates coordination overhead and connection races in P2P networks.
Phase 3: Important Context Additions
7. Ephemeral Mode Use Cases (Section 4.1.2)
Lines: 393-394 Problem: One-sentence mention doesn't convey the architectural significance.
Add (1 paragraph):
- Three Ephemeral Scenarios
- Browsing external drives before formal indexing
- Exploring peer device filesystems remotely
- "Lazy refresh" during directory navigation
- Architectural Benefit: Immediate metadata capability (tagging, organizing) even for unindexed files
8. Lightweight Embedding Models (Section 4.7)
Lines: 1797-1948 Problem: Doesn't emphasize these are SMALL models, not LLMs.
Clarify:
-
Model Scale Reality
- all-MiniLM-L6-v2: 22M parameters, 384 dimensions, 5MB model size
- NOT GPT-scale (billions of parameters)
- Specialized for semantic similarity, not text generation
-
Performance Characteristics
- Runs efficiently on CPU (no GPU required)
- Processes thousands of files/second during indexing
- Real-time query embedding (<40ms)
Why This Matters: The architecture is practical BECAUSE it doesn't require massive models or specialized hardware.
9. Volume Classification Benefits (Section 4.6)
Lines: 1950-2075 Problem: Describes classification but not why the complexity matters.
Add:
-
Platform-Specific Chaos
- macOS: APFS containers create multiple volumes from one physical drive
- Linux: Virtual filesystems (/proc, /sys, /dev) clutter mount list
- Windows: Hidden recovery partitions and system volumes
-
Auto-Tracking Intelligence
- Filter ~10 system volumes → Show ~3 user-relevant volumes
- Present semantic names: "Primary", "External", "Network"
- Hide: System, VM, Preboot, Update partitions
Why This Matters: Users see cleaned, meaningful volume lists instead of technical chaos. Reduces cognitive load.
10. Closure Table Performance (Section 4.8 / Database)
Lines: 938-944 Problem: Mentions closure table but not the performance win.
Add:
-
Traditional Hierarchical Query Problem
- Recursive CTEs: Multiple passes over data
- LIKE-based path matching: O(n) table scan
- Performance degrades with tree depth
-
Closure Table Solution
- Pre-computed ancestor-descendant relationships
- All hierarchy queries → Single indexed join
- O(1) operations: Directory listing, size calculation, ancestor lookup
-
Trade-off
- Additional storage: O(d × n) where d = tree depth, n = entries
- Transactional updates: Insert self-closure + inherit parent closures
- Benefit: Million-file libraries with sub-100ms hierarchy queries
Why This Matters: This is why Spacedrive maintains responsiveness with massive libraries while traditional file managers slow down.
Phase 4: Terminology & Accuracy Corrections
11. HLC Usage Clarification (Multiple Sections)
Problem: Paper sometimes implies HLC is used for all sync.
Global Find/Replace Needed:
- Device-owned data (entries, locations, volumes): Timestamp-based watermarks
- Shared data (tags, collections, user metadata): HLC-based log
- Be explicit about which domain uses which mechanism
- Update Section 4.5.1 sync domain table to clarify
12. Testing Framework Detail (Section 7)
Lines: 2366-2415 Problem: Underplays sophistication of distributed testing.
Add:
-
Subprocess Testing Architecture
- Tests spawn multiple Rust processes, each simulating a device
- Environment variables control device roles (TEST_ROLE=alice)
- Real P2P communication over loopback
-
Realistic Scenarios Tested
- Full device pairing flows with authentication
- Conflict detection and resolution
- Network interruption recovery
- Cross-device file transfers
-
Scale: 43 integration tests validate distributed system behavior that would be impossible with unit tests alone
Why This Matters: Validates the distributed system ACTUALLY works, not just individual components in isolation.
Phase 5: Content Removal & Cleanup
13. Remove Unnecessary Code Listings
Keep ONLY:
- SdPath enum (lines 458-474) - core abstraction
- ONE SDK example for agent event handler (if needed for clarity)
Remove:
- Rust trait definitions (Job, JobHandler, etc.)
- SQL schema code
- File type TOML examples
- JSON format examples
- All other implementation snippets
Reasoning: Paper explains architecture, not implementation. Code distracts from concepts.
14. Remove Benchmark Claims Outside Indexing
Scan for and remove:
- "Sub-100ms search" (not benchmarked)
- "8,500 files/sec" (only indexing is benchmarked)
- Network throughput numbers (not measured)
- Any "X% faster" comparisons without data
Keep:
- Table 4.1 (Indexing benchmark data) - real measurements
- Generic statements: "sub-second response times" (not specific numbers)
Phase 6: Diagram Improvements
15. Sync Architecture Diagram (Section 4.5.1)
Current: Text-heavy explanation Improve: Visual diagram showing:
- Two sync domains (device-owned vs. shared)
- Watermark exchange protocol flow
- Escalation decision tree (catch-up → backfill)
16. Indexing Pipeline Diagram (Section 4.3)
Current (Fig 4.4): Basic phase flow Enhance:
- Checkpoint persistence after each phase
- Resumability arrows showing restart path
- Ephemeral mode as separate branch
Document Conventions
When Writing Updates
- Start with "Why": Explain the problem being solved
- Architecture over Implementation: Focus on WHAT and WHY, not HOW
- Be Precise: Use exact technical terms, avoid vague descriptions
- Cross-Reference: Link related sections with
\ref{} - Diagrams > Prose: Visualize complex interactions when possible
Review Checklist
- No marketing language or superlatives?
- No fake statistics or unmeasured performance claims?
- No code examples (except SdPath + maybe one SDK example)?
- Explains WHY design decisions were made?
- Technically accurate and precise?
- Consistent terminology with glossary (Appendix)?
Priority Order for Implementation
Week 1: Critical Fixes
- Section 4.5.1 - Library Sync (most architecturally wrong)
- Section 4.9.2 - WASM Extensions (remove incorrect Wire info)
- Section 4.3 - Indexing Resumability (missing key details)
Week 2: Major Expansions 4. Section 4.2 - Two-Tier Hashing 5. Section 4.4 - Simulation Engine 6. Section 4.5.2 - ALPN Multiplexing
Week 3: Polish 7-12. Minor additions and terminology fixes 13-14. Remove unnecessary code and fake benchmarks 15-16. Improve diagrams
Notes
- PhysicalClass/LogicalClass: Keep in paper (simulation engine needs it)
- Extension Wire integration: Removed from plan (not current architecture)
- Benchmarks: Only indexing section has real measurements
- Writing style: Technical precision, no marketing fluff