spacedrive/whitepaper/REVISION_PLAN.md
2025-11-22 05:12:55 -08:00

18 KiB
Raw Blame History

Spacedrive Whitepaper Revision Plan

Last Updated: 2025-01-21 Purpose: Track architectural updates needed to align the whitepaper with the V2 implementation.


Editorial Guidelines for Updates

Writing Style

  • Architecture-focused: Explain WHAT systems do and WHY design decisions were made
  • No code examples: Exception for SdPath enum (core abstraction) and one SDK example if absolutely necessary
  • No marketing language: Avoid superlatives like "blazing fast", "revolutionary", "game-changing"
  • No fake statistics: Only cite real benchmarks from the indexing section
  • No implementation status: Never mention "planned", "in progress", "coming soon" - write as if complete
  • Technical precision: Use exact terminology, avoid vague descriptions
  • Clarity over cleverness: Straightforward explanations trump eloquent prose

What NOT to Include

  • Performance metrics beyond indexing benchmarks (we haven't measured them)
  • Code listings (except SdPath and possibly one SDK example)
  • Comparisons claiming "X% faster than Y" without data
  • Feature timelines or roadmap speculation
  • Implementation details (how it's coded vs. how it's architected)

Format Consistency

  • Use \textbf{} for emphasis, not italics in technical sections
  • Keep Key Takeaways boxes concise (3-4 bullets max)
  • Diagrams over lengthy prose where possible
  • Section cross-references using \ref{} consistently

Status Legend

  • CRITICAL - Architecturally incorrect, must fix
  • MAJOR - Missing significant architectural details
  • MINOR - Terminology tweaks or small additions
  • REMOVE - Content to delete or minimize

Phase 1: Critical Architectural Corrections

1. Library Sync Architecture (Section 4.5.1)

Lines: 1266-1318 Problem: Describes sync too abstractly, missing the sophisticated watermark system that makes it reliable.

Required Changes:

  • Per-Resource Watermark Architecture

    • Explain sync tracks progress independently per resource type (location, entry, volume, tag, etc.)
    • Enables surgical recovery: only re-sync resources with detected gaps
    • Prevents cross-contamination: advancing location watermark doesn't affect entry sync
  • Dual Watermark Strategy

    • Cursor watermark: Advances optimistically with each received record
    • Validated watermark: Only advances after count verification passes
    • On gap detection, reset cursor to validated watermark for surgical recovery
  • Integrity Validation Mechanisms

    • Count-based gap detection: Compare expected vs. actual record counts per resource
    • Hash-based update detection: Aggregated hash of resource data catches missed updates
    • Both run during watermark exchange between peers
  • Escalation Strategy

    • Normal flow: Incremental catch-up using watermarks
    • After 5 consecutive catch-up failures: Escalate to full backfill
    • Backfill completes → Reset watermarks → Return to incremental mode
  • Watermark Exchange Protocol

    • Bidirectional negotiation when devices reconnect
    • Each device sends: watermarks + counts + hashes for all resources
    • Peer responds with: actual counts/hashes + needs_catchup flags
    • Surgical recovery initiated for mismatched resources only

Why This Matters: The watermark system is why Spacedrive can efficiently sync massive libraries without full re-indexing after network interruptions. It's a key architectural innovation over naive "send everything" approaches.

Remove: Vague references to "efficient state-based replication" without explaining the mechanism.


2. WASM Extension System (Section 4.9.2)

Lines: 2601-2655 Problem: Wire registry integration is incorrect - that's not the current plan. Need to focus on the actual WASM sandbox architecture.

Required Changes:

  • Remove: All references to "single host function routing to Wire registry"

  • Emphasize: WASM provides security through complete sandboxing

  • Focus on: Capability-based permission model

    • Extensions declare required permissions upfront
    • Permissions: ReadEntries, WriteSidecars, UseModel, RegisterModel, DispatchJobs
    • Rate limiting per extension (requests/minute)
  • Memory Systems for AI Agents

    • TemporalMemory: Time-ordered event stream, supports since() queries
    • AssociativeMemory: Semantic similarity search, similarity threshold filtering
    • WorkingMemory: Current state and active plans
    • Agents maintain persistent knowledge across restarts
  • Event-Driven Architecture

    • #[on_startup]: Initialization hook
    • #[on_event(EntryCreated)]: React to filesystem events
    • #[scheduled(cron = "...")]: Time-based triggers
    • #[filter("...")]: Entry filtering expressions

Keep ONE SDK Example: Show Photos extension structure to illustrate event-driven agents:

#[agent]
impl Photos {
    #[on_event(EntryCreated)]
    #[filter(".extension().is_image()")]
    pub async fn on_new_photo(entry: Entry, ctx: &AgentContext<PhotosMind>);
}

Why This Matters: The extension architecture enables domain-specific intelligence (Photos, Finance, Organization agents) while maintaining security through sandboxing.


3. Indexing Engine Resumability (Section 4.3)

Lines: 738-872 Problem: Describes "multi-phase" abstractly without explaining what makes jobs actually resumable.

Required Changes:

  • Phase Separation Rationale

    • Each phase has distinct failure modes and I/O characteristics
    • Discovery: Filesystem traversal (fails on permissions)
    • Processing: Database writes (fails on constraint violations)
    • Aggregation: Hierarchical calculations (fails on corrupted references)
    • Content ID: File hashing (fails on file locks)
  • Checkpoint Architecture

    • Jobs checkpoint after each batch (default: 1000 entries)
    • State serialized with MessagePack (compact binary format)
    • On crash/restart: Deserialize state → Resume from last checkpoint
    • Checkpoint includes: phase, batch cursor, processed entry IDs
  • Resumability Flow

    1. Job interrupted (crash, user cancel, device offline)
    2. State persisted to jobs.db with last completed phase
    3. On restart: Load serialized state from database
    4. Jump to last completed phase, skip processed entries
    5. Continue from checkpoint cursor
  • Ephemeral Mode Architecture

    • In-memory Entry records for non-indexed paths
    • Enables browsing external drives without permanent indexing
    • Three use cases:
      • Exploring removable media before adding as Location
      • Remote filesystem browsing (peer device)
      • "Lazy refresh" during directory navigation

Why This Matters: Resumability is critical for mobile devices and large libraries where indexing can take hours and may be interrupted multiple times.

Enhance Diagram (Fig 4.4): Add checkpoint persistence arrows and resumability flow.


Phase 2: Major Architectural Expansions

4. Content Identity Two-Tier Hashing (Section 4.2)

Lines: 643-735 Problem: Mentions "integrity hash" but doesn't explain when/why it's generated separately.

Required Changes:

  • Performance vs. Security Trade-off

    • Initial indexing: Only sampled hash (first 16 chars of BLAKE3)
    • Enables ~100× faster indexing (58KB read vs. full file)
    • Full integrity hash generated lazily by background ValidationJobs
  • Validation Architecture

    • ValidationJobs run during idle periods
    • Generate complete BLAKE3 hash of entire file
    • Compare against expected content_id
    • Mismatch detection → Corruption alert + restoration from redundant copies
  • When Full Integrity Matters

    • Large file transfers (verify no corruption)
    • Backup verification (ensure bit-perfect copy)
    • Forensic analysis (cryptographic proof of content)
    • Security-sensitive files (detect tampering)

Why This Matters: Separating "identity" (for deduplication) from "integrity" (for verification) allows instant indexing while preserving cryptographic guarantees when needed.


5. Action System Simulation Details (Section 4.4)

Lines: 945-1236 Problem: Describes preview/commit but not HOW simulation achieves accuracy.

Required Changes:

  • Index-Based Simulation Architecture

    • All predictions via SQL queries against VDFS index
    • No filesystem access during preview
    • Complete knowledge: Every file's size, location, relationships known
  • Content-Aware Path Resolution

    • For SdPath::Content operations, resolver evaluates all instances
    • Cost function weighs:
      • Locality: Local device = 0 cost (instant)
      • Network proximity: Iroh provides real-time latency measurements
      • Storage tier: SSD prioritized over HDD (from PhysicalClass)
      • Device availability: Only online devices considered
    • Lowest-cost path selected automatically
  • Conflict Detection Categories

    • Storage constraints: Calculate exact space requirements, verify availability
    • Permission violations: Check write access before committing
    • Path conflicts: Detect naming collisions in target directory
    • Circular references: Prevent moving parent into descendant
    • Resource limitations: Estimate memory/bandwidth vs. device capabilities
  • Storage Tier Warnings

    • Simulation detects PhysicalClass/LogicalClass mismatches
    • Example: User marks folder as "Hot" but it's on Cold HDD
    • Preview shows: "Warning: Operation targets hot location on slow archive drive"

Why This Matters: The simulation engine prevents data loss and user frustration by catching problems before execution. Its power comes from having a complete index.


6. Networking ALPN Multiplexing (Section 4.5.2)

Lines: 1320-1424 Problem: Mentions Iroh but doesn't explain why protocol consolidation matters.

Required Changes:

  • ALPN Protocol Multiplexing Benefits

    • Single QUIC connection per device pair
    • Multiple protocols as streams: pairing, sync, file transfer, messaging
    • Each protocol identified by ALPN string (e.g., "spacedrive/sync/1")
    • Stream-level routing, not connection-level
  • Connection Efficiency Gains

    • Single TCP/QUIC handshake instead of N handshakes
    • Shared congestion control across all operations
    • Connection reuse eliminates re-establishment overhead
    • Result: Sub-2-second connection establishment
  • Deterministic Connection Initiation

    • Only device with lower NodeId initiates outbound connection
    • Prevents race condition: Both devices trying to connect simultaneously
    • Simpler state machine: Each device knows its role
  • Pairing Security Model

    • BIP39 mnemonic codes (12 words from 256-bit secret)
    • Challenge-response handshake (4 messages)
    • Ed25519 signatures for authentication
    • Prevents MITM during initial pairing

Why This Matters: Treating all protocols as streams on one connection eliminates coordination overhead and connection races in P2P networks.


Phase 3: Important Context Additions

7. Ephemeral Mode Use Cases (Section 4.1.2)

Lines: 393-394 Problem: One-sentence mention doesn't convey the architectural significance.

Add (1 paragraph):

  • Three Ephemeral Scenarios
    • Browsing external drives before formal indexing
    • Exploring peer device filesystems remotely
    • "Lazy refresh" during directory navigation
  • Architectural Benefit: Immediate metadata capability (tagging, organizing) even for unindexed files

8. Lightweight Embedding Models (Section 4.7)

Lines: 1797-1948 Problem: Doesn't emphasize these are SMALL models, not LLMs.

Clarify:

  • Model Scale Reality

    • all-MiniLM-L6-v2: 22M parameters, 384 dimensions, 5MB model size
    • NOT GPT-scale (billions of parameters)
    • Specialized for semantic similarity, not text generation
  • Performance Characteristics

    • Runs efficiently on CPU (no GPU required)
    • Processes thousands of files/second during indexing
    • Real-time query embedding (<40ms)

Why This Matters: The architecture is practical BECAUSE it doesn't require massive models or specialized hardware.


9. Volume Classification Benefits (Section 4.6)

Lines: 1950-2075 Problem: Describes classification but not why the complexity matters.

Add:

  • Platform-Specific Chaos

    • macOS: APFS containers create multiple volumes from one physical drive
    • Linux: Virtual filesystems (/proc, /sys, /dev) clutter mount list
    • Windows: Hidden recovery partitions and system volumes
  • Auto-Tracking Intelligence

    • Filter ~10 system volumes → Show ~3 user-relevant volumes
    • Present semantic names: "Primary", "External", "Network"
    • Hide: System, VM, Preboot, Update partitions

Why This Matters: Users see cleaned, meaningful volume lists instead of technical chaos. Reduces cognitive load.


10. Closure Table Performance (Section 4.8 / Database)

Lines: 938-944 Problem: Mentions closure table but not the performance win.

Add:

  • Traditional Hierarchical Query Problem

    • Recursive CTEs: Multiple passes over data
    • LIKE-based path matching: O(n) table scan
    • Performance degrades with tree depth
  • Closure Table Solution

    • Pre-computed ancestor-descendant relationships
    • All hierarchy queries → Single indexed join
    • O(1) operations: Directory listing, size calculation, ancestor lookup
  • Trade-off

    • Additional storage: O(d × n) where d = tree depth, n = entries
    • Transactional updates: Insert self-closure + inherit parent closures
    • Benefit: Million-file libraries with sub-100ms hierarchy queries

Why This Matters: This is why Spacedrive maintains responsiveness with massive libraries while traditional file managers slow down.


Phase 4: Terminology & Accuracy Corrections

11. HLC Usage Clarification (Multiple Sections)

Problem: Paper sometimes implies HLC is used for all sync.

Global Find/Replace Needed:

  • Device-owned data (entries, locations, volumes): Timestamp-based watermarks
  • Shared data (tags, collections, user metadata): HLC-based log
  • Be explicit about which domain uses which mechanism
  • Update Section 4.5.1 sync domain table to clarify

12. Testing Framework Detail (Section 7)

Lines: 2366-2415 Problem: Underplays sophistication of distributed testing.

Add:

  • Subprocess Testing Architecture

    • Tests spawn multiple Rust processes, each simulating a device
    • Environment variables control device roles (TEST_ROLE=alice)
    • Real P2P communication over loopback
  • Realistic Scenarios Tested

    • Full device pairing flows with authentication
    • Conflict detection and resolution
    • Network interruption recovery
    • Cross-device file transfers
  • Scale: 43 integration tests validate distributed system behavior that would be impossible with unit tests alone

Why This Matters: Validates the distributed system ACTUALLY works, not just individual components in isolation.


Phase 5: Content Removal & Cleanup

13. Remove Unnecessary Code Listings

Keep ONLY:

  • SdPath enum (lines 458-474) - core abstraction
  • ONE SDK example for agent event handler (if needed for clarity)

Remove:

  • Rust trait definitions (Job, JobHandler, etc.)
  • SQL schema code
  • File type TOML examples
  • JSON format examples
  • All other implementation snippets

Reasoning: Paper explains architecture, not implementation. Code distracts from concepts.


14. Remove Benchmark Claims Outside Indexing

Scan for and remove:

  • "Sub-100ms search" (not benchmarked)
  • "8,500 files/sec" (only indexing is benchmarked)
  • Network throughput numbers (not measured)
  • Any "X% faster" comparisons without data

Keep:

  • Table 4.1 (Indexing benchmark data) - real measurements
  • Generic statements: "sub-second response times" (not specific numbers)

Phase 6: Diagram Improvements

15. Sync Architecture Diagram (Section 4.5.1)

Current: Text-heavy explanation Improve: Visual diagram showing:

  • Two sync domains (device-owned vs. shared)
  • Watermark exchange protocol flow
  • Escalation decision tree (catch-up → backfill)

16. Indexing Pipeline Diagram (Section 4.3)

Current (Fig 4.4): Basic phase flow Enhance:

  • Checkpoint persistence after each phase
  • Resumability arrows showing restart path
  • Ephemeral mode as separate branch

Document Conventions

When Writing Updates

  1. Start with "Why": Explain the problem being solved
  2. Architecture over Implementation: Focus on WHAT and WHY, not HOW
  3. Be Precise: Use exact technical terms, avoid vague descriptions
  4. Cross-Reference: Link related sections with \ref{}
  5. Diagrams > Prose: Visualize complex interactions when possible

Review Checklist

  • No marketing language or superlatives?
  • No fake statistics or unmeasured performance claims?
  • No code examples (except SdPath + maybe one SDK example)?
  • Explains WHY design decisions were made?
  • Technically accurate and precise?
  • Consistent terminology with glossary (Appendix)?

Priority Order for Implementation

Week 1: Critical Fixes

  1. Section 4.5.1 - Library Sync (most architecturally wrong)
  2. Section 4.9.2 - WASM Extensions (remove incorrect Wire info)
  3. Section 4.3 - Indexing Resumability (missing key details)

Week 2: Major Expansions 4. Section 4.2 - Two-Tier Hashing 5. Section 4.4 - Simulation Engine 6. Section 4.5.2 - ALPN Multiplexing

Week 3: Polish 7-12. Minor additions and terminology fixes 13-14. Remove unnecessary code and fake benchmarks 15-16. Improve diagrams


Notes

  • PhysicalClass/LogicalClass: Keep in paper (simulation engine needs it)
  • Extension Wire integration: Removed from plan (not current architecture)
  • Benchmarks: Only indexing section has real measurements
  • Writing style: Technical precision, no marketing fluff