cleanup

2025-12-11 20:15:30 +01:00 · 2025-12-01 13:30:30 -08:00 · 2025-12-01 13:30:30 -08:00 · 73befe0f0f
commit 73befe0f0f
parent 4288dd3555
4 changed files with 1 addition and 491 deletions
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Organize files across multiple devices, clouds, and platforms from a single inte
 ---
-> [!NOTE] > **Hi, Jamie here!** This is Spacedrive v2 (December 2025)—a complete ground-up rewrite.
+> [!NOTE] **Hi, Jamie here!** This is Spacedrive v2 (December 2025)—a complete ground-up rewrite.
 >
 > After development of the original alpha version stopped in January this year, left with the hard lessons of the incomplete alpha, I set out to completely rebuild Spacedrive from the ground up. The first release **2.0.0-pre.1** is coming before Christmas.
 >
--- a/core/crush.json
+++ b/core/crush.json
@@ -1,18 +0,0 @@
 {
  "$schema": "https://charm.land/crush.json",
  "providers": {
    "lmstudio": {
      "name": "LM Studio",
      "base_url": "http://localhost:1234/v1/",
      "type": "openai",
      "models": [
        {
          "name": "Qwen3 30B MOE",
          "id": "local-model",
          "context_window": 131072,
          "default_max_tokens": 20000
        }
      ]
    }
  }
 }
--- a/core/master_key
+++ b/core/master_key
@@ -1 +0,0 @@
 1dd5b461c2ca70e1d328ece908142f6ab6f531dbfc55e320e0d2d503b0177eed
--- a/whitepaper/REVISION_PLAN.md
+++ b/whitepaper/REVISION_PLAN.md
@@ -1,471 +0,0 @@
 # Spacedrive Whitepaper Revision Plan
 **Last Updated:** 2025-01-21
 **Purpose:** Track architectural updates needed to align the whitepaper with the V2 implementation.
 ---
 ## Editorial Guidelines for Updates
 ### Writing Style
 - **Architecture-focused**: Explain WHAT systems do and WHY design decisions were made
 - **No code examples**: Exception for SdPath enum (core abstraction) and one SDK example if absolutely necessary
 - **No marketing language**: Avoid superlatives like "blazing fast", "revolutionary", "game-changing"
 - **No fake statistics**: Only cite real benchmarks from the indexing section
 - **No implementation status**: Never mention "planned", "in progress", "coming soon" - write as if complete
 - **Technical precision**: Use exact terminology, avoid vague descriptions
 - **Clarity over cleverness**: Straightforward explanations trump eloquent prose
 ### What NOT to Include
 - Performance metrics beyond indexing benchmarks (we haven't measured them)
 - Code listings (except SdPath and possibly one SDK example)
 - Comparisons claiming "X% faster than Y" without data
 - Feature timelines or roadmap speculation
 - Implementation details (how it's coded vs. how it's architected)
 ### Format Consistency
 - Use `\textbf{}` for emphasis, not italics in technical sections
 - Keep Key Takeaways boxes concise (3-4 bullets max)
 - Diagrams over lengthy prose where possible
 - Section cross-references using `\ref{}` consistently
 ---
 ## Status Legend
 - **CRITICAL** - Architecturally incorrect, must fix
 - **MAJOR** - Missing significant architectural details
 - **MINOR** - Terminology tweaks or small additions
 - **REMOVE** - Content to delete or minimize
 ---
 ## Phase 1: Critical Architectural Corrections
 ### 1. Library Sync Architecture (Section 4.5.1)
 **Lines:** 1266-1318
 **Problem:** Describes sync too abstractly, missing the sophisticated watermark system that makes it reliable.
 **Required Changes:**
 - **Per-Resource Watermark Architecture**
  - Explain sync tracks progress independently per resource type (location, entry, volume, tag, etc.)
  - Enables surgical recovery: only re-sync resources with detected gaps
  - Prevents cross-contamination: advancing location watermark doesn't affect entry sync
 - **Dual Watermark Strategy**
  - **Cursor watermark**: Advances optimistically with each received record
  - **Validated watermark**: Only advances after count verification passes
  - On gap detection, reset cursor to validated watermark for surgical recovery
 - **Integrity Validation Mechanisms**
  - **Count-based gap detection**: Compare expected vs. actual record counts per resource
  - **Hash-based update detection**: Aggregated hash of resource data catches missed updates
  - Both run during watermark exchange between peers
 - **Escalation Strategy**
  - Normal flow: Incremental catch-up using watermarks
  - After 5 consecutive catch-up failures: Escalate to full backfill
  - Backfill completes → Reset watermarks → Return to incremental mode
 - **Watermark Exchange Protocol**
  - Bidirectional negotiation when devices reconnect
  - Each device sends: watermarks + counts + hashes for all resources
  - Peer responds with: actual counts/hashes + needs_catchup flags
  - Surgical recovery initiated for mismatched resources only
 **Why This Matters:**
 The watermark system is why Spacedrive can efficiently sync massive libraries without full re-indexing after network interruptions. It's a key architectural innovation over naive "send everything" approaches.
 **Remove:** Vague references to "efficient state-based replication" without explaining the mechanism.
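The dual-watermark and escalation rules above can be sketched as a small state machine. This is a minimal illustration, not the actual Spacedrive implementation: `ResourceWatermark`, `SyncAction`, and `MAX_CATCHUP_FAILURES` are hypothetical names, and timestamps are plain `u64` values. One such struct would be kept per resource type, so that a gap in one resource never rewinds another.

```rust
/// Hypothetical sync state kept once per resource type (location, entry, tag, ...).
#[derive(Debug, Default)]
pub struct ResourceWatermark {
    /// Cursor watermark: advances optimistically with each received record.
    pub cursor: u64,
    /// Validated watermark: advances only after count verification passes.
    pub validated: u64,
    /// Consecutive failed catch-up attempts.
    pub failures: u32,
}

#[derive(Debug, PartialEq)]
pub enum SyncAction {
    /// Counts matched; keep syncing incrementally.
    Continue,
    /// Gap detected; re-request records newer than `from`.
    CatchUp { from: u64 },
    /// Too many failed catch-ups; request a full backfill.
    FullBackfill,
}

/// Escalation threshold taken from the plan (5 consecutive failures).
pub const MAX_CATCHUP_FAILURES: u32 = 5;

impl ResourceWatermark {
    /// Advance the optimistic cursor for a received record.
    pub fn record_received(&mut self, timestamp: u64) {
        self.cursor = self.cursor.max(timestamp);
    }

    /// Compare expected vs. actual record counts and decide the next step.
    pub fn verify(&mut self, expected_count: u64, actual_count: u64) -> SyncAction {
        if expected_count == actual_count {
            self.validated = self.cursor;
            self.failures = 0;
            return SyncAction::Continue;
        }
        // Gap detected: rewind the cursor to the last validated position.
        self.cursor = self.validated;
        self.failures += 1;
        if self.failures >= MAX_CATCHUP_FAILURES {
            SyncAction::FullBackfill
        } else {
            SyncAction::CatchUp { from: self.validated }
        }
    }

    /// After a backfill completes, reset both watermarks and return to incremental mode.
    pub fn backfill_complete(&mut self, watermark: u64) {
        self.cursor = watermark;
        self.validated = watermark;
        self.failures = 0;
    }
}
```

Hash-based update detection is omitted here; it would add a second comparison alongside the count check in `verify`.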
 ---
 ### 2. WASM Extension System (Section 4.9.2)
 **Lines:** 2601-2655
 **Problem:** Wire registry integration is incorrect - that's not the current plan. Need to focus on the actual WASM sandbox architecture.
 **Required Changes:**
 - **Remove:** All references to "single host function routing to Wire registry"
 - **Emphasize:** WASM provides security through complete sandboxing
 - **Focus on:** Capability-based permission model
  - Extensions declare required permissions upfront
  - Permissions: ReadEntries, WriteSidecars, UseModel, RegisterModel, DispatchJobs
  - Rate limiting per extension (requests/minute)
 - **Memory Systems for AI Agents**
  - **TemporalMemory**: Time-ordered event stream, supports `since()` queries
  - **AssociativeMemory**: Semantic similarity search, similarity threshold filtering
  - **WorkingMemory**: Current state and active plans
  - Agents maintain persistent knowledge across restarts
 - **Event-Driven Architecture**
  - `#[on_startup]`: Initialization hook
  - `#[on_event(EntryCreated)]`: React to filesystem events
  - `#[scheduled(cron = "...")]`: Time-based triggers
  - `#[filter("...")]`: Entry filtering expressions
 **Keep ONE SDK Example:**
 Show Photos extension structure to illustrate event-driven agents:
 ```rust
 #[agent]
 impl Photos {
    #[on_event(EntryCreated)]
    #[filter(".extension().is_image()")]
    pub async fn on_new_photo(entry: Entry, ctx: &AgentContext<PhotosMind>);
 }
 ```
 **Why This Matters:**
 The extension architecture enables domain-specific intelligence (Photos, Finance, Organization agents) while maintaining security through sandboxing.
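As a rough illustration of the capability model and per-extension rate limiting described in this section, the sketch below checks a declared permission set and a fixed one-minute request window. The permission names come from the list above; `ExtensionGrant`, `Denied`, and the fixed-window strategy are assumptions made for the example, not the real host API.

```rust
use std::collections::HashSet;

/// Permission names as listed in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Permission {
    ReadEntries,
    WriteSidecars,
    UseModel,
    RegisterModel,
    DispatchJobs,
}

#[derive(Debug, PartialEq)]
pub enum Denied {
    MissingPermission(Permission),
    RateLimited,
}

/// Hypothetical host-side record of what one extension may do.
pub struct ExtensionGrant {
    granted: HashSet<Permission>,
    max_requests_per_minute: u32,
    window_start_secs: u64,
    requests_in_window: u32,
}

impl ExtensionGrant {
    /// `declared` is the permission list the extension states upfront.
    pub fn new(declared: &[Permission], max_requests_per_minute: u32) -> Self {
        Self {
            granted: declared.iter().copied().collect(),
            max_requests_per_minute,
            window_start_secs: 0,
            requests_in_window: 0,
        }
    }

    /// Check a host call against the declared permissions and the
    /// fixed-window rate limit (a simplifying assumption).
    pub fn authorize(&mut self, needed: Permission, now_secs: u64) -> Result<(), Denied> {
        if !self.granted.contains(&needed) {
            return Err(Denied::MissingPermission(needed));
        }
        // Start a new window once 60 seconds have elapsed.
        if now_secs.saturating_sub(self.window_start_secs) >= 60 {
            self.window_start_secs = now_secs;
            self.requests_in_window = 0;
        }
        if self.requests_in_window >= self.max_requests_per_minute {
            return Err(Denied::RateLimited);
        }
        self.requests_in_window += 1;
        Ok(())
    }
}
```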
 ---
 ### 3. Indexing Engine Resumability (Section 4.3)
 **Lines:** 738-872
 **Problem:** Describes "multi-phase" abstractly without explaining what makes jobs actually resumable.
 **Required Changes:**
 - **Phase Separation Rationale**
  - Each phase has distinct failure modes and I/O characteristics
  - Discovery: Filesystem traversal (fails on permissions)
  - Processing: Database writes (fails on constraint violations)
  - Aggregation: Hierarchical calculations (fails on corrupted references)
  - Content ID: File hashing (fails on file locks)
 - **Checkpoint Architecture**
  - Jobs checkpoint after each batch (default: 1000 entries)
  - State serialized with MessagePack (compact binary format)
  - On crash/restart: Deserialize state → Resume from last checkpoint
  - Checkpoint includes: phase, batch cursor, processed entry IDs
 - **Resumability Flow**
  1. Job interrupted (crash, user cancel, device offline)
  2. State persisted to `jobs.db` with last completed phase
  3. On restart: Load serialized state from database
  4. Jump to last completed phase, skip processed entries
  5. Continue from checkpoint cursor
 - **Ephemeral Mode Architecture**
  - In-memory Entry records for non-indexed paths
  - Enables browsing external drives without permanent indexing
  - Three use cases:
    - Exploring removable media before adding as Location
    - Remote filesystem browsing (peer device)
    - "Lazy refresh" during directory navigation
 **Why This Matters:**
 Resumability is critical for mobile devices and large libraries where indexing can take hours and may be interrupted multiple times.
 **Enhance Diagram (Fig 4.4):** Add checkpoint persistence arrows and resumability flow.
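The checkpoint and resume flow in this section could look roughly like the following. It is a sketch under stated assumptions: `IndexJob`, `Checkpoint`, and `run` are hypothetical names, entries are plain `u64` IDs, and the MessagePack serialization and `jobs.db` persistence are reduced to cloning the checkpoint in memory.

```rust
/// Phase names as listed in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentId,
}

/// State a job would serialize after each batch (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Checkpoint {
    pub phase: Phase,
    /// Index of the first entry that has not been processed yet.
    pub batch_cursor: usize,
}

pub struct IndexJob {
    pub checkpoint: Checkpoint,
    pub batch_size: usize,
}

impl IndexJob {
    /// Rebuild a job from a previously saved checkpoint.
    pub fn resume(checkpoint: Checkpoint, batch_size: usize) -> Self {
        Self { checkpoint, batch_size }
    }

    /// Process at most `max_batches` batches of the current phase,
    /// checkpointing after each one. Returns true once the phase is complete.
    pub fn run(&mut self, entries: &[u64], max_batches: usize, processed: &mut Vec<u64>) -> bool {
        let batch = self.batch_size.max(1); // guard against a zero batch size
        for _ in 0..max_batches {
            let start = self.checkpoint.batch_cursor;
            if start >= entries.len() {
                return true;
            }
            let end = (start + batch).min(entries.len());
            processed.extend_from_slice(&entries[start..end]);
            // A real job would serialize and persist the checkpoint here.
            self.checkpoint.batch_cursor = end;
        }
        self.checkpoint.batch_cursor >= entries.len()
    }
}
```

Because the cursor only moves after a batch finishes, a restart repeats at most one partial batch and never skips entries.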
 ---
 ## Phase 2: Major Architectural Expansions
 ### 4. Content Identity Two-Tier Hashing (Section 4.2)
 **Lines:** 643-735
 **Problem:** Mentions "integrity hash" but doesn't explain when/why it's generated separately.
 **Required Changes:**
 - **Performance vs. Security Trade-off**
  - Initial indexing: Only sampled hash (first 16 chars of BLAKE3)
  - Enables ~100× faster indexing (58KB read vs. full file)
  - Full integrity hash generated lazily by background ValidationJobs
 - **Validation Architecture**
  - ValidationJobs run during idle periods
  - Generate complete BLAKE3 hash of entire file
  - Compare against expected content_id
  - Mismatch detection → Corruption alert + restoration from redundant copies
 - **When Full Integrity Matters**
  - Large file transfers (verify no corruption)
  - Backup verification (ensure bit-perfect copy)
  - Forensic analysis (cryptographic proof of content)
  - Security-sensitive files (detect tampering)
 **Why This Matters:**
 Separating "identity" (for deduplication) from "integrity" (for verification) allows instant indexing while preserving cryptographic guarantees when needed.
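To make the identity-versus-integrity split concrete, here is a small sketch. Everything in it is an assumption for illustration: the standard library's `DefaultHasher` stands in for BLAKE3 (which is an external crate), and the head-plus-tail sampling scheme is invented for the example, since the plan does not state which byte ranges are sampled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Placeholder digest. The plan uses BLAKE3; `DefaultHasher` is a
/// stand-in so that the sketch needs no external crates.
fn digest(chunks: &[&[u8]]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for chunk in chunks {
        hasher.write(chunk);
    }
    hasher.finish()
}

/// Fast identity hash: file length plus `sample` bytes from the head and
/// the tail. The sampling scheme is hypothetical.
pub fn sampled_content_id(data: &[u8], sample: usize) -> u64 {
    let len = (data.len() as u64).to_le_bytes();
    if data.len() <= sample * 2 {
        // Small files are hashed in full.
        return digest(&[&len[..], data]);
    }
    digest(&[&len[..], &data[..sample], &data[data.len() - sample..]])
}

/// Full integrity hash over every byte, as a background validation job
/// would compute during idle periods.
pub fn integrity_hash(data: &[u8]) -> u64 {
    digest(&[data])
}
```

Two files that differ only in an unsampled region share a sampled ID but not an integrity hash, which is why the full hash is still needed for verification.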
 ---
 ### 5. Action System Simulation Details (Section 4.4)
 **Lines:** 945-1236
 **Problem:** Describes preview/commit but not HOW simulation achieves accuracy.
 **Required Changes:**
 - **Index-Based Simulation Architecture**
  - All predictions via SQL queries against VDFS index
  - No filesystem access during preview
  - Complete knowledge: Every file's size, location, relationships known
 - **Content-Aware Path Resolution**
  - For `SdPath::Content` operations, resolver evaluates all instances
  - Cost function weighs:
    - **Locality**: Local device = 0 cost (instant)
    - **Network proximity**: Iroh provides real-time latency measurements
    - **Storage tier**: SSD prioritized over HDD (from PhysicalClass)
    - **Device availability**: Only online devices considered
  - Lowest-cost path selected automatically
 - **Conflict Detection Categories**
  - **Storage constraints**: Calculate exact space requirements, verify availability
  - **Permission violations**: Check write access before committing
  - **Path conflicts**: Detect naming collisions in target directory
  - **Circular references**: Prevent moving parent into descendant
  - **Resource limitations**: Estimate memory/bandwidth vs. device capabilities
 - **Storage Tier Warnings**
  - Simulation detects PhysicalClass/LogicalClass mismatches
  - Example: User marks folder as "Hot" but it's on Cold HDD
  - Preview shows: "Warning: Operation targets hot location on slow archive drive"
 **Why This Matters:**
 The simulation engine prevents data loss and user frustration by catching problems before execution. Its power comes from having a complete index.
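The content-aware path resolution described above amounts to minimizing a cost function over the online instances of a file. A minimal sketch follows, assuming a hypothetical `Instance` record and arbitrary illustrative weights (the 50-point HDD penalty is not from the source):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StorageTier {
    Ssd,
    Hdd,
}

/// One known copy of a piece of content (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Instance {
    pub device: String,
    pub is_local: bool,
    pub online: bool,
    /// Measured round-trip latency to the device, in milliseconds.
    pub latency_ms: u32,
    pub tier: StorageTier,
}

/// Cost of reading from an instance; `None` means the instance is unusable.
/// The weights are illustrative assumptions.
pub fn cost(instance: &Instance) -> Option<u32> {
    if !instance.online {
        return None; // only online devices are considered
    }
    let network = if instance.is_local { 0 } else { instance.latency_ms };
    let tier_penalty = match instance.tier {
        StorageTier::Ssd => 0,
        StorageTier::Hdd => 50,
    };
    Some(network + tier_penalty)
}

/// Pick the lowest-cost usable instance, if any.
pub fn select_instance(instances: &[Instance]) -> Option<&Instance> {
    instances
        .iter()
        .filter_map(|i| cost(i).map(|c| (c, i)))
        .min_by_key(|(c, _)| *c)
        .map(|(_, i)| i)
}
```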
 ---
 ### 6. Networking ALPN Multiplexing (Section 4.5.2)
 **Lines:** 1320-1424
 **Problem:** Mentions Iroh but doesn't explain why protocol consolidation matters.
 **Required Changes:**
 - **ALPN Protocol Multiplexing Benefits**
  - Single QUIC connection per device pair
  - Multiple protocols as streams: pairing, sync, file transfer, messaging
  - Each protocol identified by ALPN string (e.g., "spacedrive/sync/1")
  - Stream-level routing, not connection-level
 - **Connection Efficiency Gains**
  - Single QUIC handshake per device pair instead of N separate handshakes
  - Shared congestion control across all operations
  - Connection reuse eliminates re-establishment overhead
  - Result: Sub-2-second connection establishment
 - **Deterministic Connection Initiation**
  - Only device with lower NodeId initiates outbound connection
  - Prevents race condition: Both devices trying to connect simultaneously
  - Simpler state machine: Each device knows its role
 - **Pairing Security Model**
  - BIP39 mnemonic codes (12 words; BIP39 encodes 128 bits of entropy in 12 words, so a 256-bit secret would need 24)
  - Challenge-response handshake (4 messages)
  - Ed25519 signatures for authentication
  - Prevents MITM during initial pairing
 **Why This Matters:**
 Treating all protocols as streams on one connection eliminates coordination overhead and connection races in P2P networks.
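The deterministic initiation rule is simple enough to state in a few lines. In this sketch `NodeId` is modeled as a 32-byte array compared lexicographically, which is an assumption about how IDs are ordered; `Role` and `connection_role` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Node identifier, modeled here as raw bytes with lexicographic ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct NodeId(pub [u8; 32]);

#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    /// This device opens the outbound connection.
    Initiator,
    /// This device waits for the peer to connect.
    Responder,
}

/// The device with the lower NodeId initiates. Returns `None` when both
/// IDs are equal, which would mean a device is trying to pair with itself.
pub fn connection_role(local: &NodeId, remote: &NodeId) -> Option<Role> {
    match local.cmp(remote) {
        Ordering::Less => Some(Role::Initiator),
        Ordering::Greater => Some(Role::Responder),
        Ordering::Equal => None,
    }
}
```

Both peers evaluate the same comparison with the arguments swapped, so exactly one of them ends up as the initiator without any coordination.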
 ---
 ## Phase 3: Important Context Additions
 ### 7. Ephemeral Mode Use Cases (Section 4.1.2)
 **Lines:** 393-394
 **Problem:** One-sentence mention doesn't convey the architectural significance.
 **Add (1 paragraph):**
 - **Three Ephemeral Scenarios**
  - Browsing external drives before formal indexing
  - Exploring peer device filesystems remotely
  - "Lazy refresh" during directory navigation
 - **Architectural Benefit**: Immediate metadata capability (tagging, organizing) even for unindexed files
 ---
 ### 8. Lightweight Embedding Models (Section 4.7)
 **Lines:** 1797-1948
 **Problem:** Doesn't emphasize these are SMALL models, not LLMs.
 **Clarify:**
 - **Model Scale Reality**
  - all-MiniLM-L6-v2: 22M parameters, 384 dimensions, 5MB model size
  - NOT GPT-scale (billions of parameters)
  - Specialized for semantic similarity, not text generation
 - **Performance Characteristics**
  - Runs efficiently on CPU (no GPU required)
  - Processes thousands of files/second during indexing
  - Real-time query embedding (<40ms)
 **Why This Matters:**
 The architecture is practical BECAUSE it doesn't require massive models or specialized hardware.
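For readers unfamiliar with embedding-based search, the core similarity operation is small. The sketch below shows cosine similarity with threshold filtering over in-memory vectors; `above_threshold` and the `(name, vector)` layout are hypothetical, and a real index would store 384-dimensional vectors produced by the model rather than the toy vectors a test would use.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns `None` for mismatched lengths, empty input, or zero vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Names of all candidates whose similarity to `query` meets `threshold`.
pub fn above_threshold<'a>(
    query: &[f32],
    candidates: &'a [(String, Vec<f32>)],
    threshold: f32,
) -> Vec<&'a str> {
    candidates
        .iter()
        .filter(|(_, v)| cosine_similarity(query, v).map_or(false, |s| s >= threshold))
        .map(|(name, _)| name.as_str())
        .collect()
}
```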
 ---
 ### 9. Volume Classification Benefits (Section 4.6)
 **Lines:** 1950-2075
 **Problem:** Describes classification but not why the complexity matters.
 **Add:**
 - **Platform-Specific Chaos**
  - macOS: APFS containers create multiple volumes from one physical drive
  - Linux: Virtual filesystems (/proc, /sys, /dev) clutter mount list
  - Windows: Hidden recovery partitions and system volumes
 - **Auto-Tracking Intelligence**
  - Filter ~10 system volumes → Show ~3 user-relevant volumes
  - Present semantic names: "Primary", "External", "Network"
  - Hide: System, VM, Preboot, Update partitions
 **Why This Matters:**
 Users see cleaned, meaningful volume lists instead of technical chaos. Reduces cognitive load.
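The filtering step can be pictured as a classification followed by a visibility check. This is a sketch only: the `VolumeKind` variants mirror the names in the bullets above, while `Volume`, `is_user_relevant`, and `user_visible` are hypothetical, and the platform-specific detection that assigns a kind to each mount is left out.

```rust
/// Volume categories named in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VolumeKind {
    Primary,
    External,
    Network,
    System,
    Vm,
    Preboot,
    Update,
}

impl VolumeKind {
    /// Only Primary, External, and Network volumes are shown to the user.
    pub fn is_user_relevant(self) -> bool {
        matches!(
            self,
            VolumeKind::Primary | VolumeKind::External | VolumeKind::Network
        )
    }
}

/// A mounted volume after classification (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Volume {
    pub mount_point: String,
    pub kind: VolumeKind,
}

/// Drop system-level volumes and keep the ones a user would care about.
pub fn user_visible(volumes: &[Volume]) -> Vec<&Volume> {
    volumes.iter().filter(|v| v.kind.is_user_relevant()).collect()
}
```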
 ---
 ### 10. Closure Table Performance (Section 4.8 / Database)
 **Lines:** 938-944
 **Problem:** Mentions closure table but not the performance win.
 **Add:**
 - **Traditional Hierarchical Query Problem**
  - Recursive CTEs: Multiple passes over data
  - LIKE-based path matching: O(n) table scan
  - Performance degrades with tree depth
 - **Closure Table Solution**
  - Pre-computed ancestor-descendant relationships
  - All hierarchy queries → Single indexed join
  - Constant number of index lookups per query (no recursion): Directory listing, size calculation, ancestor lookup
 - **Trade-off**
  - Additional storage: O(d × n) where d = tree depth, n = entries
  - Transactional updates: Insert self-closure + inherit parent closures
  - Benefit: Million-file libraries with sub-100ms hierarchy queries
 **Why This Matters:**
 This is why Spacedrive maintains responsiveness with massive libraries while traditional file managers slow down.
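The insert rule ("self-closure plus inherited parent closures") is easiest to see in code. The sketch below keeps the closure rows in two in-memory maps that play the role of database indexes; `ClosureTable` and its methods are hypothetical names, and entry IDs are plain `u32` values.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a closure table with two indexes.
#[derive(Default)]
pub struct ClosureTable {
    /// descendant -> [(ancestor, depth)]
    by_descendant: HashMap<u32, Vec<(u32, u32)>>,
    /// ancestor -> [(descendant, depth)]
    by_ancestor: HashMap<u32, Vec<(u32, u32)>>,
}

impl ClosureTable {
    /// Insert an entry: one self-row at depth 0, plus one row per
    /// ancestor of the parent with the depth increased by one.
    pub fn insert(&mut self, id: u32, parent: Option<u32>) {
        let mut links = vec![(id, 0)];
        if let Some(p) = parent {
            if let Some(parent_links) = self.by_descendant.get(&p) {
                links.extend(parent_links.iter().map(|&(anc, depth)| (anc, depth + 1)));
            }
        }
        for &(anc, depth) in &links {
            self.by_ancestor.entry(anc).or_default().push((id, depth));
        }
        self.by_descendant.insert(id, links);
    }

    /// Direct children (depth 1), sorted for stable output.
    pub fn children(&self, dir: u32) -> Vec<u32> {
        self.at_depth(dir, |depth| depth == 1)
    }

    /// All descendants at any depth, sorted for stable output.
    pub fn descendants(&self, dir: u32) -> Vec<u32> {
        self.at_depth(dir, |depth| depth > 0)
    }

    /// True if `ancestor` is a strict ancestor of `id`. This is the check
    /// that prevents moving a directory into its own descendant.
    pub fn is_ancestor(&self, ancestor: u32, id: u32) -> bool {
        self.by_descendant
            .get(&id)
            .map_or(false, |links| links.iter().any(|&(a, d)| a == ancestor && d > 0))
    }

    fn at_depth(&self, dir: u32, keep: impl Fn(u32) -> bool) -> Vec<u32> {
        let mut out: Vec<u32> = self
            .by_ancestor
            .get(&dir)
            .map(|rows| rows.iter().filter(|&&(_, d)| keep(d)).map(|&(id, _)| id).collect())
            .unwrap_or_default();
        out.sort_unstable();
        out
    }
}
```

Each query touches one map entry and never walks the tree, which mirrors the single indexed join in the database version; the cost is the extra row per ancestor written on insert.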
 ---
 ## Phase 4: Terminology & Accuracy Corrections
 ### 11. HLC Usage Clarification (Multiple Sections)
 **Problem:** Paper sometimes implies HLC is used for all sync.
 **Global Find/Replace Needed:**
 - Device-owned data (entries, locations, volumes): **Timestamp-based watermarks**
 - Shared data (tags, collections, user metadata): **HLC-based log**
 - Be explicit about which domain uses which mechanism
 - Update Section 4.5.1 sync domain table to clarify
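For reference when rewriting the shared-data description, a hybrid logical clock follows the standard update rules sketched below. This is a generic HLC, not Spacedrive's implementation: the `Hlc` struct and its method names are hypothetical, and a production clock would normally also carry a device ID to break ties between devices.

```rust
/// Hybrid logical clock timestamp: wall-clock milliseconds plus a logical
/// counter. Field order gives the derived ordering (wall time, then counter).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub wall_ms: u64,
    pub counter: u32,
}

impl Hlc {
    /// Generate a timestamp for a local event.
    pub fn tick(&mut self, now_ms: u64) -> Hlc {
        if now_ms > self.wall_ms {
            self.wall_ms = now_ms;
            self.counter = 0;
        } else {
            // Physical clock stalled or went backwards: bump the counter.
            self.counter += 1;
        }
        *self
    }

    /// Merge a timestamp received from a peer, so that the result orders
    /// after both the local clock and the remote timestamp.
    pub fn observe(&mut self, remote: Hlc, now_ms: u64) -> Hlc {
        let max_wall = now_ms.max(self.wall_ms).max(remote.wall_ms);
        self.counter = if max_wall == self.wall_ms && max_wall == remote.wall_ms {
            self.counter.max(remote.counter) + 1
        } else if max_wall == self.wall_ms {
            self.counter + 1
        } else if max_wall == remote.wall_ms {
            remote.counter + 1
        } else {
            0
        };
        self.wall_ms = max_wall;
        *self
    }
}
```

Device-owned data does not need this machinery because only one device ever writes it, which is why plain timestamp watermarks suffice there.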
 ---
 ### 12. Testing Framework Detail (Section 7)
 **Lines:** 2366-2415
 **Problem:** Underplays sophistication of distributed testing.
 **Add:**
 - **Subprocess Testing Architecture**
  - Tests spawn multiple Rust processes, each simulating a device
  - Environment variables control device roles (TEST_ROLE=alice)
  - Real P2P communication over loopback
 - **Realistic Scenarios Tested**
  - Full device pairing flows with authentication
  - Conflict detection and resolution
  - Network interruption recovery
  - Cross-device file transfers
 - **Scale**: 43 integration tests validate distributed system behavior that would be impossible with unit tests alone
 **Why This Matters:**
 Validates the distributed system ACTUALLY works, not just individual components in isolation.
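The subprocess pattern can be sketched with the standard library alone. The `TEST_ROLE` variable and the `alice` value come from the plan; the `bob` role, `DeviceRole`, and the helper names are assumptions for illustration, and the real test harness may be structured differently.

```rust
use std::process::{Child, Command};

/// Roles a test subprocess can take. `Bob` is an assumed second role.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceRole {
    Alice,
    Bob,
}

/// Map the raw `TEST_ROLE` value to a role; unknown or missing values
/// mean the process is the test orchestrator, not a simulated device.
pub fn parse_role(value: Option<&str>) -> Option<DeviceRole> {
    match value {
        Some("alice") => Some(DeviceRole::Alice),
        Some("bob") => Some(DeviceRole::Bob),
        _ => None,
    }
}

/// Read the role of the current process from the environment.
pub fn role_from_env() -> Option<DeviceRole> {
    parse_role(std::env::var("TEST_ROLE").ok().as_deref())
}

/// Re-run the current test binary as a child process acting as `role`.
pub fn spawn_device(role: &str) -> std::io::Result<Child> {
    Command::new(std::env::current_exe()?)
        .env("TEST_ROLE", role)
        .spawn()
}
```

The orchestrating test would call `spawn_device` once per simulated device and wait on the children, while each child branches on `role_from_env` to run its side of the scenario.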
 ---
 ## Phase 5: Content Removal & Cleanup
 ### 13. Remove Unnecessary Code Listings
 **Keep ONLY:**
 - SdPath enum (lines 458-474) - core abstraction
 - ONE SDK example for agent event handler (if needed for clarity)
 **Remove:**
 - Rust trait definitions (Job, JobHandler, etc.)
 - SQL schema code
 - File type TOML examples
 - JSON format examples
 - All other implementation snippets
 **Reasoning:** Paper explains architecture, not implementation. Code distracts from concepts.
 ---
 ### 14. Remove Benchmark Claims Outside Indexing
 **Scan for and remove:**
 - "Sub-100ms search" (not benchmarked)
 - "8,500 files/sec" (only indexing is benchmarked)
 - Network throughput numbers (not measured)
 - Any "X% faster" comparisons without data
 **Keep:**
 - Table 4.1 (Indexing benchmark data) - real measurements
 - Generic statements: "sub-second response times" (not specific numbers)
 ---
 ## Phase 6: Diagram Improvements
 ### 15. Sync Architecture Diagram (Section 4.5.1)
 **Current:** Text-heavy explanation
 **Improve:** Visual diagram showing:
 - Two sync domains (device-owned vs. shared)
 - Watermark exchange protocol flow
 - Escalation decision tree (catch-up → backfill)
 ---
 ### 16. Indexing Pipeline Diagram (Section 4.3)
 **Current (Fig 4.4):** Basic phase flow
 **Enhance:**
 - Checkpoint persistence after each phase
 - Resumability arrows showing restart path
 - Ephemeral mode as separate branch
 ---
 ## Document Conventions
 ### When Writing Updates
 1. **Start with "Why"**: Explain the problem being solved
 2. **Architecture over Implementation**: Focus on WHAT and WHY, not HOW
 3. **Be Precise**: Use exact technical terms, avoid vague descriptions
 4. **Cross-Reference**: Link related sections with `\ref{}`
 5. **Diagrams > Prose**: Visualize complex interactions when possible
 ### Review Checklist
 - [ ] No marketing language or superlatives?
 - [ ] No fake statistics or unmeasured performance claims?
 - [ ] No code examples (except SdPath + maybe one SDK example)?
 - [ ] Explains WHY design decisions were made?
 - [ ] Technically accurate and precise?
 - [ ] Consistent terminology with glossary (Appendix)?
 ---
 ## Priority Order for Implementation
 **Week 1: Critical Fixes**
 1. Section 4.5.1 - Library Sync (most architecturally wrong)
 2. Section 4.9.2 - WASM Extensions (remove incorrect Wire info)
 3. Section 4.3 - Indexing Resumability (missing key details)
 **Week 2: Major Expansions**
 4. Section 4.2 - Two-Tier Hashing
 5. Section 4.4 - Simulation Engine
 6. Section 4.5.2 - ALPN Multiplexing
 **Week 3: Polish**
 7-12. Minor additions and terminology fixes
 13-14. Remove unnecessary code and fake benchmarks
 15-16. Improve diagrams
 ---
 ## Notes
 - PhysicalClass/LogicalClass: Keep in paper (simulation engine needs it)
 - Extension Wire integration: Removed from plan (not current architecture)
 - Benchmarks: Only indexing section has real measurements
 - Writing style: Technical precision, no marketing fluff