From 73befe0f0f5bd3016956f59dbd49593878376784 Mon Sep 17 00:00:00 2001
From: Jamie Pine
Date: Mon, 1 Dec 2025 13:30:30 -0800
Subject: [PATCH] cleanup

---
 README.md                   |   2 +-
 core/crush.json             |  18 --
 core/master_key             |   1 -
 whitepaper/REVISION_PLAN.md | 471 ------------------------------------
 4 files changed, 1 insertion(+), 491 deletions(-)
 delete mode 100644 core/crush.json
 delete mode 100644 core/master_key
 delete mode 100644 whitepaper/REVISION_PLAN.md

diff --git a/README.md b/README.md
index 586b98417..7de04c970 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Organize files across multiple devices, clouds, and platforms from a single inte
 
 ---
 
-> [!NOTE] > **Hi, Jamie here!** This is Spacedrive v2 (December 2025)—a complete ground-up rewrite.
+> [!NOTE] **Hi, Jamie here!** This is Spacedrive v2 (December 2025)—a complete ground-up rewrite.
 >
 > After development of the original alpha version stopped in January this year, left with the hard lessons of the incomplete alpha, I set out to completely rebuild Spacedrive from the ground up. The first release **2.0.0-pre.1** is coming before Christmas.
 >
diff --git a/core/crush.json b/core/crush.json
deleted file mode 100644
index 170f0e769..000000000
--- a/core/crush.json
+++ /dev/null
@@ -1,18 +0,0 @@
-{
-  "$schema": "https://charm.land/crush.json",
-  "providers": {
-    "lmstudio": {
-      "name": "LM Studio",
-      "base_url": "http://localhost:1234/v1/",
-      "type": "openai",
-      "models": [
-        {
-          "name": "Qwen3 30B MOE",
-          "id": "local-model",
-          "context_window": 131072,
-          "default_max_tokens": 20000
-        }
-      ]
-    }
-  }
-}
\ No newline at end of file
diff --git a/core/master_key b/core/master_key
deleted file mode 100644
index 06d0163f5..000000000
--- a/core/master_key
+++ /dev/null
@@ -1 +0,0 @@
-1dd5b461c2ca70e1d328ece908142f6ab6f531dbfc55e320e0d2d503b0177eed
\ No newline at end of file
diff --git a/whitepaper/REVISION_PLAN.md b/whitepaper/REVISION_PLAN.md
deleted file mode 100644
index aab056bba..000000000
--- a/whitepaper/REVISION_PLAN.md
+++ /dev/null
@@ -1,471 +0,0 @@
-# Spacedrive Whitepaper Revision Plan
-
-**Last Updated:** 2025-01-21
-**Purpose:** Track architectural updates needed to align the whitepaper with the V2 implementation.
-
----
-
-## Editorial Guidelines for Updates
-
-### Writing Style
-- **Architecture-focused**: Explain WHAT systems do and WHY design decisions were made
-- **No code examples**: Exception for SdPath enum (core abstraction) and one SDK example if absolutely necessary
-- **No marketing language**: Avoid superlatives like "blazing fast", "revolutionary", "game-changing"
-- **No fake statistics**: Only cite real benchmarks from the indexing section
-- **No implementation status**: Never mention "planned", "in progress", "coming soon" - write as if complete
-- **Technical precision**: Use exact terminology, avoid vague descriptions
-- **Clarity over cleverness**: Straightforward explanations trump eloquent prose
-
-### What NOT to Include
-- Performance metrics beyond indexing benchmarks (we haven't measured them)
-- Code listings (except SdPath and possibly one SDK example)
-- Comparisons claiming "X% faster than Y" without data
-- Feature timelines or roadmap speculation
-- Implementation details (how it's coded vs. how it's architected)
-
-### Format Consistency
-- Use `\textbf{}` for emphasis, not italics in technical sections
-- Keep Key Takeaways boxes concise (3-4 bullets max)
-- Diagrams over lengthy prose where possible
-- Section cross-references using `\ref{}` consistently
-
----
-
-## Status Legend
-- **CRITICAL** - Architecturally incorrect, must fix
-- **MAJOR** - Missing significant architectural details
-- **MINOR** - Terminology tweaks or small additions
-- **REMOVE** - Content to delete or minimize
-
----
-
-## Phase 1: Critical Architectural Corrections
-
-### 1. Library Sync Architecture (Section 4.5.1)
-**Lines:** 1266-1318
-**Problem:** Describes sync too abstractly, missing the sophisticated watermark system that makes it reliable.
-
-**Required Changes:**
-- **Per-Resource Watermark Architecture**
-  - Explain sync tracks progress independently per resource type (location, entry, volume, tag, etc.)
-  - Enables surgical recovery: only re-sync resources with detected gaps
-  - Prevents cross-contamination: advancing location watermark doesn't affect entry sync
-
-- **Dual Watermark Strategy**
-  - **Cursor watermark**: Advances optimistically with each received record
-  - **Validated watermark**: Only advances after count verification passes
-  - On gap detection, reset cursor to validated watermark for surgical recovery
-
-- **Integrity Validation Mechanisms**
-  - **Count-based gap detection**: Compare expected vs. actual record counts per resource
-  - **Hash-based update detection**: Aggregated hash of resource data catches missed updates
-  - Both run during watermark exchange between peers
-
-- **Escalation Strategy**
-  - Normal flow: Incremental catch-up using watermarks
-  - After 5 consecutive catch-up failures: Escalate to full backfill
-  - Backfill completes → Reset watermarks → Return to incremental mode
-
-- **Watermark Exchange Protocol**
-  - Bidirectional negotiation when devices reconnect
-  - Each device sends: watermarks + counts + hashes for all resources
-  - Peer responds with: actual counts/hashes + needs_catchup flags
-  - Surgical recovery initiated for mismatched resources only
-
-**Why This Matters:**
-The watermark system is why Spacedrive can efficiently sync massive libraries without full re-indexing after network interruptions. It's a key architectural innovation over naive "send everything" approaches.
-
-**Remove:** Vague references to "efficient state-based replication" without explaining the mechanism.
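A minimal sketch of the dual-watermark bookkeeping described in Section 1, for reference while rewriting Section 4.5.1. All type, field, and method names here are illustrative rather than the actual core types; only the behaviour (optimistic cursor, validated watermark, count-based gap detection, escalation after 5 consecutive failures) comes from the notes above.

```rust
// Illustrative sketch only: per-resource sync progress with a dual watermark.
// Names are hypothetical; the real Spacedrive types are not shown here.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ResourceKind { Location, Entry, Volume, Tag }

#[derive(Default, Clone, Copy)]
struct Watermark {
    cursor: u64,    // advances optimistically with each received record
    validated: u64, // advances only after count verification passes
}

#[derive(Default)]
struct SyncProgress {
    marks: HashMap<ResourceKind, Watermark>,
    consecutive_failures: u32,
}

impl SyncProgress {
    /// Record a received record for one resource type (optimistic cursor advance).
    fn record_received(&mut self, kind: ResourceKind, position: u64) {
        self.marks.entry(kind).or_default().cursor = position;
    }

    /// Count-based gap detection during watermark exchange: if the peer's actual
    /// count matches what we expected, promote the cursor to validated; otherwise
    /// reset the cursor to the last validated position for surgical recovery.
    fn verify_counts(&mut self, kind: ResourceKind, expected: u64, actual: u64) -> bool {
        let mark = self.marks.entry(kind).or_default();
        if expected == actual {
            mark.validated = mark.cursor;
            self.consecutive_failures = 0;
            true
        } else {
            mark.cursor = mark.validated; // only this resource re-syncs
            self.consecutive_failures += 1;
            false
        }
    }

    /// Escalation strategy from the notes above: after 5 consecutive catch-up
    /// failures, fall back to a full backfill and reset the watermarks.
    fn should_backfill(&self) -> bool {
        self.consecutive_failures >= 5
    }
}
```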
-
----
-
-### 2. WASM Extension System (Section 4.9.2)
-**Lines:** 2601-2655
-**Problem:** Wire registry integration is incorrect - that's not the current plan. Need to focus on the actual WASM sandbox architecture.
-
-**Required Changes:**
-- **Remove:** All references to "single host function routing to Wire registry"
-- **Emphasize:** WASM provides security through complete sandboxing
-- **Focus on:** Capability-based permission model
-  - Extensions declare required permissions upfront
-  - Permissions: ReadEntries, WriteSidecars, UseModel, RegisterModel, DispatchJobs
-  - Rate limiting per extension (requests/minute)
-
-- **Memory Systems for AI Agents**
-  - **TemporalMemory**: Time-ordered event stream, supports `since()` queries
-  - **AssociativeMemory**: Semantic similarity search, similarity threshold filtering
-  - **WorkingMemory**: Current state and active plans
-  - Agents maintain persistent knowledge across restarts
-
-- **Event-Driven Architecture**
-  - `#[on_startup]`: Initialization hook
-  - `#[on_event(EntryCreated)]`: React to filesystem events
-  - `#[scheduled(cron = "...")]`: Time-based triggers
-  - `#[filter("...")]`: Entry filtering expressions
-
-**Keep ONE SDK Example:**
-Show Photos extension structure to illustrate event-driven agents:
-```rust
-#[agent]
-impl Photos {
-    #[on_event(EntryCreated)]
-    #[filter(".extension().is_image()")]
-    pub async fn on_new_photo(entry: Entry, ctx: &AgentContext);
-}
-```
-
-**Why This Matters:**
-The extension architecture enables domain-specific intelligence (Photos, Finance, Organization agents) while maintaining security through sandboxing.
-
----
-
-### 3. Indexing Engine Resumability (Section 4.3)
-**Lines:** 738-872
-**Problem:** Describes "multi-phase" abstractly without explaining what makes jobs actually resumable.
-
-**Required Changes:**
-- **Phase Separation Rationale**
-  - Each phase has distinct failure modes and I/O characteristics
-  - Discovery: Filesystem traversal (fails on permissions)
-  - Processing: Database writes (fails on constraint violations)
-  - Aggregation: Hierarchical calculations (fails on corrupted references)
-  - Content ID: File hashing (fails on file locks)
-
-- **Checkpoint Architecture**
-  - Jobs checkpoint after each batch (default: 1000 entries)
-  - State serialized with MessagePack (compact binary format)
-  - On crash/restart: Deserialize state → Resume from last checkpoint
-  - Checkpoint includes: phase, batch cursor, processed entry IDs
-
-- **Resumability Flow**
-  1. Job interrupted (crash, user cancel, device offline)
-  2. State persisted to `jobs.db` with last completed phase
-  3. On restart: Load serialized state from database
-  4. Jump to last completed phase, skip processed entries
-  5. Continue from checkpoint cursor
-
-- **Ephemeral Mode Architecture**
-  - In-memory Entry records for non-indexed paths
-  - Enables browsing external drives without permanent indexing
-  - Three use cases:
-    - Exploring removable media before adding as Location
-    - Remote filesystem browsing (peer device)
-    - "Lazy refresh" during directory navigation
-
-**Why This Matters:**
-Resumability is critical for mobile devices and large libraries where indexing can take hours and may be interrupted multiple times.
-
-**Enhance Diagram (Fig 4.4):** Add checkpoint persistence arrows and resumability flow.
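A minimal sketch of the checkpoint shape described in Section 3, assuming serde plus the `rmp-serde` crate for the MessagePack round trip; the actual crate and job types used by core are not shown here, and the field names are illustrative.

```rust
// Illustrative sketch only: the shape of a resumable-job checkpoint and its
// MessagePack round trip. Names are hypothetical, not the actual job types.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone, Copy, PartialEq, Debug)]
enum IndexPhase { Discovery, Processing, Aggregation, ContentId }

#[derive(Serialize, Deserialize, Debug)]
struct JobCheckpoint {
    phase: IndexPhase,             // last phase that was running
    batch_cursor: u64,             // position within the current batch sequence
    processed_entry_ids: Vec<u64>, // entries already handled in this phase
}

fn persist(checkpoint: &JobCheckpoint) -> Result<Vec<u8>, rmp_serde::encode::Error> {
    // Serialized after each batch (default: 1000 entries) and written to
    // jobs.db, so a crash never loses more than one batch of work.
    rmp_serde::to_vec(checkpoint)
}

fn resume(bytes: &[u8]) -> Result<JobCheckpoint, rmp_serde::decode::Error> {
    // On restart: deserialize, jump to the recorded phase, skip processed
    // entries, and continue from the batch cursor.
    rmp_serde::from_slice(bytes)
}
```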
-
----
-
-## Phase 2: Major Architectural Expansions
-
-### 4. Content Identity Two-Tier Hashing (Section 4.2)
-**Lines:** 643-735
-**Problem:** Mentions "integrity hash" but doesn't explain when/why it's generated separately.
-
-**Required Changes:**
-- **Performance vs. Security Trade-off**
-  - Initial indexing: Only sampled hash (first 16 chars of BLAKE3)
-  - Enables ~100× faster indexing (58KB read vs. full file)
-  - Full integrity hash generated lazily by background ValidationJobs
-
-- **Validation Architecture**
-  - ValidationJobs run during idle periods
-  - Generate complete BLAKE3 hash of entire file
-  - Compare against expected content_id
-  - Mismatch detection → Corruption alert + restoration from redundant copies
-
-- **When Full Integrity Matters**
-  - Large file transfers (verify no corruption)
-  - Backup verification (ensure bit-perfect copy)
-  - Forensic analysis (cryptographic proof of content)
-  - Security-sensitive files (detect tampering)
-
-**Why This Matters:**
-Separating "identity" (for deduplication) from "integrity" (for verification) allows instant indexing while preserving cryptographic guarantees when needed.
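A minimal sketch of the two tiers described in Section 4, using the `blake3` crate. The sampling layout (sample size and offsets) is hypothetical; only the ideas of a truncated 16-character identity hash for indexing versus a lazy full-file integrity hash come from the notes above.

```rust
// Illustrative sketch only: sampled content identity vs. full integrity hash.
// The real sampling layout in core is not reproduced here.
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const SAMPLE: usize = 16 * 1024; // hypothetical per-sample read size

/// Fast identity hash for indexing: a few bounded reads, truncated digest.
fn sampled_content_id(path: &str) -> io::Result<String> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    let mut hasher = blake3::Hasher::new();
    hasher.update(&len.to_le_bytes()); // length disambiguates same-prefix files

    let mut buf = vec![0u8; SAMPLE];
    // Hash a sample from the start and the end of the file (overlaps for tiny
    // files, which is fine for a sketch).
    for offset in [0, len.saturating_sub(SAMPLE as u64)] {
        file.seek(SeekFrom::Start(offset))?;
        let n = file.read(&mut buf)?;
        hasher.update(&buf[..n]);
    }
    let hex = hasher.finalize().to_hex();
    Ok(hex.as_str()[..16].to_string()) // "first 16 chars of BLAKE3"
}

/// Full integrity hash, generated lazily by background validation work.
fn integrity_hash(path: &str) -> io::Result<String> {
    let mut hasher = blake3::Hasher::new();
    io::copy(&mut File::open(path)?, &mut hasher)?; // streams the whole file
    Ok(hasher.finalize().to_hex().as_str().to_string())
}
```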
-
----
-
-### 5. Action System Simulation Details (Section 4.4)
-**Lines:** 945-1236
-**Problem:** Describes preview/commit but not HOW simulation achieves accuracy.
-
-**Required Changes:**
-- **Index-Based Simulation Architecture**
-  - All predictions via SQL queries against VDFS index
-  - No filesystem access during preview
-  - Complete knowledge: Every file's size, location, relationships known
-
-- **Content-Aware Path Resolution**
-  - For `SdPath::Content` operations, resolver evaluates all instances
-  - Cost function weighs:
-    - **Locality**: Local device = 0 cost (instant)
-    - **Network proximity**: Iroh provides real-time latency measurements
-    - **Storage tier**: SSD prioritized over HDD (from PhysicalClass)
-    - **Device availability**: Only online devices considered
-  - Lowest-cost path selected automatically
-
-- **Conflict Detection Categories**
-  - **Storage constraints**: Calculate exact space requirements, verify availability
-  - **Permission violations**: Check write access before committing
-  - **Path conflicts**: Detect naming collisions in target directory
-  - **Circular references**: Prevent moving parent into descendant
-  - **Resource limitations**: Estimate memory/bandwidth vs. device capabilities
-
-- **Storage Tier Warnings**
-  - Simulation detects PhysicalClass/LogicalClass mismatches
-  - Example: User marks folder as "Hot" but it's on Cold HDD
-  - Preview shows: "Warning: Operation targets hot location on slow archive drive"
-
-**Why This Matters:**
-The simulation engine prevents data loss and user frustration by catching problems before execution. Its power comes from having a complete index.
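A minimal sketch of the cost-function shape behind content-aware path resolution in Section 5. The weights and type names are invented for illustration; the real resolver works from index rows and live Iroh latency measurements rather than a hard-coded penalty.

```rust
// Illustrative sketch only: picking which copy of a content-addressed file to
// operate on. Types and weights are hypothetical.
#[derive(Clone, Copy)]
enum StorageTier { Ssd, Hdd }

struct CandidateInstance {
    is_local: bool,    // same device as the caller
    online: bool,      // device currently reachable
    latency_ms: u32,   // measured network proximity (0 for local)
    tier: StorageTier, // from PhysicalClass
}

/// Lower is better; `None` means the instance cannot be used at all.
fn cost(c: &CandidateInstance) -> Option<u64> {
    if !c.online {
        return None; // only online devices are considered
    }
    if c.is_local {
        return Some(0); // local copy wins outright
    }
    let tier_penalty = match c.tier {
        StorageTier::Ssd => 0,
        StorageTier::Hdd => 250, // hypothetical penalty for slow archive media
    };
    Some(c.latency_ms as u64 + tier_penalty)
}

/// Pick the lowest-cost instance ("lowest-cost path selected automatically").
fn resolve(instances: &[CandidateInstance]) -> Option<&CandidateInstance> {
    instances
        .iter()
        .filter_map(|c| cost(c).map(|k| (k, c)))
        .min_by_key(|(k, _)| *k)
        .map(|(_, c)| c)
}
```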
-
----
-
-### 6. Networking ALPN Multiplexing (Section 4.5.2)
-**Lines:** 1320-1424
-**Problem:** Mentions Iroh but doesn't explain why protocol consolidation matters.
-
-**Required Changes:**
-- **ALPN Protocol Multiplexing Benefits**
-  - Single QUIC connection per device pair
-  - Multiple protocols as streams: pairing, sync, file transfer, messaging
-  - Each protocol identified by ALPN string (e.g., "spacedrive/sync/1")
-  - Stream-level routing, not connection-level
-
-- **Connection Efficiency Gains**
-  - Single TCP/QUIC handshake instead of N handshakes
-  - Shared congestion control across all operations
-  - Connection reuse eliminates re-establishment overhead
-  - Result: Sub-2-second connection establishment
-
-- **Deterministic Connection Initiation**
-  - Only device with lower NodeId initiates outbound connection
-  - Prevents race condition: Both devices trying to connect simultaneously
-  - Simpler state machine: Each device knows its role
-
-- **Pairing Security Model**
-  - BIP39 mnemonic codes (12 words from 256-bit secret)
-  - Challenge-response handshake (4 messages)
-  - Ed25519 signatures for authentication
-  - Prevents MITM during initial pairing
-
-**Why This Matters:**
-Treating all protocols as streams on one connection eliminates coordination overhead and connection races in P2P networks.
-
----
-
-## Phase 3: Important Context Additions
-
-### 7. Ephemeral Mode Use Cases (Section 4.1.2)
-**Lines:** 393-394
-**Problem:** One-sentence mention doesn't convey the architectural significance.
-
-**Add (1 paragraph):**
-- **Three Ephemeral Scenarios**
-  - Browsing external drives before formal indexing
-  - Exploring peer device filesystems remotely
-  - "Lazy refresh" during directory navigation
-- **Architectural Benefit**: Immediate metadata capability (tagging, organizing) even for unindexed files
-
----
-
-### 8. Lightweight Embedding Models (Section 4.7)
-**Lines:** 1797-1948
-**Problem:** Doesn't emphasize these are SMALL models, not LLMs.
-
-**Clarify:**
-- **Model Scale Reality**
-  - all-MiniLM-L6-v2: 22M parameters, 384 dimensions, 5MB model size
-  - NOT GPT-scale (billions of parameters)
-  - Specialized for semantic similarity, not text generation
-
-- **Performance Characteristics**
-  - Runs efficiently on CPU (no GPU required)
-  - Processes thousands of files/second during indexing
-  - Real-time query embedding (<40ms)
-
-**Why This Matters:**
-The architecture is practical BECAUSE it doesn't require massive models or specialized hardware.
-
----
-
-### 9. Volume Classification Benefits (Section 4.6)
-**Lines:** 1950-2075
-**Problem:** Describes classification but not why the complexity matters.
-
-**Add:**
-- **Platform-Specific Chaos**
-  - macOS: APFS containers create multiple volumes from one physical drive
-  - Linux: Virtual filesystems (/proc, /sys, /dev) clutter mount list
-  - Windows: Hidden recovery partitions and system volumes
-
-- **Auto-Tracking Intelligence**
-  - Filter ~10 system volumes → Show ~3 user-relevant volumes
-  - Present semantic names: "Primary", "External", "Network"
-  - Hide: System, VM, Preboot, Update partitions
-
-**Why This Matters:**
-Users see cleaned, meaningful volume lists instead of technical chaos. Reduces cognitive load.
-
----
-
-### 10. Closure Table Performance (Section 4.8 / Database)
-**Lines:** 938-944
-**Problem:** Mentions closure table but not the performance win.
-
-**Add:**
-- **Traditional Hierarchical Query Problem**
-  - Recursive CTEs: Multiple passes over data
-  - LIKE-based path matching: O(n) table scan
-  - Performance degrades with tree depth
-
-- **Closure Table Solution**
-  - Pre-computed ancestor-descendant relationships
-  - All hierarchy queries → Single indexed join
-  - O(1) operations: Directory listing, size calculation, ancestor lookup
-
-- **Trade-off**
-  - Additional storage: O(d × n) where d = tree depth, n = entries
-  - Transactional updates: Insert self-closure + inherit parent closures
-  - Benefit: Million-file libraries with sub-100ms hierarchy queries
-
-**Why This Matters:**
-This is why Spacedrive maintains responsiveness with massive libraries while traditional file managers slow down.
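A minimal sketch of the closure-table idea from Section 10, with a hypothetical schema (not the actual Spacedrive tables) and `rusqlite` used only to make the example runnable. It shows the pre-computed ancestor-descendant rows and a subtree size query that is a single indexed join regardless of tree depth.

```rust
// Illustrative sketch only: closure-table schema and a depth-independent
// descendant query. Table and column names are hypothetical.
use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE entry (id INTEGER PRIMARY KEY, name TEXT, size INTEGER);
         -- One row per (ancestor, descendant) pair, including self-closures.
         CREATE TABLE entry_closure (
             ancestor_id   INTEGER NOT NULL,
             descendant_id INTEGER NOT NULL,
             depth         INTEGER NOT NULL,
             PRIMARY KEY (ancestor_id, descendant_id)
         );
         -- /root (1) -> /root/photos (2) -> /root/photos/cat.jpg (3)
         INSERT INTO entry VALUES (1, 'root', 0), (2, 'photos', 0), (3, 'cat.jpg', 42);
         INSERT INTO entry_closure VALUES
             (1,1,0), (2,2,0), (3,3,0),  -- self-closures
             (1,2,1),                    -- photos under root
             (1,3,2), (2,3,1);           -- cat.jpg under both ancestors",
    )?;

    // Recursive size of a subtree: one indexed join, no recursive CTE, and
    // the shape of the query does not change with tree depth.
    let total: i64 = conn.query_row(
        "SELECT COALESCE(SUM(e.size), 0)
         FROM entry_closure c JOIN entry e ON e.id = c.descendant_id
         WHERE c.ancestor_id = ?1",
        params![1],
        |row| row.get(0),
    )?;
    println!("bytes under /root: {total}");
    Ok(())
}
```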
-
----
-
-## Phase 4: Terminology & Accuracy Corrections
-
-### 11. HLC Usage Clarification (Multiple Sections)
-**Problem:** Paper sometimes implies HLC is used for all sync.
-
-**Global Find/Replace Needed:**
-- Device-owned data (entries, locations, volumes): **Timestamp-based watermarks**
-- Shared data (tags, collections, user metadata): **HLC-based log**
-- Be explicit about which domain uses which mechanism
-- Update Section 4.5.1 sync domain table to clarify
-
----
-
-### 12. Testing Framework Detail (Section 7)
-**Lines:** 2366-2415
-**Problem:** Underplays sophistication of distributed testing.
-
-**Add:**
-- **Subprocess Testing Architecture**
-  - Tests spawn multiple Rust processes, each simulating a device
-  - Environment variables control device roles (TEST_ROLE=alice)
-  - Real P2P communication over loopback
-
-- **Realistic Scenarios Tested**
-  - Full device pairing flows with authentication
-  - Conflict detection and resolution
-  - Network interruption recovery
-  - Cross-device file transfers
-
-- **Scale**: 43 integration tests validate distributed system behavior that would be impossible with unit tests alone
-
-**Why This Matters:**
-Validates the distributed system ACTUALLY works, not just individual components in isolation.
-
----
-
-## Phase 5: Content Removal & Cleanup
-
-### 13. Remove Unnecessary Code Listings
-
-**Keep ONLY:**
-- SdPath enum (lines 458-474) - core abstraction
-- ONE SDK example for agent event handler (if needed for clarity)
-
-**Remove:**
-- Rust trait definitions (Job, JobHandler, etc.)
-- SQL schema code
-- File type TOML examples
-- JSON format examples
-- All other implementation snippets
-
-**Reasoning:** Paper explains architecture, not implementation. Code distracts from concepts.
-
----
-
-### 14. Remove Benchmark Claims Outside Indexing
-
-**Scan for and remove:**
-- "Sub-100ms search" (not benchmarked)
-- "8,500 files/sec" (only indexing is benchmarked)
-- Network throughput numbers (not measured)
-- Any "X% faster" comparisons without data
-
-**Keep:**
-- Table 4.1 (Indexing benchmark data) - real measurements
-- Generic statements: "sub-second response times" (not specific numbers)
-
----
-
-## Phase 6: Diagram Improvements
-
-### 15. Sync Architecture Diagram (Section 4.5.1)
-**Current:** Text-heavy explanation
-**Improve:** Visual diagram showing:
-- Two sync domains (device-owned vs. shared)
-- Watermark exchange protocol flow
-- Escalation decision tree (catch-up → backfill)
-
----
-
-### 16. Indexing Pipeline Diagram (Section 4.3)
-**Current (Fig 4.4):** Basic phase flow
-**Enhance:**
-- Checkpoint persistence after each phase
-- Resumability arrows showing restart path
-- Ephemeral mode as separate branch
-
----
-
-## Document Conventions
-
-### When Writing Updates
-1. **Start with "Why"**: Explain the problem being solved
-2. **Architecture over Implementation**: Focus on WHAT and WHY, not HOW
-3. **Be Precise**: Use exact technical terms, avoid vague descriptions
-4. **Cross-Reference**: Link related sections with `\ref{}`
-5. **Diagrams > Prose**: Visualize complex interactions when possible
-
-### Review Checklist
-- [ ] No marketing language or superlatives?
-- [ ] No fake statistics or unmeasured performance claims?
-- [ ] No code examples (except SdPath + maybe one SDK example)?
-- [ ] Explains WHY design decisions were made?
-- [ ] Technically accurate and precise?
-- [ ] Consistent terminology with glossary (Appendix)?
-
----
-
-## Priority Order for Implementation
-
-**Week 1: Critical Fixes**
-1. Section 4.5.1 - Library Sync (most architecturally wrong)
-2. Section 4.9.2 - WASM Extensions (remove incorrect Wire info)
-3. Section 4.3 - Indexing Resumability (missing key details)
-
-**Week 2: Major Expansions**
-4. Section 4.2 - Two-Tier Hashing
-5. Section 4.4 - Simulation Engine
-6. Section 4.5.2 - ALPN Multiplexing
-
-**Week 3: Polish**
-7-12. Minor additions and terminology fixes
-13-14. Remove unnecessary code and fake benchmarks
-15-16. Improve diagrams
-
----
-
-## Notes
-- PhysicalClass/LogicalClass: Keep in paper (simulation engine needs it)
-- Extension Wire integration: Removed from plan (not current architecture)
-- Benchmarks: Only indexing section has real measurements
-- Writing style: Technical precision, no marketing fluff
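A minimal sketch of the subprocess testing pattern from Section 12 above: one test binary re-spawning itself in different device roles over loopback. The role value `TEST_ROLE=alice` comes from the notes; the helper names and orchestration are illustrative, not the actual test harness.

```rust
// Illustrative sketch only: a parent test process spawning child processes,
// each simulating one device, with roles passed via environment variables.
use std::process::{Child, Command};

fn spawn_device(role: &str) -> std::io::Result<Child> {
    // Each child simulates one device and would talk real P2P over loopback.
    Command::new(std::env::current_exe()?)
        .env("TEST_ROLE", role)
        .spawn()
}

fn main() -> std::io::Result<()> {
    if std::env::var("TEST_ROLE").is_ok() {
        // Child branch: act as the named device (pair, sync, transfer, ...).
        return Ok(());
    }
    // Parent branch: orchestrate a two-device pairing scenario.
    let mut alice = spawn_device("alice")?;
    let mut bob = spawn_device("bob")?;
    assert!(alice.wait()?.success());
    assert!(bob.wait()?.success());
    Ok(())
}
```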