mirror of https://github.com/spacedriveapp/spacedrive.git synced 2025-12-11 20:15:30 +01:00

Jamie Pine bd504a721e Add tagging UI and tag system components

2025-11-22 05:12:55 -08:00

18 KiB

Raw Blame History

Spacedrive Whitepaper Revision Plan

Last Updated: 2025-01-21 Purpose: Track architectural updates needed to align the whitepaper with the V2 implementation.

Editorial Guidelines for Updates

Writing Style

Architecture-focused: Explain WHAT systems do and WHY design decisions were made
No code examples: Exception for SdPath enum (core abstraction) and one SDK example if absolutely necessary
No marketing language: Avoid superlatives like "blazing fast", "revolutionary", "game-changing"
No fake statistics: Only cite real benchmarks from the indexing section
No implementation status: Never mention "planned", "in progress", "coming soon" - write as if complete
Technical precision: Use exact terminology, avoid vague descriptions
Clarity over cleverness: Straightforward explanations trump eloquent prose

What NOT to Include

Performance metrics beyond indexing benchmarks (we haven't measured them)
Code listings (except SdPath and possibly one SDK example)
Comparisons claiming "X% faster than Y" without data
Feature timelines or roadmap speculation
Implementation details (how it's coded vs. how it's architected)

Format Consistency

Use \textbf{} for emphasis, not italics in technical sections
Keep Key Takeaways boxes concise (3-4 bullets max)
Diagrams over lengthy prose where possible
Section cross-references using \ref{} consistently

Status Legend

CRITICAL - Architecturally incorrect, must fix
MAJOR - Missing significant architectural details
MINOR - Terminology tweaks or small additions
REMOVE - Content to delete or minimize

Phase 1: Critical Architectural Corrections

1. Library Sync Architecture (Section 4.5.1)

Lines: 1266-1318 Problem: Describes sync too abstractly, missing the sophisticated watermark system that makes it reliable.

Required Changes:

Per-Resource Watermark Architecture
- Explain sync tracks progress independently per resource type (location, entry, volume, tag, etc.)
- Enables surgical recovery: only re-sync resources with detected gaps
- Prevents cross-contamination: advancing location watermark doesn't affect entry sync
Dual Watermark Strategy
- Cursor watermark: Advances optimistically with each received record
- Validated watermark: Only advances after count verification passes
- On gap detection, reset cursor to validated watermark for surgical recovery
Integrity Validation Mechanisms
- Count-based gap detection: Compare expected vs. actual record counts per resource
- Hash-based update detection: Aggregated hash of resource data catches missed updates
- Both run during watermark exchange between peers
Escalation Strategy
- Normal flow: Incremental catch-up using watermarks
- After 5 consecutive catch-up failures: Escalate to full backfill
- Backfill completes → Reset watermarks → Return to incremental mode
Watermark Exchange Protocol
- Bidirectional negotiation when devices reconnect
- Each device sends: watermarks + counts + hashes for all resources
- Peer responds with: actual counts/hashes + needs_catchup flags
- Surgical recovery initiated for mismatched resources only

Why This Matters: The watermark system is why Spacedrive can efficiently sync massive libraries without full re-indexing after network interruptions. It's a key architectural innovation over naive "send everything" approaches.

Remove: Vague references to "efficient state-based replication" without explaining the mechanism.

2. WASM Extension System (Section 4.9.2)

Lines: 2601-2655 Problem: Wire registry integration is incorrect - that's not the current plan. Need to focus on the actual WASM sandbox architecture.

Required Changes:

Remove: All references to "single host function routing to Wire registry"
Emphasize: WASM provides security through complete sandboxing
Focus on: Capability-based permission model
- Extensions declare required permissions upfront
- Permissions: ReadEntries, WriteSidecars, UseModel, RegisterModel, DispatchJobs
- Rate limiting per extension (requests/minute)
Memory Systems for AI Agents
- TemporalMemory: Time-ordered event stream, supports since() queries
- AssociativeMemory: Semantic similarity search, similarity threshold filtering
- WorkingMemory: Current state and active plans
- Agents maintain persistent knowledge across restarts
Event-Driven Architecture
- #[on_startup]: Initialization hook
- #[on_event(EntryCreated)]: React to filesystem events
- #[scheduled(cron = "...")]: Time-based triggers
- #[filter("...")]: Entry filtering expressions

Keep ONE SDK Example: Show Photos extension structure to illustrate event-driven agents:

#[agent]
impl Photos {
    #[on_event(EntryCreated)]
    #[filter(".extension().is_image()")]
    pub async fn on_new_photo(entry: Entry, ctx: &AgentContext<PhotosMind>);
}

Why This Matters: The extension architecture enables domain-specific intelligence (Photos, Finance, Organization agents) while maintaining security through sandboxing.

3. Indexing Engine Resumability (Section 4.3)

Lines: 738-872 Problem: Describes "multi-phase" abstractly without explaining what makes jobs actually resumable.

Required Changes:

Phase Separation Rationale
- Each phase has distinct failure modes and I/O characteristics
- Discovery: Filesystem traversal (fails on permissions)
- Processing: Database writes (fails on constraint violations)
- Aggregation: Hierarchical calculations (fails on corrupted references)
- Content ID: File hashing (fails on file locks)
Checkpoint Architecture
- Jobs checkpoint after each batch (default: 1000 entries)
- State serialized with MessagePack (compact binary format)
- On crash/restart: Deserialize state → Resume from last checkpoint
- Checkpoint includes: phase, batch cursor, processed entry IDs
Resumability Flow
1. Job interrupted (crash, user cancel, device offline)
2. State persisted to jobs.db with last completed phase
3. On restart: Load serialized state from database
4. Jump to last completed phase, skip processed entries
5. Continue from checkpoint cursor
Ephemeral Mode Architecture
- In-memory Entry records for non-indexed paths
- Enables browsing external drives without permanent indexing
- Three use cases:
  - Exploring removable media before adding as Location
  - Remote filesystem browsing (peer device)
  - "Lazy refresh" during directory navigation

Why This Matters: Resumability is critical for mobile devices and large libraries where indexing can take hours and may be interrupted multiple times.

Enhance Diagram (Fig 4.4): Add checkpoint persistence arrows and resumability flow.

Phase 2: Major Architectural Expansions

4. Content Identity Two-Tier Hashing (Section 4.2)

Lines: 643-735 Problem: Mentions "integrity hash" but doesn't explain when/why it's generated separately.

Required Changes:

Performance vs. Security Trade-off
- Initial indexing: Only sampled hash (first 16 chars of BLAKE3)
- Enables ~100× faster indexing (58KB read vs. full file)
- Full integrity hash generated lazily by background ValidationJobs
Validation Architecture
- ValidationJobs run during idle periods
- Generate complete BLAKE3 hash of entire file
- Compare against expected content_id
- Mismatch detection → Corruption alert + restoration from redundant copies
When Full Integrity Matters
- Large file transfers (verify no corruption)
- Backup verification (ensure bit-perfect copy)
- Forensic analysis (cryptographic proof of content)
- Security-sensitive files (detect tampering)

Why This Matters: Separating "identity" (for deduplication) from "integrity" (for verification) allows instant indexing while preserving cryptographic guarantees when needed.

5. Action System Simulation Details (Section 4.4)

Lines: 945-1236 Problem: Describes preview/commit but not HOW simulation achieves accuracy.

Required Changes:

Index-Based Simulation Architecture
- All predictions via SQL queries against VDFS index
- No filesystem access during preview
- Complete knowledge: Every file's size, location, relationships known
Content-Aware Path Resolution
- For SdPath::Content operations, resolver evaluates all instances
- Cost function weighs:
  - Locality: Local device = 0 cost (instant)
  - Network proximity: Iroh provides real-time latency measurements
  - Storage tier: SSD prioritized over HDD (from PhysicalClass)
  - Device availability: Only online devices considered
- Lowest-cost path selected automatically
Conflict Detection Categories
- Storage constraints: Calculate exact space requirements, verify availability
- Permission violations: Check write access before committing
- Path conflicts: Detect naming collisions in target directory
- Circular references: Prevent moving parent into descendant
- Resource limitations: Estimate memory/bandwidth vs. device capabilities
Storage Tier Warnings
- Simulation detects PhysicalClass/LogicalClass mismatches
- Example: User marks folder as "Hot" but it's on Cold HDD
- Preview shows: "Warning: Operation targets hot location on slow archive drive"

Why This Matters: The simulation engine prevents data loss and user frustration by catching problems before execution. Its power comes from having a complete index.

6. Networking ALPN Multiplexing (Section 4.5.2)

Lines: 1320-1424 Problem: Mentions Iroh but doesn't explain why protocol consolidation matters.

Required Changes:

ALPN Protocol Multiplexing Benefits
- Single QUIC connection per device pair
- Multiple protocols as streams: pairing, sync, file transfer, messaging
- Each protocol identified by ALPN string (e.g., "spacedrive/sync/1")
- Stream-level routing, not connection-level
Connection Efficiency Gains
- Single TCP/QUIC handshake instead of N handshakes
- Shared congestion control across all operations
- Connection reuse eliminates re-establishment overhead
- Result: Sub-2-second connection establishment
Deterministic Connection Initiation
- Only device with lower NodeId initiates outbound connection
- Prevents race condition: Both devices trying to connect simultaneously
- Simpler state machine: Each device knows its role
Pairing Security Model
- BIP39 mnemonic codes (12 words from 256-bit secret)
- Challenge-response handshake (4 messages)
- Ed25519 signatures for authentication
- Prevents MITM during initial pairing

Why This Matters: Treating all protocols as streams on one connection eliminates coordination overhead and connection races in P2P networks.

Phase 3: Important Context Additions

7. Ephemeral Mode Use Cases (Section 4.1.2)

Lines: 393-394 Problem: One-sentence mention doesn't convey the architectural significance.

Add (1 paragraph):

Three Ephemeral Scenarios
- Browsing external drives before formal indexing
- Exploring peer device filesystems remotely
- "Lazy refresh" during directory navigation
Architectural Benefit: Immediate metadata capability (tagging, organizing) even for unindexed files

8. Lightweight Embedding Models (Section 4.7)

Lines: 1797-1948 Problem: Doesn't emphasize these are SMALL models, not LLMs.

Clarify:

Model Scale Reality
- all-MiniLM-L6-v2: 22M parameters, 384 dimensions, 5MB model size
- NOT GPT-scale (billions of parameters)
- Specialized for semantic similarity, not text generation
Performance Characteristics
- Runs efficiently on CPU (no GPU required)
- Processes thousands of files/second during indexing
- Real-time query embedding (<40ms)

Why This Matters: The architecture is practical BECAUSE it doesn't require massive models or specialized hardware.

9. Volume Classification Benefits (Section 4.6)

Lines: 1950-2075 Problem: Describes classification but not why the complexity matters.

Add:

Platform-Specific Chaos
- macOS: APFS containers create multiple volumes from one physical drive
- Linux: Virtual filesystems (/proc, /sys, /dev) clutter mount list
- Windows: Hidden recovery partitions and system volumes
Auto-Tracking Intelligence
- Filter ~10 system volumes → Show ~3 user-relevant volumes
- Present semantic names: "Primary", "External", "Network"
- Hide: System, VM, Preboot, Update partitions

Why This Matters: Users see cleaned, meaningful volume lists instead of technical chaos. Reduces cognitive load.

10. Closure Table Performance (Section 4.8 / Database)

Lines: 938-944 Problem: Mentions closure table but not the performance win.

Add:

Traditional Hierarchical Query Problem
- Recursive CTEs: Multiple passes over data
- LIKE-based path matching: O(n) table scan
- Performance degrades with tree depth
Closure Table Solution
- Pre-computed ancestor-descendant relationships
- All hierarchy queries → Single indexed join
- O(1) operations: Directory listing, size calculation, ancestor lookup
Trade-off
- Additional storage: O(d × n) where d = tree depth, n = entries
- Transactional updates: Insert self-closure + inherit parent closures
- Benefit: Million-file libraries with sub-100ms hierarchy queries

Why This Matters: This is why Spacedrive maintains responsiveness with massive libraries while traditional file managers slow down.

Phase 4: Terminology & Accuracy Corrections

11. HLC Usage Clarification (Multiple Sections)

Problem: Paper sometimes implies HLC is used for all sync.

Global Find/Replace Needed:

Device-owned data (entries, locations, volumes): Timestamp-based watermarks
Shared data (tags, collections, user metadata): HLC-based log
Be explicit about which domain uses which mechanism
Update Section 4.5.1 sync domain table to clarify

12. Testing Framework Detail (Section 7)

Lines: 2366-2415 Problem: Underplays sophistication of distributed testing.

Add:

Subprocess Testing Architecture
- Tests spawn multiple Rust processes, each simulating a device
- Environment variables control device roles (TEST_ROLE=alice)
- Real P2P communication over loopback
Realistic Scenarios Tested
- Full device pairing flows with authentication
- Conflict detection and resolution
- Network interruption recovery
- Cross-device file transfers
Scale: 43 integration tests validate distributed system behavior that would be impossible with unit tests alone

Why This Matters: Validates the distributed system ACTUALLY works, not just individual components in isolation.

Phase 5: Content Removal & Cleanup

13. Remove Unnecessary Code Listings

Keep ONLY:

SdPath enum (lines 458-474) - core abstraction
ONE SDK example for agent event handler (if needed for clarity)

Remove:

Rust trait definitions (Job, JobHandler, etc.)
SQL schema code
File type TOML examples
JSON format examples
All other implementation snippets

Reasoning: Paper explains architecture, not implementation. Code distracts from concepts.

14. Remove Benchmark Claims Outside Indexing

Scan for and remove:

"Sub-100ms search" (not benchmarked)
"8,500 files/sec" (only indexing is benchmarked)
Network throughput numbers (not measured)
Any "X% faster" comparisons without data

Keep:

Table 4.1 (Indexing benchmark data) - real measurements
Generic statements: "sub-second response times" (not specific numbers)

Phase 6: Diagram Improvements

15. Sync Architecture Diagram (Section 4.5.1)

Current: Text-heavy explanation Improve: Visual diagram showing:

Two sync domains (device-owned vs. shared)
Watermark exchange protocol flow
Escalation decision tree (catch-up → backfill)

16. Indexing Pipeline Diagram (Section 4.3)

Current (Fig 4.4): Basic phase flow Enhance:

Checkpoint persistence after each phase
Resumability arrows showing restart path
Ephemeral mode as separate branch

Document Conventions

When Writing Updates

Start with "Why": Explain the problem being solved
Architecture over Implementation: Focus on WHAT and WHY, not HOW
Be Precise: Use exact technical terms, avoid vague descriptions
Cross-Reference: Link related sections with \ref{}
Diagrams > Prose: Visualize complex interactions when possible

Review Checklist

No marketing language or superlatives?
No fake statistics or unmeasured performance claims?
No code examples (except SdPath + maybe one SDK example)?
Explains WHY design decisions were made?
Technically accurate and precise?
Consistent terminology with glossary (Appendix)?

Priority Order for Implementation

Week 1: Critical Fixes

Section 4.5.1 - Library Sync (most architecturally wrong)
Section 4.9.2 - WASM Extensions (remove incorrect Wire info)
Section 4.3 - Indexing Resumability (missing key details)

Week 2: Major Expansions 4. Section 4.2 - Two-Tier Hashing 5. Section 4.4 - Simulation Engine 6. Section 4.5.2 - ALPN Multiplexing

Week 3: Polish 7-12. Minor additions and terminology fixes 13-14. Remove unnecessary code and fake benchmarks 15-16. Improve diagrams

Notes

PhysicalClass/LogicalClass: Keep in paper (simulation engine needs it)
Extension Wire integration: Removed from plan (not current architecture)
Benchmarks: Only indexing section has real measurements
Writing style: Technical precision, no marketing fluff

18 KiB Raw Blame History Unescape Escape

Spacedrive Whitepaper Revision Plan

Editorial Guidelines for Updates

Writing Style

What NOT to Include

Format Consistency

Status Legend

Phase 1: Critical Architectural Corrections

1. Library Sync Architecture (Section 4.5.1)

2. WASM Extension System (Section 4.9.2)

3. Indexing Engine Resumability (Section 4.3)

Phase 2: Major Architectural Expansions

4. Content Identity Two-Tier Hashing (Section 4.2)

5. Action System Simulation Details (Section 4.4)

6. Networking ALPN Multiplexing (Section 4.5.2)

Phase 3: Important Context Additions

7. Ephemeral Mode Use Cases (Section 4.1.2)

8. Lightweight Embedding Models (Section 4.7)

9. Volume Classification Benefits (Section 4.6)

10. Closure Table Performance (Section 4.8 / Database)

Phase 4: Terminology & Accuracy Corrections

11. HLC Usage Clarification (Multiple Sections)

12. Testing Framework Detail (Section 7)

Phase 5: Content Removal & Cleanup

13. Remove Unnecessary Code Listings

14. Remove Benchmark Claims Outside Indexing

Phase 6: Diagram Improvements

15. Sync Architecture Diagram (Section 4.5.1)

16. Indexing Pipeline Diagram (Section 4.3)

Document Conventions

When Writing Updates

Review Checklist

Priority Order for Implementation

Notes

18 KiB

Raw Blame History