mirror of https://github.com/spacedriveapp/spacedrive.git synced 2025-12-11 20:15:30 +01:00

Jamie Pine b330807c0e Implement FS Event Pipeline Testing Guide and update Cargo.toml for CLI name change

- Added a comprehensive testing guide for the FS Event Pipeline, detailing metrics collection, logging, and troubleshooting steps.
- Updated the CLI package name from "spacedrive-cli" to "sd-cli" in Cargo.toml for consistency.
- Modified various files to reflect the new package structure and improve logging and metrics handling.

Co-authored-by: ijamespine <ijamespine@me.com>

2025-09-18 15:31:25 -07:00


Spacedrive Indexing System

Overview

The Spacedrive indexing system is a multi-phase file indexing engine designed for high performance and reliability. It discovers, processes, and categorizes files while supporting incremental updates, change detection, and content-based deduplication. It supports two indexing scopes (Current and Recursive) and two persistence modes (Persistent and Ephemeral) for different use cases.

Architecture

Core Components

IndexerJob - The main job orchestrator that manages the indexing process
IndexerState - Maintains state across phases for resumability
EntryProcessor - Handles database operations for file entries
FileTypeRegistry - Identifies file types through extensions and magic bytes
CasGenerator - Creates content-addressed storage identifiers

Key Features

Multi-phase Processing: Discovery → Processing → Aggregation → Content Identification
Resumable Operations: Jobs can be paused and resumed from checkpoints
Change Detection: Identifies new, modified, moved, and deleted files using inode tracking, with modified time and size as additional signals
Content Deduplication: Uses CAS (Content-Addressed Storage) IDs
Type Detection: File type identification from extensions and magic bytes, with MIME type support
Performance Optimized: Batch processing, caching, and parallel operations
Flexible Scoping: Current (single-level) vs Recursive (full tree) indexing
Ephemeral Mode: In-memory indexing for browsing external paths
Persistence Options: Database storage vs memory-only for different use cases

Indexing Phases

1. Discovery Phase

Walks the file system to discover all files and directories.

Key operations:
- Recursive directory traversal
- Filter application (skip system files, hidden files based on rules)
- Batch collection for efficient processing
- Progress tracking and reporting

Output: Batches of DirEntry items ready for processing
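
The traversal and batching described above can be sketched with the standard library alone. This is a minimal illustration, not the indexer's actual code: the `DirEntry` struct here is a simplified stand-in, the `discover` function and its parameters are hypothetical, and the only filter shown is a leading-dot check for hidden entries.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Simplified stand-in for the indexer's discovered-item type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct DirEntry {
    pub path: PathBuf,
    pub is_dir: bool,
    pub size: u64,
}

/// Walks `root` with an explicit stack and hands entries to `on_batch`
/// in groups of at most `batch_size`. Entries whose name starts with '.'
/// are skipped (and not descended into) when `skip_hidden` is set.
/// Returns the total number of entries discovered.
pub fn discover(
    root: &Path,
    batch_size: usize,
    skip_hidden: bool,
    mut on_batch: impl FnMut(Vec<DirEntry>),
) -> io::Result<usize> {
    let mut pending = vec![root.to_path_buf()];
    let mut batch = Vec::new();
    let mut total = 0;

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            if skip_hidden && entry.file_name().to_string_lossy().starts_with('.') {
                continue;
            }
            // DirEntry::metadata does not follow symlinks.
            let meta = entry.metadata()?;
            if meta.is_dir() {
                pending.push(entry.path());
            }
            batch.push(DirEntry {
                path: entry.path(),
                is_dir: meta.is_dir(),
                size: meta.len(),
            });
            total += 1;
            if batch.len() == batch_size {
                // Hand off the full batch and start a fresh one.
                on_batch(std::mem::take(&mut batch));
            }
        }
    }
    if !batch.is_empty() {
        on_batch(batch);
    }
    Ok(total)
}
```

An explicit stack is used instead of recursion so that very deep trees cannot overflow the call stack; a real implementation would also report progress between batches.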

2. Processing Phase

Creates or updates database entries for discovered items.

Key operations:
- Change detection using inode/modified time
- Materialized path storage (no parent_id needed)
- Entry creation/update in database
- Direct path storage for efficient queries

Output: Database entries with proper relationships

3. Aggregation Phase

Calculates aggregate statistics for directories.

Key operations:
- Bottom-up traversal of directory tree
- Calculate total sizes and file counts
- Update directory entries with aggregate data

Output: Directories with accurate size/count statistics
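
As a rough sketch of the bottom-up pass, the following computes per-directory totals from (parent directory, size) pairs. The `Aggregate` struct and the `aggregate` function are hypothetical names for illustration; paths are assumed to use '/' separators, with "" standing for the location root.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, Clone, PartialEq)]
pub struct Aggregate {
    pub aggregate_size: u64,
    pub file_count: u64,
}

/// Returns the parent directory of `dir`, or None for the root ("").
fn parent_of(dir: &str) -> Option<&str> {
    if dir.is_empty() {
        return None;
    }
    Some(dir.rsplit_once('/').map_or("", |(parent, _)| parent))
}

/// Number of path components; the root has depth 0.
fn depth(dir: &str) -> usize {
    if dir.is_empty() { 0 } else { dir.matches('/').count() + 1 }
}

/// `files` holds (parent directory path, file size) pairs.
pub fn aggregate(files: &[(&str, u64)]) -> HashMap<String, Aggregate> {
    let mut totals: HashMap<String, Aggregate> = HashMap::new();

    // Seed each directory with its direct files and register all ancestors.
    for &(dir, size) in files {
        let agg = totals.entry(dir.to_string()).or_default();
        agg.aggregate_size += size;
        agg.file_count += 1;
        let mut cur = dir;
        while let Some(parent) = parent_of(cur) {
            totals.entry(parent.to_string()).or_default();
            cur = parent;
        }
    }

    // Visit the deepest directories first so every child is complete
    // before its totals are folded into the parent.
    let mut dirs: Vec<String> = totals.keys().cloned().collect();
    dirs.sort_by_key(|d| std::cmp::Reverse(depth(d)));
    for dir in dirs {
        if let Some(parent) = parent_of(&dir) {
            let child = totals[&dir].clone();
            let parent_totals = totals.get_mut(parent).expect("ancestor was seeded");
            parent_totals.aggregate_size += child.aggregate_size;
            parent_totals.file_count += child.file_count;
        }
    }
    totals
}
```

Sorting by depth is one simple way to guarantee bottom-up order; the actual indexer may order its traversal differently.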

4. Content Identification Phase

Generates content identifiers and detects file types.

Key operations:
- CAS ID generation (sampled hashing)
- File type detection (extension + magic bytes)
- MIME type identification
- Content deduplication

Output: Content identities linked to entries

Indexing Scopes and Persistence

Index Scopes

The indexing system supports two different scopes for different use cases:

Current Scope

Description: Index only the specified directory (single level)
Use Cases: UI navigation, quick directory browsing, instant feedback
Performance: Targets <500ms for typical directories
Implementation: Direct directory read without recursion

let config = IndexerJobConfig::ui_navigation(location_id, path);
// Results in single-level scan optimized for UI responsiveness

Recursive Scope

Description: Index the directory and all subdirectories
Use Cases: Full location indexing, comprehensive file discovery
Performance: Depends on directory tree size
Implementation: Traditional recursive tree traversal

let config = IndexerJobConfig::new(location_id, path, mode);
// Default recursive behavior for complete coverage

Persistence Modes

Persistent Mode

Storage: Database (SQLite/PostgreSQL)
Use Cases: Managed locations, permanent indexing
Features: Full change detection, resumability, sync support
Lifecycle: Permanent until explicitly removed

// `mut` is required because the persistence field is assigned after construction
let mut config = IndexerJobConfig::new(location_id, path, mode);
config.persistence = IndexPersistence::Persistent;

Ephemeral Mode

Storage: Memory (EphemeralIndex)
Use Cases: External path browsing, temporary exploration
Features: No database writes, session-based caching
Lifecycle: Exists only during application session

let config = IndexerJobConfig::ephemeral_browse(path, scope);
// Results stored in memory, automatic cleanup

Enhanced Configuration

The new IndexerJobConfig provides fine-grained control:

pub struct IndexerJobConfig {
    pub location_id: Option<Uuid>,      // None for ephemeral
    pub path: SdPath,                   // Path to index
    pub mode: IndexMode,                // Shallow/Content/Deep
    pub scope: IndexScope,              // Current/Recursive
    pub persistence: IndexPersistence,  // Persistent/Ephemeral
    pub max_depth: Option<u32>,         // Depth limiting
}

Use Case Examples

UI Navigation

// Fast current directory scan for UI
let config = IndexerJobConfig::ui_navigation(location_id, path);
// - Scope: Current (single level)
// - Mode: Shallow (metadata only)
// - Persistence: Persistent
// - Target: <500ms response time

External Path Browsing

// Browse USB drive without adding to library
let config = IndexerJobConfig::ephemeral_browse(usb_path, IndexScope::Current);
// - Scope: Current or Recursive
// - Mode: Shallow (configurable)
// - Persistence: Ephemeral
// - Target: Exploration without database pollution

Background Location Indexing

// Traditional full location scan
let config = IndexerJobConfig::new(location_id, path, IndexMode::Deep);
// - Scope: Recursive (default)
// - Mode: Deep (full analysis)
// - Persistence: Persistent
// - Target: Complete coverage
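
To make the three presets concrete, here is a hedged sketch of how such constructors might fill in the fields, based only on the comments above. `JobConfigSketch` is a hypothetical mirror of `IndexerJobConfig`: it uses `u64` and `PathBuf` as stand-ins for `Uuid` and `SdPath`, and the real constructors may set fields differently.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IndexMode { Shallow, Content, Deep }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IndexScope { Current, Recursive }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IndexPersistence { Persistent, Ephemeral }

/// Hypothetical, simplified mirror of IndexerJobConfig.
#[derive(Debug, Clone, PartialEq)]
pub struct JobConfigSketch {
    pub location_id: Option<u64>, // stand-in for Option<Uuid>
    pub path: PathBuf,            // stand-in for SdPath
    pub mode: IndexMode,
    pub scope: IndexScope,
    pub persistence: IndexPersistence,
    pub max_depth: Option<u32>,
}

impl JobConfigSketch {
    /// Full location scan: recursive and persistent by default.
    pub fn new(location_id: u64, path: PathBuf, mode: IndexMode) -> Self {
        Self {
            location_id: Some(location_id),
            path,
            mode,
            scope: IndexScope::Recursive,
            persistence: IndexPersistence::Persistent,
            max_depth: None,
        }
    }

    /// Single-level, metadata-only scan for UI responsiveness.
    pub fn ui_navigation(location_id: u64, path: PathBuf) -> Self {
        Self {
            scope: IndexScope::Current,
            ..Self::new(location_id, path, IndexMode::Shallow)
        }
    }

    /// In-memory browse of a path that is not part of the library.
    pub fn ephemeral_browse(path: PathBuf, scope: IndexScope) -> Self {
        Self {
            location_id: None,
            path,
            mode: IndexMode::Shallow,
            scope,
            persistence: IndexPersistence::Ephemeral,
            max_depth: None,
        }
    }
}
```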

Ephemeral Index Structure

The EphemeralIndex provides temporary storage:

pub struct EphemeralIndex {
    pub entries: HashMap<PathBuf, EntryMetadata>,
    pub content_identities: HashMap<String, EphemeralContentIdentity>,
    pub created_at: Instant,
    pub last_accessed: Instant,
    pub root_path: PathBuf,
    pub stats: IndexerStats,
}

Features:

LRU Behavior: Automatic cleanup based on access time
Memory Efficient: Lightweight metadata storage
Session Scoped: Cleared on application restart
Fast Access: Direct HashMap lookups
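
The access-time cleanup could look roughly like the sketch below. `EphemeralCache` and `EphemeralIndexLite` are hypothetical names, the entry map is reduced to path and size, and the current time is passed in explicitly so the behavior is easy to test. It evicts by idle time only; a size-bounded LRU would need additional bookkeeping.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Reduced stand-in for EphemeralIndex: path -> size, plus access time.
pub struct EphemeralIndexLite {
    pub entries: HashMap<PathBuf, u64>,
    pub last_accessed: Instant,
}

/// Holds one in-memory index per browsed root path.
pub struct EphemeralCache {
    indexes: HashMap<PathBuf, EphemeralIndexLite>,
    max_idle_time: Duration,
}

impl EphemeralCache {
    pub fn new(max_idle_time: Duration) -> Self {
        Self { indexes: HashMap::new(), max_idle_time }
    }

    pub fn insert(&mut self, root: PathBuf, entries: HashMap<PathBuf, u64>, now: Instant) {
        self.indexes.insert(root, EphemeralIndexLite { entries, last_accessed: now });
    }

    /// Looks up an index and refreshes its access time.
    pub fn get(&mut self, root: &Path, now: Instant) -> Option<&EphemeralIndexLite> {
        let index = self.indexes.get_mut(root)?;
        index.last_accessed = now;
        Some(&*index)
    }

    /// Drops every index idle for longer than `max_idle_time`;
    /// returns how many were evicted.
    pub fn cleanup(&mut self, now: Instant) -> usize {
        let before = self.indexes.len();
        let max_idle = self.max_idle_time;
        self.indexes
            .retain(|_, index| now.duration_since(index.last_accessed) <= max_idle);
        before - self.indexes.len()
    }
}
```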

Database Schema

Core Tables

`entries`

The main file/directory entry table using materialized paths:

- id: i32 (primary key)
- uuid: UUID
- location_id: i32 (→ locations)
- relative_path: String (materialized path - parent directory path)
- name: String (filename without extension)
- kind: i32 (0=File, 1=Directory, 2=Symlink)
- extension: String?
- size: i64
- aggregate_size: i64 (for directories)
- child_count: i32
- file_count: i32
- inode: i64? (for change detection)
- content_id: i32? (→ content_identities)
- metadata_id: i32? (→ user_metadata)

Note: Parent-child relationships are determined by the relative_path field. For example:

A file at /documents/report.pdf has relative_path = "documents", name = "report", and extension = "pdf"
Its parent directory has relative_path = "" and name = "documents"
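
The split illustrated by this example can be sketched as a small helper. `split_entry_path` and `EntryPath` are hypothetical names, and the input is assumed to be a path relative to the location root.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
pub struct EntryPath {
    pub relative_path: String,     // parent directory path, "" at the root
    pub name: String,              // file name without extension
    pub extension: Option<String>, // None for directories
}

/// Splits a path inside a location into the materialized-path fields.
/// Returns None for an empty path or non-UTF-8 components.
pub fn split_entry_path(path_in_location: &str, is_dir: bool) -> Option<EntryPath> {
    let trimmed = path_in_location.trim_matches('/');
    let path = Path::new(trimmed);
    let file_name = path.file_name()?.to_str()?;
    let relative_path = path
        .parent()
        .and_then(Path::to_str)
        .unwrap_or("")
        .to_string();

    let (name, extension) = if is_dir {
        // Directory names are kept whole, even if they contain a dot.
        (file_name.to_string(), None)
    } else {
        (
            path.file_stem()?.to_str()?.to_string(),
            path.extension().and_then(|e| e.to_str()).map(str::to_string),
        )
    };
    Some(EntryPath { relative_path, name, extension })
}
```

With this layout, the direct children of a directory are the rows whose relative_path equals the directory's full path, so no parent_id join is needed.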

`content_identities`

Stores unique content for deduplication:

- id: i32 (primary key)
- uuid: UUID
- cas_id: String (content hash)
- cas_version: i16
- kind_id: i32 (→ content_kinds)
- mime_type_id: i32? (→ mime_types)
- total_size: i64
- entry_count: i32 (number of files with this content)
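
A minimal in-memory sketch of how entries might be linked to a shared content identity follows. `ContentIdentities` and `link_entry` are hypothetical; the real system does this in the database, and only the id, cas_id, and entry_count fields are modeled here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ContentIdentity {
    pub id: i32,
    pub cas_id: String,
    pub entry_count: i32,
}

/// In-memory stand-in for the content_identities table.
#[derive(Default)]
pub struct ContentIdentities {
    by_cas_id: HashMap<String, ContentIdentity>,
}

impl ContentIdentities {
    /// Links one entry to the identity for `cas_id`, creating the
    /// identity on first sight. Returns the identity's id, which the
    /// entry would store as its content_id.
    pub fn link_entry(&mut self, cas_id: &str) -> i32 {
        let next_id = self.by_cas_id.len() as i32 + 1;
        let identity = self
            .by_cas_id
            .entry(cas_id.to_string())
            .or_insert_with(|| ContentIdentity {
                id: next_id,
                cas_id: cas_id.to_string(),
                entry_count: 0,
            });
        identity.entry_count += 1;
        identity.id
    }

    pub fn get(&self, cas_id: &str) -> Option<&ContentIdentity> {
        self.by_cas_id.get(cas_id)
    }
}
```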

`content_kinds`

Static lookup table for content types:

- id: i32 (primary key, matches enum)
- name: String

Values:
0 = unknown
1 = image
2 = video
3 = audio
4 = document
5 = archive
6 = code
7 = text
8 = database
9 = book
10 = font
11 = mesh
12 = config
13 = encrypted
14 = key
15 = executable
16 = binary

`mime_types`

Dynamic table for discovered MIME types:

- id: i32 (primary key)
- uuid: UUID (for syncing)
- mime_type: String (unique)
- created_at: DateTime

File Type Detection

The system uses a multi-layered approach:

Extension Matching: Fast initial identification
Magic Bytes: Verifies file type by reading file headers
Content Analysis: For text files, analyzes content patterns
MIME Type Detection: Associates standard MIME types

Example flow:

let registry = FileTypeRegistry::default();
let result = registry.identify(path).await?;
// Returns: FileType with category, MIME types, and confidence level
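
To illustrate how the extension and magic-byte layers can combine, here is a small standalone sketch. It is not the `FileTypeRegistry` implementation: `identify`, `Detected`, and `Confidence` are hypothetical, and only four well-known signatures (PNG, JPEG, PDF, ZIP) are included.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Confidence {
    Extension,  // guessed from the file name only
    MagicBytes, // confirmed by the file header
}

#[derive(Debug, PartialEq)]
pub struct Detected {
    pub mime: &'static str,
    pub confidence: Confidence,
}

/// Matches the leading bytes of a file against known signatures.
fn mime_for_magic(header: &[u8]) -> Option<&'static str> {
    if header.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else if header.starts_with(b"\xFF\xD8\xFF") {
        Some("image/jpeg")
    } else if header.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if header.starts_with(b"PK\x03\x04") {
        Some("application/zip")
    } else {
        None
    }
}

/// Case-insensitive extension lookup.
fn mime_for_extension(ext: &str) -> Option<&'static str> {
    match ext.to_ascii_lowercase().as_str() {
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        "pdf" => Some("application/pdf"),
        "zip" => Some("application/zip"),
        _ => None,
    }
}

/// Prefers the header signature when one matches; otherwise falls
/// back to the extension with lower confidence.
pub fn identify(extension: Option<&str>, header: &[u8]) -> Option<Detected> {
    if let Some(mime) = mime_for_magic(header) {
        return Some(Detected { mime, confidence: Confidence::MagicBytes });
    }
    extension
        .and_then(mime_for_extension)
        .map(|mime| Detected { mime, confidence: Confidence::Extension })
}
```

In this sketch a header match overrides a misleading extension, which is why a PDF renamed to .txt is still reported as application/pdf.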

Content Addressing (CAS)

The CAS system creates unique identifiers for file content:

Sampled Hashing: Reads chunks at specific offsets
BLAKE3 Hashing: Fast, cryptographically secure
Deduplication: Same content = same CAS ID

Benefits:

Instant duplicate detection
Content verification
Efficient storage references
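
The idea of sampled hashing can be sketched as follows. To stay dependency-free, this example swaps in the standard library's `DefaultHasher` in place of BLAKE3; that hasher is neither cryptographic nor stable across Rust releases, so it only illustrates the sampling. The function name, the sample size and count parameters, and the evenly spaced offsets are all assumptions, not the `CasGenerator` algorithm.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};

/// Hashes the content length plus `sample_count` evenly spaced chunks of
/// `sample_size` bytes. Small inputs are hashed in full.
pub fn sampled_id<R: Read + Seek>(
    reader: &mut R,
    sample_size: u64,
    sample_count: u64,
) -> io::Result<u64> {
    let len = reader.seek(SeekFrom::End(0))?;
    let mut hasher = DefaultHasher::new(); // stand-in for BLAKE3
    hasher.write_u64(len);

    if sample_count < 2 || len <= sample_size * sample_count {
        // Too small to sample: hash everything.
        reader.seek(SeekFrom::Start(0))?;
        let mut all = Vec::new();
        reader.read_to_end(&mut all)?;
        hasher.write(&all);
    } else {
        let mut buf = vec![0u8; sample_size as usize];
        let last_start = len - sample_size;
        for i in 0..sample_count {
            // First sample at offset 0, last sample ending at EOF.
            let offset = last_start * i / (sample_count - 1);
            reader.seek(SeekFrom::Start(offset))?;
            reader.read_exact(&mut buf)?;
            hasher.write(&buf);
        }
    }
    Ok(hasher.finish())
}
```

The trade-off of any sampled scheme is that two files of equal length differing only outside the sampled regions produce the same identifier, which is the price paid for not reading large files in full.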

Change Detection

The indexer efficiently detects changes using:

Inode Tracking: Platform-specific file identifiers
Modified Time: Fallback for systems without inodes
Size Comparison: Quick change indicator

Change types detected:

New files
Modified files
Deleted files
Moved files (same inode, different path)
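
A simplified sketch of classifying these four change types by comparing the indexed state with a fresh scan is shown below. `Snapshot`, `Change`, and `detect_changes` are hypothetical names; the sketch assumes inodes are unique within the location and ignores inode reuse.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub path: String,
    pub inode: u64,
    pub size: u64,
    pub modified_secs: u64,
}

#[derive(Debug, PartialEq)]
pub enum Change {
    New(String),
    Modified(String),
    Moved { from: String, to: String },
    Deleted(String),
}

/// Compares previously indexed entries with what is on disk now.
pub fn detect_changes(indexed: &[Snapshot], on_disk: &[Snapshot]) -> Vec<Change> {
    let by_inode: HashMap<u64, &Snapshot> =
        indexed.iter().map(|s| (s.inode, s)).collect();
    let mut seen = HashSet::new();
    let mut changes = Vec::new();

    for current in on_disk {
        match by_inode.get(&current.inode) {
            None => changes.push(Change::New(current.path.clone())),
            Some(old) => {
                seen.insert(old.inode);
                // Same inode at a different path means the file moved.
                if old.path != current.path {
                    changes.push(Change::Moved {
                        from: old.path.clone(),
                        to: current.path.clone(),
                    });
                }
                // Size or timestamp drift means the content may have changed.
                if old.size != current.size || old.modified_secs != current.modified_secs {
                    changes.push(Change::Modified(current.path.clone()));
                }
            }
        }
    }
    // Anything indexed but no longer on disk was deleted.
    for old in indexed {
        if !seen.contains(&old.inode) {
            changes.push(Change::Deleted(old.path.clone()));
        }
    }
    changes
}
```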

Performance Optimizations

Batch Processing

Processes files in chunks of 1000
Reduces database round trips
Improves memory efficiency
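
The effect of chunking on round trips can be shown with a few lines. `write_in_batches` is a hypothetical helper; the closure stands in for a bulk database insert.

```rust
/// Default chunk size taken from the text above.
pub const BATCH_SIZE: usize = 1000;

/// Calls `write_batch` once per chunk and returns the number of calls,
/// i.e. the number of database round trips a bulk insert would need.
pub fn write_in_batches<T>(
    items: &[T],
    batch_size: usize,
    mut write_batch: impl FnMut(&[T]),
) -> usize {
    let mut round_trips = 0;
    // `chunks` panics on a size of 0, so clamp to at least 1.
    for batch in items.chunks(batch_size.max(1)) {
        write_batch(batch);
        round_trips += 1;
    }
    round_trips
}
```

With 2,500 items and the default size, this performs three writes (1000, 1000, 500) instead of 2,500 single-row writes.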

Scope Optimizations

Current Scope: Direct directory read without recursion (<500ms target)
Recursive Scope: Efficient tree traversal with depth control
Ephemeral Mode: Memory-only storage for external path browsing
Early Termination: Configurable max_depth limiting

Caching

Entry ID cache for parent lookups
Change detection cache for inode/timestamp comparisons
Ephemeral index LRU cache for session-based storage
Content identity cache for deduplication

Parallelization

Concurrent CAS ID generation
Parallel file type detection
Async I/O operations
Batch processing across multiple threads

Database Optimizations

Bulk inserts with transaction batching
Prepared statements for repeated operations
Strategic indexing on location_id and relative_path
Persistence abstraction for database vs memory storage

Usage Examples

Enhanced Indexing Jobs

use sd_core::operations::indexing::{
    IndexerJob, IndexerJobConfig, IndexMode, IndexScope, IndexPersistence
};

// UI Navigation - Fast current directory scan
let config = IndexerJobConfig::ui_navigation(location_id, path);
let job = IndexerJob::new(config);
let handle = library.jobs().dispatch(job).await?;

// Ephemeral Browsing - External path exploration
let config = IndexerJobConfig::ephemeral_browse(external_path, IndexScope::Current);
let job = IndexerJob::new(config);
let handle = library.jobs().dispatch(job).await?;

// Traditional Location Indexing - Full recursive scan
let config = IndexerJobConfig::new(location_id, path, IndexMode::Deep);
let job = IndexerJob::new(config);
let handle = library.jobs().dispatch(job).await?;

// Custom Configuration - Fine-grained control
let mut config = IndexerJobConfig::new(location_id, path, IndexMode::Content);
config.scope = IndexScope::Recursive;
config.max_depth = Some(2); // Limit recursion to two levels
let job = IndexerJob::new(config);

Legacy API (Backward Compatibility)

// Old API still works for simple cases
let job = IndexerJob::from_location(location_id, path, IndexMode::Deep);
let job = IndexerJob::shallow(location_id, path);
let job = IndexerJob::with_content(location_id, path);

Indexing Modes

Shallow: Metadata only (fastest, <500ms for UI)
Content: Includes CAS ID generation (moderate performance)
Deep: Full analysis including thumbnails (comprehensive)

Indexing Scopes

Current: Single directory level (UI navigation, quick browsing)
Recursive: Full directory tree (complete location indexing)

Persistence Options

Persistent: Database storage (managed locations, permanent data)
Ephemeral: Memory storage (external browsing, temporary exploration)

Metrics and Monitoring

The indexer tracks detailed metrics:

struct IndexerMetrics {
    total_items: u64,
    items_per_second: f64,
    bytes_per_second: f64,
    phase_durations: HashMap<String, Duration>,
    db_operations: (u64, u64), // (reads, writes)
    cache_stats: CacheStats,
}
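
As an illustration of how the rate fields might be derived from the raw counters, the sketch below divides totals by the summed phase durations. The `throughput` function is hypothetical, and it assumes the phases run one after another so that their durations add up to the total elapsed time.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Returns (items_per_second, bytes_per_second), or None when no time
/// has been recorded yet.
pub fn throughput(
    total_items: u64,
    total_bytes: u64,
    phase_durations: &HashMap<String, Duration>,
) -> Option<(f64, f64)> {
    // Assumption: phases are sequential, so their durations sum to the
    // total elapsed time.
    let elapsed: Duration = phase_durations.values().sum();
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return None;
    }
    Some((total_items as f64 / secs, total_bytes as f64 / secs))
}
```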

Error Handling

Critical Errors

Stop indexing immediately:

Database connection lost
Filesystem errors
Permission denied on location root

Non-Critical Errors

Logged but indexing continues:

Permission denied on individual files
Corrupted file metadata
Unsupported file types
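
The critical versus non-critical split could be modeled roughly as below. The `IndexError` enum and `run_items` function are hypothetical and only cover the cases listed above; the real error types are not shown in this document.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum IndexError {
    DatabaseConnectionLost,
    Filesystem(String),
    PermissionDenied { path: String, is_location_root: bool },
    CorruptedMetadata(String),
    UnsupportedFileType(String),
}

impl IndexError {
    /// Critical errors abort the job; the rest are logged and skipped.
    pub fn is_critical(&self) -> bool {
        match self {
            IndexError::DatabaseConnectionLost | IndexError::Filesystem(_) => true,
            IndexError::PermissionDenied { is_location_root, .. } => *is_location_root,
            IndexError::CorruptedMetadata(_) | IndexError::UnsupportedFileType(_) => false,
        }
    }
}

/// Processes per-item results. Stops at the first critical error;
/// otherwise returns the non-critical errors that were collected.
pub fn run_items(
    results: impl IntoIterator<Item = Result<(), IndexError>>,
) -> Result<Vec<IndexError>, IndexError> {
    let mut non_critical = Vec::new();
    for result in results {
        match result {
            Ok(()) => {}
            Err(err) if err.is_critical() => return Err(err),
            Err(err) => non_critical.push(err),
        }
    }
    Ok(non_critical)
}
```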

Future Enhancements

Thumbnail Generation: Integrated media thumbnail creation
Full-Text Indexing: Search within documents
AI Tagging: Automatic content categorization
Cloud Integration: Index cloud storage locations
Real-time Monitoring: Instant file change detection
Distributed Indexing: Multi-device collaborative indexing

Configuration

Filter Rules

struct IndexerRules {
    skip_hidden: bool,
    skip_system: bool,
    max_file_size: Option<u64>,
    allowed_extensions: Option<Vec<String>>,
    ignored_paths: Vec<PathBuf>,
}
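
One way such rules might be evaluated per entry is sketched here. The struct mirrors the fields above, but `should_index` is a hypothetical method, "hidden" is approximated as a leading dot, and the short list of system file names is an illustrative assumption rather than the project's actual list.

```rust
use std::path::{Path, PathBuf};

pub struct IndexerRules {
    pub skip_hidden: bool,
    pub skip_system: bool,
    pub max_file_size: Option<u64>,
    pub allowed_extensions: Option<Vec<String>>,
    pub ignored_paths: Vec<PathBuf>,
}

/// Illustrative only: a few common OS-generated files.
const SYSTEM_FILE_NAMES: &[&str] = &[".DS_Store", "Thumbs.db", "desktop.ini"];

impl IndexerRules {
    /// Returns true when the entry passes every configured rule.
    pub fn should_index(&self, path: &Path, size: u64, is_dir: bool) -> bool {
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");

        if self.skip_system && SYSTEM_FILE_NAMES.iter().any(|s| *s == name) {
            return false;
        }
        if self.skip_hidden && name.starts_with('.') {
            return false;
        }
        // Component-wise prefix match, so "/a/b" ignores "/a/b/c" but not "/a/bc".
        if self.ignored_paths.iter().any(|ignored| path.starts_with(ignored)) {
            return false;
        }
        if is_dir {
            // Size and extension rules apply to files only.
            return true;
        }
        if self.max_file_size.map_or(false, |max| size > max) {
            return false;
        }
        match (&self.allowed_extensions, path.extension().and_then(|e| e.to_str())) {
            (None, _) => true,
            (Some(allowed), Some(ext)) => allowed.iter().any(|a| a.eq_ignore_ascii_case(ext)),
            (Some(_), None) => false,
        }
    }
}
```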

Performance Tuning

struct IndexerConfig {
    batch_size: usize,        // Default: 1000
    checkpoint_interval: u64, // Default: 5000 items
    max_concurrent_io: usize, // Default: 100
    enable_content_id: bool,  // Default: true
}

// Enhanced configuration with scope and persistence
struct IndexerJobConfig {
    location_id: Option<Uuid>,         // None for ephemeral jobs
    path: SdPath,                      // Target path
    mode: IndexMode,                   // Shallow/Content/Deep
    scope: IndexScope,                 // Current/Recursive
    persistence: IndexPersistence,     // Persistent/Ephemeral
    max_depth: Option<u32>,            // Depth limiting for performance
}

// Ephemeral index settings
struct EphemeralConfig {
    max_entries: usize,                // Default: 10000
    cleanup_interval: Duration,        // Default: 5 minutes
    max_idle_time: Duration,           // Default: 30 minutes
    enable_content_analysis: bool,     // Default: false
}

Integration Points

The indexer integrates with:

Location System: Manages indexed locations
Job System: Provides resumability and progress
Event System: Emits progress and completion events
Sync System: Shares indexed data across devices
Search System: Powers file search functionality

Best Practices

Start with Shallow Mode: For initial quick results
Use Filters: Skip unnecessary files (node_modules, etc.)
Monitor Progress: Subscribe to indexing events
Handle Errors Gracefully: Check non-critical error counts
Regular Re-indexing: Schedule periodic deep scans

Technical Details

State Persistence

The indexer state is serialized using MessagePack for efficient storage and quick resume operations.

Memory Management

Streaming file processing (no full file loads)
Bounded channels for backpressure
Automatic batch flushing
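
Backpressure and batch flushing can be demonstrated with the standard library's bounded channel. This sketch uses `std::sync::mpsc::sync_channel` and a plain thread purely for illustration; the indexer itself is async and its channel types are not shown in this document. The `pipeline` function and its parameters are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Streams `items` through a channel holding at most `capacity` values
/// and consumes them in batches. Returns (sum of items, number of flushes).
pub fn pipeline(items: Vec<u64>, capacity: usize, batch_size: usize) -> (u64, usize) {
    let (tx, rx) = sync_channel::<u64>(capacity);

    let producer = thread::spawn(move || {
        for item in items {
            // `send` blocks while the channel is full, so a slow consumer
            // throttles the producer instead of letting memory grow.
            if tx.send(item).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the consumer loop.
    });

    let mut total = 0;
    let mut flushes = 0;
    let mut batch = Vec::with_capacity(batch_size);
    for item in rx {
        batch.push(item);
        if batch.len() >= batch_size {
            total += batch.drain(..).sum::<u64>();
            flushes += 1;
        }
    }
    // Flush whatever is left once the producer has finished.
    if !batch.is_empty() {
        total += batch.iter().sum::<u64>();
        flushes += 1;
    }
    producer.join().expect("producer thread panicked");
    (total, flushes)
}
```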

Platform Support

Windows: Uses file index for inode equivalent
macOS: Native inode support
Linux: Full inode and permission tracking

CLI Usage

The CLI exposes the indexing system, including its scope and persistence options:

Enhanced Index Commands

# Start the daemon first
spacedrive start

# Quick scan for UI navigation (fast, current directory only)
spacedrive index quick-scan ~/Documents --scope current

# Quick scan with ephemeral mode (no database writes)
spacedrive index quick-scan /external/drive --scope current --ephemeral

# Browse external paths without adding to managed locations
spacedrive index browse /media/usb-drive --scope current
spacedrive index browse /network/share --scope recursive --content

# Index managed locations with specific scope and mode
spacedrive index location ~/Pictures --scope current --mode shallow
spacedrive index location <location-uuid> --scope recursive --mode deep

Location Management

# Add locations with different indexing modes
spacedrive location add ~/Documents --mode shallow    # Fast metadata only
spacedrive location add ~/Pictures --mode content     # With content hashing
spacedrive location add ~/Videos --mode deep          # Full media analysis

# Force re-indexing of a location
spacedrive location rescan <location-id> --force

Legacy Commands (Backward Compatibility)

# Traditional indexing (creates location and starts full scan)
spacedrive scan ~/Desktop --mode content --watch

Monitoring and Status

# Monitor indexing progress in real-time
spacedrive job monitor

# Check job status with scope/persistence info
spacedrive job list --status running

# Get detailed job information
spacedrive job info <job-id>

Command Comparison

| Command | Scope | Persistence | Use Case |
| --- | --- | --- | --- |
| `index quick-scan` | Current/Recursive | Persistent/Ephemeral | UI navigation, quick browsing |
| `index browse` | Current/Recursive | Ephemeral | External path exploration |
| `index location` | Current/Recursive | Persistent | Managed location updates |
| `scan` (legacy) | Recursive | Persistent | Traditional full indexing |
| `location add` | Recursive | Persistent | Add new managed locations |

For complete CLI documentation, see CLI Documentation.

Debugging

Enable detailed logging:

# For CLI daemon
spacedrive start --foreground -v

# For development
RUST_LOG=sd_core::operations::indexing=debug cargo run

Common issues:

Slow indexing: Check filter rules and batch sizes
High memory usage: Reduce batch size
Missing files: Verify permissions and filter rules
No progress shown: Ensure daemon is running and use spacedrive job monitor
