spacedrive/docs/design/core/indexer/INDEXER_PROGRESS.md
2025-10-11 11:11:25 -07:00

7.2 KiB
Raw Blame History

Indexer Implementation Progress

Last Updated: 2025-06-19

Overview

The new indexer has been rewritten with a phase-based architecture that prioritizes simplicity, maintainability, and performance. This document tracks the implementation progress compared to the original indexer in core/crates/heavy-lifting/src/indexer/.

Architecture

The new indexer uses a clean phase-based pipeline:

  • Discovery Phase: Directory traversal and entry collection
  • Processing Phase: Database entry creation and updates with parent relationships
  • Aggregation Phase: Calculate directory sizes and child counts
  • Content Identification Phase: CAS ID generation and deduplication
  • Complete Phase: Final cleanup and metrics reporting

Implemented Features

Core Functionality

  • Multi-phase indexing architecture - Clean separation of concerns
  • Full job system integration - Pause, resume, cancel support
  • State persistence - Full state serialization for resumability
  • Checkpoint system - Periodic state saves every 5000 files
  • Batch processing - Configurable batch sizes (default 1000)
  • Progress reporting - Detailed progress with phase tracking

Change Detection & Incremental Updates

  • Inode-based tracking - Cross-platform inode extraction
  • Move/rename detection - Tracks files moved within indexed locations
  • Modification detection - Size and timestamp comparison
  • Deletion detection - Identifies removed files
  • New file detection - Finds newly added files
  • Configurable time precision - Handles filesystem timestamp limitations

Performance & Monitoring

  • Comprehensive metrics - Per-phase timing and throughput
  • Error statistics - Categorized error tracking
  • Database operation tracking - Insert/update/delete counts
  • Throughput calculations - Files/dirs/bytes per second
  • Non-critical error collection - Graceful degradation

File System Integration

  • Cross-platform metadata extraction - Unix permissions, timestamps
  • Hidden file detection - Platform-specific hidden file handling
  • Symlink type detection - Identifies symbolic links
  • Directory traversal - Efficient async directory reading
  • Loop detection - Prevents infinite loops in symlinked directories

Content Management

  • CAS ID generation - Content-addressable storage integration
  • Content deduplication - Links multiple entries to same content
  • Parallel hashing - Chunked parallel processing for performance
  • Entry count tracking - Tracks references per content identity

Database Optimization

  • Path prefix normalization - Reduces storage redundancy
  • Prefix caching - Improves performance for common prefixes
  • Efficient updates - Only updates changed fields
  • Batch operations - Reduces database round trips

Not Implemented

Deep Indexing Features

  • Thumbnail generation - Image/video preview generation
  • Text extraction - Full-text search support
  • Media metadata - EXIF, ID3, video metadata
  • MIME type detection - Accurate file type identification
  • Content analysis - File format validation
  • Archive inspection - Look inside zip/tar files

Directory Management

  • Size aggregation - Calculate directory sizes
  • Parent-child relationships - Track directory hierarchy with parent_id
  • Directory statistics - File count, child count tracking
  • Efficient hierarchical queries - Indexed parent_id for fast lookups

Rules System

  • Database-backed rules - User-configurable indexing rules
  • Per-location rules - Different rules for different locations
  • Glob pattern matching - Include/exclude by pattern
  • Git ignore integration - Respect .gitignore files
  • Rule compilation - Efficient rule evaluation
  • UI for rule management - User interface for configuration

Advanced Features

  • Network file support - Full SMB/NFS handling
  • Cloud storage integration - Index cloud providers
  • Indexing priorities - User-defined indexing order
  • Partial indexing - Index specific subdirectories only

Partially Implemented

Memory Management

  • Structure exists in metrics
  • Actual memory tracking
  • Memory limit enforcement
  • Adaptive batch sizing

Location Integration

  • Basic location support
  • Multiple location coordination
  • Location-specific settings
  • Cross-location deduplication

Implementation Comparison

Feature Old Indexer New Indexer Status
Architecture Task-based with 7 stages Phase-based with 5 phases Simplified
State Management Complex serialization Direct JSON/MessagePack Improved
Change Detection Full implementation Full implementation Complete
Rules System Database-backed, complex Hardcoded filters only Missing
Performance Parallel tasks, streaming Batch processing, metrics Different approach
Content Identity Basic CAS support Full deduplication system Enhanced
Error Handling Critical/non-critical Categorized collection Improved
Directory Sizes Materialized paths Parent ID + aggregation Enhanced
Deep Indexing Not implemented Framework exists In progress
Sync Support Full CRDT integration Not planned yet Deferred

Priority TODOs

  1. Implement Rules System - Critical for user control

    • Design rule storage schema
    • Implement rule evaluation engine
    • Add git ignore support
    • Create UI for rule management
  2. Deep Indexing Phase - Enhanced functionality

    • Integrate thumbnail generation
    • Add text extraction
    • Implement media metadata extraction
  3. Memory Management - Production readiness

    • Implement actual memory tracking
    • Add adaptive batch sizing
    • Enforce memory limits
  4. Testing & Documentation

    • Add comprehensive test coverage
    • Document public APIs
    • Create integration examples

Notes

  • The new indexer prioritizes correctness and maintainability over complex optimizations
  • CRDT sync support is intentionally deferred to a later phase
  • The phase-based architecture makes it easier to add new processing steps
  • Real-time file system monitoring is handled by the separate location_watcher service (see /core/src/services/location_watcher/ and /core/docs/design/WATCHER_VDFS_INTEGRATION.md)
  • Directory sizes are calculated in a dedicated aggregation phase, making them more accurate and efficient than the old materialized path approach
  • Parent-child relationships use explicit parent_id references instead of materialized paths, enabling more flexible hierarchical queries
  • Current implementation provides a solid foundation for future enhancements