Indexer Implementation Progress
Last Updated: 2025-06-19
Overview
The indexer has been rewritten around a phase-based architecture that prioritizes simplicity, maintainability, and performance. This document tracks implementation progress against the original indexer in `core/crates/heavy-lifting/src/indexer/`.
Architecture
The new indexer uses a clean phase-based pipeline:
- Discovery Phase: Directory traversal and entry collection
- Processing Phase: Database entry creation and updates with parent relationships
- Aggregation Phase: Calculate directory sizes and child counts
- Content Identification Phase: CAS ID generation and deduplication
- Complete Phase: Final cleanup and metrics reporting
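The phase ordering above can be sketched as a small state machine. This is a minimal illustration only: the `Phase` enum, `next`, and `remaining_phases` are hypothetical names, not the indexer's actual types.

```rust
/// The five phases listed above, in pipeline order.
/// Type and method names are illustrative, not the real indexer API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentIdentification,
    Complete,
}

impl Phase {
    /// Returns the phase that follows this one, or `None` once the
    /// pipeline has reached `Complete`.
    pub fn next(self) -> Option<Phase> {
        match self {
            Phase::Discovery => Some(Phase::Processing),
            Phase::Processing => Some(Phase::Aggregation),
            Phase::Aggregation => Some(Phase::ContentIdentification),
            Phase::ContentIdentification => Some(Phase::Complete),
            Phase::Complete => None,
        }
    }
}

/// Lists every phase from `start` to the end of the pipeline.
/// A resumed job could call this with the phase stored in its saved state.
pub fn remaining_phases(start: Phase) -> Vec<Phase> {
    let mut phases = vec![start];
    let mut current = start;
    while let Some(next) = current.next() {
        phases.push(next);
        current = next;
    }
    phases
}
```

Modelling the pipeline as a closed enum means adding a phase forces every `match` on it to be revisited, which fits the stated goal of making new processing steps easy to add safely.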
Implemented Features
Core Functionality
- Multi-phase indexing architecture - Clean separation of concerns
- Full job system integration - Pause, resume, cancel support
- State persistence - Full state serialization for resumability
- Checkpoint system - Periodic state saves every 5000 files
- Batch processing - Configurable batch sizes (default 1000)
- Progress reporting - Detailed progress with phase tracking
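How batching and checkpointing interact can be sketched as follows. The two constants come from the list above; `RunSummary` and `process_in_batches` are hypothetical, and the database write and state save are reduced to counters.

```rust
/// Defaults taken from the feature list above.
pub const DEFAULT_BATCH_SIZE: usize = 1000;
pub const CHECKPOINT_INTERVAL: usize = 5000;

/// Hypothetical summary of one run, used here only to make the
/// batching behaviour observable.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RunSummary {
    pub batches_flushed: usize,
    pub checkpoints_saved: usize,
    pub files_processed: usize,
}

/// Processes `total_files` in batches and records a checkpoint whenever
/// at least `checkpoint_interval` files have been handled since the last one.
pub fn process_in_batches(
    total_files: usize,
    batch_size: usize,
    checkpoint_interval: usize,
) -> RunSummary {
    assert!(batch_size > 0 && checkpoint_interval > 0);
    let mut summary = RunSummary::default();
    let mut since_checkpoint = 0;
    let mut remaining = total_files;

    while remaining > 0 {
        let batch = remaining.min(batch_size);
        // A real implementation would write `batch` entries to the database here.
        summary.batches_flushed += 1;
        summary.files_processed += batch;
        since_checkpoint += batch;
        remaining -= batch;

        if since_checkpoint >= checkpoint_interval {
            // A real implementation would serialize the job state here.
            summary.checkpoints_saved += 1;
            since_checkpoint = 0;
        }
    }
    summary
}
```

With the defaults, 12,500 files produce 13 batches and 2 checkpoints; the final 2,500 files are not checkpointed because the job finishes first.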
Change Detection & Incremental Updates
- Inode-based tracking - Cross-platform inode extraction
- Move/rename detection - Tracks files moved within indexed locations
- Modification detection - Size and timestamp comparison
- Deletion detection - Identifies removed files
- New file detection - Finds newly added files
- Configurable time precision - Handles filesystem timestamp limitations
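One way to combine these checks is to key the previous index by inode and classify each entry found on disk. The sketch below is an assumption about how that could look, not the indexer's code: `Snapshot`, `Change`, and `detect_changes` are hypothetical, and it ignores inode reuse and multiple devices, which a real implementation would have to handle.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical record of one indexed entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub inode: u64,
    pub path: String,
    pub size: u64,
    pub modified_secs: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Change {
    New(String),
    Modified(String),
    Moved { from: String, to: String },
    Deleted(String),
}

/// Compares the previously indexed entries (`known`) with what is on disk
/// now (`seen`). Timestamps that differ by at most `precision_secs` are
/// treated as equal, to tolerate coarse filesystem timestamps.
pub fn detect_changes(known: &[Snapshot], seen: &[Snapshot], precision_secs: u64) -> Vec<Change> {
    let by_inode: HashMap<u64, &Snapshot> = known.iter().map(|s| (s.inode, s)).collect();
    let mut matched = HashSet::new();
    let mut changes = Vec::new();

    for entry in seen {
        match by_inode.get(&entry.inode) {
            // Unknown inode: a newly added file.
            None => changes.push(Change::New(entry.path.clone())),
            Some(old) => {
                matched.insert(old.inode);
                // Same inode, different path: a move or rename.
                if old.path != entry.path {
                    changes.push(Change::Moved {
                        from: old.path.clone(),
                        to: entry.path.clone(),
                    });
                }
                // Size or timestamp drift beyond the precision window: a modification.
                let time_diff = old.modified_secs.abs_diff(entry.modified_secs);
                if old.size != entry.size || time_diff > precision_secs {
                    changes.push(Change::Modified(entry.path.clone()));
                }
            }
        }
    }

    // Anything known but not seen again has been deleted.
    for old in known {
        if !matched.contains(&old.inode) {
            changes.push(Change::Deleted(old.path.clone()));
        }
    }
    changes
}
```

Changes are reported in the order of `seen`, followed by deletions in the order of `known`. An entry that was both moved and modified yields two changes.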
Performance & Monitoring
- Comprehensive metrics - Per-phase timing and throughput
- Error statistics - Categorized error tracking
- Database operation tracking - Insert/update/delete counts
- Throughput calculations - Files/dirs/bytes per second
- Non-critical error collection - Graceful degradation
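The throughput figures reduce to dividing per-phase counters by elapsed time. A minimal sketch, assuming hypothetical `PhaseCounters` and `Throughput` structs (the real metrics types may differ):

```rust
use std::time::Duration;

/// Hypothetical raw counters collected during one phase.
#[derive(Debug, Clone, Copy)]
pub struct PhaseCounters {
    pub files: u64,
    pub dirs: u64,
    pub bytes: u64,
    pub elapsed: Duration,
}

/// Rates derived from the counters.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Throughput {
    pub files_per_sec: f64,
    pub dirs_per_sec: f64,
    pub bytes_per_sec: f64,
}

/// Computes files, directories, and bytes per second for one phase.
/// A zero-length phase reports zero rates instead of dividing by zero.
pub fn throughput(counters: &PhaseCounters) -> Throughput {
    let secs = counters.elapsed.as_secs_f64();
    if secs <= 0.0 {
        return Throughput {
            files_per_sec: 0.0,
            dirs_per_sec: 0.0,
            bytes_per_sec: 0.0,
        };
    }
    Throughput {
        files_per_sec: counters.files as f64 / secs,
        dirs_per_sec: counters.dirs as f64 / secs,
        bytes_per_sec: counters.bytes as f64 / secs,
    }
}
```

For example, 2,000 files and 4,000,000 bytes processed in 4 seconds give 500 files/s and 1,000,000 bytes/s.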
File System Integration
- Cross-platform metadata extraction - Unix permissions, timestamps
- Hidden file detection - Platform-specific hidden file handling
- Symlink type detection - Identifies symbolic links
- Directory traversal - Efficient async directory reading
- Loop detection - Prevents infinite loops in symlinked directories
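Loop detection is commonly done by remembering the identity of every directory already visited. The sketch below assumes that approach and is not taken from the indexer. It walks an in-memory map instead of the real filesystem so the logic stays portable; `DirId` and `walk` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Identity of a directory: (device id, inode number). On Unix these values
/// are available via `std::os::unix::fs::MetadataExt::{dev, ino}`.
pub type DirId = (u64, u64);

/// Depth-first traversal that visits each directory at most once.
/// `children` maps a directory to the directories reachable from it,
/// including those reached through symlinks, so it may contain cycles.
pub fn walk(root: DirId, children: &HashMap<DirId, Vec<DirId>>) -> Vec<DirId> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut stack = vec![root];

    while let Some(dir) = stack.pop() {
        // `insert` returns false if the directory was already seen,
        // which means a symlink loop or a second route to the same place.
        if !visited.insert(dir) {
            continue;
        }
        order.push(dir);
        if let Some(kids) = children.get(&dir) {
            // Push in reverse so children are visited in listed order.
            for kid in kids.iter().rev() {
                stack.push(*kid);
            }
        }
    }
    order
}
```

If a subdirectory contains a symlink back to the root, the root is seen a second time, skipped, and the traversal terminates.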
Content Management
- CAS ID generation - Content-addressable storage integration
- Content deduplication - Links multiple entries to same content
- Parallel hashing - Chunked parallel processing for performance
- Entry count tracking - Tracks references per content identity
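Deduplication with entry counts can be sketched as a map from content identifier to the number of entries linked to it. The hash here is a stand-in: `placeholder_content_id` uses the standard library's `DefaultHasher` purely for illustration and is not the CAS ID algorithm. `ContentIndex` is likewise hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for CAS ID generation. NOT the real algorithm: it only needs
/// to map equal bytes to equal identifiers for this sketch.
pub fn placeholder_content_id(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical index of content identities and how many entries use each.
#[derive(Debug, Default)]
pub struct ContentIndex {
    entry_counts: HashMap<u64, usize>,
}

impl ContentIndex {
    /// Links one file entry to its content identity and returns the
    /// identifier together with the updated reference count.
    pub fn link(&mut self, bytes: &[u8]) -> (u64, usize) {
        let id = placeholder_content_id(bytes);
        let count = self.entry_counts.entry(id).or_insert(0);
        *count += 1;
        (id, *count)
    }

    /// Number of distinct content identities seen so far.
    pub fn unique_contents(&self) -> usize {
        self.entry_counts.len()
    }
}
```

Linking two files with identical bytes returns the same identifier with counts 1 and then 2, while `unique_contents` stays at 1.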
Database Optimization
- Path prefix normalization - Reduces storage redundancy
- Prefix caching - Improves performance for common prefixes
- Efficient updates - Only updates changed fields
- Batch operations - Reduces database round trips
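Path prefix normalization with a cache can be sketched as interning: each distinct parent path is stored once and entries refer to it by a small id. The sketch assumes absolute, `/`-separated paths; `PrefixCache` and its methods are hypothetical, and the real schema may differ.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory prefix table.
#[derive(Debug, Default)]
pub struct PrefixCache {
    ids: HashMap<String, u32>,
    prefixes: Vec<String>,
}

impl PrefixCache {
    /// Returns the id for `prefix`, allocating one on first use.
    pub fn intern(&mut self, prefix: &str) -> u32 {
        if let Some(&id) = self.ids.get(prefix) {
            return id; // cache hit: no new storage needed
        }
        let id = self.prefixes.len() as u32;
        self.prefixes.push(prefix.to_string());
        self.ids.insert(prefix.to_string(), id);
        id
    }

    /// Splits an absolute path into (prefix id, file name).
    pub fn split(&mut self, path: &str) -> (u32, String) {
        let (prefix, name) = path.rsplit_once('/').unwrap_or(("", path));
        (self.intern(prefix), name.to_string())
    }

    /// Rebuilds the full path from a prefix id and a file name.
    pub fn join(&self, id: u32, name: &str) -> Option<String> {
        self.prefixes
            .get(id as usize)
            .map(|prefix| format!("{}/{}", prefix, name))
    }

    /// Number of distinct prefixes stored.
    pub fn len(&self) -> usize {
        self.prefixes.len()
    }
}
```

Two files in the same directory share one stored prefix, so the directory path is written once however many files it contains.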
Not Implemented
Deep Indexing Features
- Thumbnail generation - Image/video preview generation
- Text extraction - Full-text search support
- Media metadata - EXIF, ID3, video metadata
- MIME type detection - Accurate file type identification
- Content analysis - File format validation
- Archive inspection - Look inside zip/tar files
Directory Management (implemented; see the Aggregation Phase above)
- Size aggregation - Calculate directory sizes
- Parent-child relationships - Track directory hierarchy with parent_id
- Directory statistics - File count, child count tracking
- Efficient hierarchical queries - Indexed parent_id for fast lookups
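Size aggregation over `parent_id` links can be sketched by walking from each file up through its ancestors. `Entry`, `DirStats`, and `aggregate` are hypothetical, and the sketch favours clarity: it costs O(entries × depth), where a real implementation would more likely aggregate bottom-up in the database.

```rust
use std::collections::HashMap;

/// Hypothetical entry row with an explicit parent reference.
#[derive(Debug, Clone)]
pub struct Entry {
    pub id: u32,
    pub parent_id: Option<u32>,
    pub is_dir: bool,
    pub size: u64,
}

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct DirStats {
    /// Sum of the sizes of all files below this directory.
    pub total_size: u64,
    /// Direct children (files and directories).
    pub child_count: u64,
    /// Files anywhere below this directory.
    pub file_count: u64,
}

/// Computes statistics for every directory in `entries`.
pub fn aggregate(entries: &[Entry]) -> HashMap<u32, DirStats> {
    let by_id: HashMap<u32, &Entry> = entries.iter().map(|e| (e.id, e)).collect();
    let mut stats: HashMap<u32, DirStats> = entries
        .iter()
        .filter(|e| e.is_dir)
        .map(|e| (e.id, DirStats::default()))
        .collect();

    for entry in entries {
        // Count the entry as a direct child of its parent.
        if let Some(parent) = entry.parent_id {
            if let Some(s) = stats.get_mut(&parent) {
                s.child_count += 1;
            }
        }
        if entry.is_dir {
            continue;
        }
        // Add the file's size to every ancestor directory.
        let mut cursor = entry.parent_id;
        let mut hops = 0;
        while let Some(dir_id) = cursor {
            if let Some(s) = stats.get_mut(&dir_id) {
                s.total_size += entry.size;
                s.file_count += 1;
            }
            cursor = by_id.get(&dir_id).and_then(|dir| dir.parent_id);
            hops += 1;
            if hops > entries.len() {
                break; // guard against malformed parent cycles
            }
        }
    }
    stats
}
```

Because each row carries its own `parent_id`, moving a directory changes one reference, not every descendant's path.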
Rules System
- Database-backed rules - User-configurable indexing rules
- Per-location rules - Different rules for different locations
- Glob pattern matching - Include/exclude by pattern
- Git ignore integration - Respect .gitignore files
- Rule compilation - Efficient rule evaluation
- UI for rule management - User interface for configuration
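Since the rules system is not yet built, the following is only a sketch of what include/exclude pattern evaluation might look like. Everything in it is an assumption: the `Rule` type, the simplified wildcard syntax (`*` and `?` only, no `**` or path-aware matching), and the "last matching rule wins, default is to index" semantics, which loosely follow gitignore ordering.

```rust
/// Matches `text` against `pattern`, where `*` matches any run of characters
/// (including none) and `?` matches exactly one character.
pub fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position to resume from after the most recent `*`:
    // (pattern index after the star, text index the star currently ends at).
    let mut star: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((after_star, star_end)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = after_star;
            ti = star_end + 1;
            star = Some((after_star, star_end + 1));
        } else {
            return false;
        }
    }
    // Only trailing stars may remain in the pattern.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuleKind {
    Include,
    Exclude,
}

/// Hypothetical indexing rule.
#[derive(Debug, Clone)]
pub struct Rule {
    pub kind: RuleKind,
    pub pattern: String,
}

/// Decides whether `file_name` should be indexed. The last matching rule
/// wins; a name that matches no rule is indexed.
pub fn should_index(rules: &[Rule], file_name: &str) -> bool {
    rules
        .iter()
        .rev()
        .find(|rule| wildcard_match(&rule.pattern, file_name))
        .map(|rule| rule.kind == RuleKind::Include)
        .unwrap_or(true)
}
```

With an exclude rule `*.log` followed by an include rule `keep*.log`, `debug.log` is skipped while `keep-me.log` and `main.rs` are indexed.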
Advanced Features
- Network file support - Full SMB/NFS handling
- Cloud storage integration - Index cloud providers
- Indexing priorities - User-defined indexing order
- Partial indexing - Index specific subdirectories only
Partially Implemented
Memory Management
- Structure exists in metrics (done)
- Actual memory tracking (pending)
- Memory limit enforcement (pending)
- Adaptive batch sizing (pending)
Location Integration
- Basic location support
- Multiple location coordination
- Location-specific settings
- Cross-location deduplication
Implementation Comparison
| Feature | Old Indexer | New Indexer | Status |
|---|---|---|---|
| Architecture | Task-based with 7 stages | Phase-based with 5 phases | Simplified |
| State Management | Complex serialization | Direct JSON/MessagePack | Improved |
| Change Detection | Full implementation | Full implementation | Complete |
| Rules System | Database-backed, complex | Hardcoded filters only | Missing |
| Performance | Parallel tasks, streaming | Batch processing, metrics | Different approach |
| Content Identity | Basic CAS support | Full deduplication system | Enhanced |
| Error Handling | Critical/non-critical | Categorized collection | Improved |
| Directory Sizes | Materialized paths | Parent ID + aggregation | Enhanced |
| Deep Indexing | Not implemented | Framework exists | In progress |
| Sync Support | Full CRDT integration | Not planned yet | Deferred |
Priority TODOs
- Implement Rules System - Critical for user control
  - Design rule storage schema
  - Implement rule evaluation engine
  - Add git ignore support
  - Create UI for rule management
- Deep Indexing Phase - Enhanced functionality
  - Integrate thumbnail generation
  - Add text extraction
  - Implement media metadata extraction
- Memory Management - Production readiness
  - Implement actual memory tracking
  - Add adaptive batch sizing
  - Enforce memory limits
- Testing & Documentation
  - Add comprehensive test coverage
  - Document public APIs
  - Create integration examples
Notes
- The new indexer prioritizes correctness and maintainability over complex optimizations
- CRDT sync support is intentionally deferred to a later phase
- The phase-based architecture makes it easier to add new processing steps
- Real-time file system monitoring is handled by the separate `location_watcher` service (see `/core/src/services/location_watcher/` and `/core/docs/design/WATCHER_VDFS_INTEGRATION.md`)
- Directory sizes are calculated in a dedicated aggregation phase, making them more accurate and efficient than the old materialized path approach
- Parent-child relationships use explicit parent_id references instead of materialized paths, enabling more flexible hierarchical queries
- The current implementation provides a solid foundation for future enhancements