mirror of https://github.com/spacedriveapp/spacedrive.git synced 2025-12-11 20:15:30 +01:00

Jamie Pine 35ac1f214f sorting docs

2025-10-11 11:11:25 -07:00

Indexer Implementation Progress

Last Updated: 2025-06-19

Overview

The indexer has been rewritten around a phase-based architecture that prioritizes simplicity, maintainability, and performance. This document tracks implementation progress against the original indexer in core/crates/heavy-lifting/src/indexer/.

Architecture

The new indexer uses a clean phase-based pipeline:

Discovery Phase: Directory traversal and entry collection
Processing Phase: Database entry creation and updates with parent relationships
Aggregation Phase: Calculate directory sizes and child counts
Content Identification Phase: CAS ID generation and deduplication
Complete Phase: Final cleanup and metrics reporting
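
As a rough illustration of the pipeline above, the five phases can be modeled as an enum with a fixed successor order. This is a minimal sketch, not the indexer's actual types: `Phase`, `next`, and `run_pipeline` are hypothetical names chosen for this example.

```rust
/// The five phases named above, in pipeline order (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentIdentification,
    Complete,
}

impl Phase {
    /// Returns the phase that follows `self`, or `None` once the pipeline is done.
    pub fn next(self) -> Option<Phase> {
        match self {
            Phase::Discovery => Some(Phase::Processing),
            Phase::Processing => Some(Phase::Aggregation),
            Phase::Aggregation => Some(Phase::ContentIdentification),
            Phase::ContentIdentification => Some(Phase::Complete),
            Phase::Complete => None,
        }
    }
}

/// Runs every phase in order, calling `run_phase` for each one.
/// Stops at the first error and reports which phase failed.
pub fn run_pipeline<F>(mut run_phase: F) -> Result<Vec<Phase>, (Phase, String)>
where
    F: FnMut(Phase) -> Result<(), String>,
{
    let mut completed = Vec::new();
    let mut current = Some(Phase::Discovery);
    while let Some(phase) = current {
        run_phase(phase).map_err(|e| (phase, e))?;
        completed.push(phase);
        current = phase.next();
    }
    Ok(completed)
}
```

Because each phase has exactly one successor, adding a processing step in this sketch means adding one variant and one match arm.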

Implemented Features

Core Functionality

Multi-phase indexing architecture - Clean separation of concerns
Full job system integration - Pause, resume, cancel support
State persistence - Full state serialization for resumability
Checkpoint system - Periodic state saves every 5000 files
Batch processing - Configurable batch sizes (default 1000)
Progress reporting - Detailed progress with phase tracking
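
The interaction between batching and checkpoints can be sketched as follows. Only the two numbers (batches of 1000, a checkpoint every 5000 files) come from the list above; `IndexerState`, `process_in_batches`, and the `save` callback are assumptions made for illustration, and the real job system persists far more state.

```rust
/// Defaults quoted in the feature list above.
pub const DEFAULT_BATCH_SIZE: usize = 1000;
pub const CHECKPOINT_INTERVAL: usize = 5000;

/// Hypothetical resumable state: how many entries have been processed so far.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct IndexerState {
    pub processed: usize,
    pub checkpoints_written: usize,
}

/// Processes `entries` in batches, resuming from `state.processed`, and calls
/// `save` each time at least `CHECKPOINT_INTERVAL` more entries are done.
pub fn process_in_batches<T, S>(
    entries: &[T],
    state: &mut IndexerState,
    batch_size: usize,
    mut save: S,
) where
    S: FnMut(&IndexerState),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let start = state.processed.min(entries.len());
    let mut since_checkpoint = 0;
    for batch in entries[start..].chunks(batch_size) {
        // A real implementation would write `batch` to the database here.
        state.processed += batch.len();
        since_checkpoint += batch.len();
        if since_checkpoint >= CHECKPOINT_INTERVAL {
            state.checkpoints_written += 1;
            save(&*state);
            since_checkpoint = 0;
        }
    }
}
```

Resuming after a pause or crash then amounts to loading the last saved state and calling the function again with the same entry list.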

Change Detection & Incremental Updates

Inode-based tracking - Cross-platform inode extraction
Move/rename detection - Tracks files moved within indexed locations
Modification detection - Size and timestamp comparison
Deletion detection - Identifies removed files
New file detection - Finds newly added files
Configurable time precision - Handles filesystem timestamp limitations
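
One way to picture how these detections fit together is to compare two snapshots keyed by inode. The sketch below is an assumption-laden simplification: `Snapshot`, `Change`, and `detect_changes` are hypothetical names, timestamps are reduced to whole seconds, and inode reuse by the filesystem is ignored.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical record of what was known about a file at index time.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub inode: u64,
    pub path: String,
    pub size: u64,
    pub modified_secs: u64,
}

impl Snapshot {
    pub fn new(inode: u64, path: &str, size: u64, modified_secs: u64) -> Self {
        Self { inode, path: path.to_string(), size, modified_secs }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Change {
    New(String),
    Modified(String),
    Moved { from: String, to: String },
    Deleted(String),
}

/// Compares two scans. Timestamps closer than `precision_secs` are treated as
/// equal, which tolerates filesystems with coarse timestamp resolution.
pub fn detect_changes(previous: &[Snapshot], current: &[Snapshot], precision_secs: u64) -> Vec<Change> {
    let prev_by_inode: HashMap<u64, &Snapshot> = previous.iter().map(|s| (s.inode, s)).collect();
    let current_inodes: HashSet<u64> = current.iter().map(|s| s.inode).collect();
    let mut changes = Vec::new();

    for cur in current {
        match prev_by_inode.get(&cur.inode) {
            // Unknown inode: the file is new.
            None => changes.push(Change::New(cur.path.clone())),
            Some(prev) => {
                // Same inode under a different path: a move or rename.
                if prev.path != cur.path {
                    changes.push(Change::Moved { from: prev.path.clone(), to: cur.path.clone() });
                }
                let time_differs = prev.modified_secs.abs_diff(cur.modified_secs) > precision_secs;
                if prev.size != cur.size || time_differs {
                    changes.push(Change::Modified(cur.path.clone()));
                }
            }
        }
    }
    // Inodes seen before but absent now were deleted.
    for prev in previous {
        if !current_inodes.contains(&prev.inode) {
            changes.push(Change::Deleted(prev.path.clone()));
        }
    }
    changes
}
```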

Performance & Monitoring

Comprehensive metrics - Per-phase timing and throughput
Error statistics - Categorized error tracking
Database operation tracking - Insert/update/delete counts
Throughput calculations - Files/dirs/bytes per second
Non-critical error collection - Graceful degradation
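
The throughput figures are simple ratios of counts to elapsed time. A minimal sketch, assuming a hypothetical `PhaseMetrics` struct (the real metrics type may be shaped differently):

```rust
use std::time::Duration;

/// Hypothetical per-phase counters; field names are illustrative only.
#[derive(Debug, Default, Clone)]
pub struct PhaseMetrics {
    pub files: u64,
    pub dirs: u64,
    pub bytes: u64,
    pub elapsed: Duration,
}

impl PhaseMetrics {
    /// Divides a counter by the elapsed time, returning 0.0 for a zero duration
    /// so that a phase that finished instantly does not divide by zero.
    fn per_second(&self, count: u64) -> f64 {
        let secs = self.elapsed.as_secs_f64();
        if secs > 0.0 { count as f64 / secs } else { 0.0 }
    }

    pub fn files_per_sec(&self) -> f64 {
        self.per_second(self.files)
    }

    pub fn dirs_per_sec(&self) -> f64 {
        self.per_second(self.dirs)
    }

    pub fn bytes_per_sec(&self) -> f64 {
        self.per_second(self.bytes)
    }
}
```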

File System Integration

Cross-platform metadata extraction - Unix permissions, timestamps
Hidden file detection - Platform-specific hidden file handling
Symlink type detection - Identifies symbolic links
Directory traversal - Efficient async directory reading
Loop detection - Prevents infinite loops in symlinked directories
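
Loop detection is commonly done by remembering the identity of every directory already visited. The sketch below shows that idea on an in-memory graph rather than a real filesystem; `DirId` and `walk_without_loops` are hypothetical, and it is an assumption that the indexer works this way. On Unix, such an identity could be taken from `std::os::unix::fs::MetadataExt::dev()` and `ino()`.

```rust
use std::collections::{HashMap, HashSet};

/// Identity of a directory: (device id, inode number).
pub type DirId = (u64, u64);

/// Depth-first walk over `children`, visiting each directory at most once.
/// A symlink that points back to an ancestor resolves to an identity that is
/// already in `visited`, so it is skipped instead of being traversed again.
pub fn walk_without_loops(root: DirId, children: &HashMap<DirId, Vec<DirId>>) -> Vec<DirId> {
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    let mut order = Vec::new();
    while let Some(dir) = stack.pop() {
        // `insert` returns false when the identity was already present.
        if !visited.insert(dir) {
            continue;
        }
        order.push(dir);
        if let Some(kids) = children.get(&dir) {
            stack.extend(kids.iter().copied());
        }
    }
    order
}
```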

Content Management

CAS ID generation - Content-addressable storage integration
Content deduplication - Links multiple entries to same content
Parallel hashing - Chunked parallel processing for performance
Entry count tracking - Tracks references per content identity
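
Deduplication with entry counts can be illustrated with a map from content ID to the number of entries that reference it. Note the loud assumption: `content_id` below uses the standard library's `DefaultHasher` purely as a stand-in and is not the project's CAS ID algorithm. `ContentIndex` and its methods are likewise hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content ID. NOT the real CAS ID algorithm; a 64-bit std hash is
/// used here only so the example has no external dependencies.
pub fn content_id(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical index of content identities and their reference counts.
#[derive(Default)]
pub struct ContentIndex {
    entry_counts: HashMap<String, usize>,
}

impl ContentIndex {
    /// Links one more entry to the content identity of `bytes` and returns its ID.
    /// Identical bytes map to the same identity, so duplicates share one record.
    pub fn link(&mut self, bytes: &[u8]) -> String {
        let id = content_id(bytes);
        *self.entry_counts.entry(id.clone()).or_insert(0) += 1;
        id
    }

    /// Number of entries currently linked to `id`.
    pub fn entry_count(&self, id: &str) -> usize {
        self.entry_counts.get(id).copied().unwrap_or(0)
    }

    /// Number of distinct content identities seen so far.
    pub fn unique_contents(&self) -> usize {
        self.entry_counts.len()
    }
}
```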

Database Optimization

Path prefix normalization - Reduces storage redundancy
Prefix caching - Improves performance for common prefixes
Efficient updates - Only updates changed fields
Batch operations - Reduces database round trips
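
Path prefix normalization can be thought of as string interning: each distinct directory prefix is stored once and entries refer to it by ID. The following is a sketch under assumptions, not the actual schema. `PrefixTable` is a hypothetical type, and paths are assumed to be relative to a location root with `/` separators.

```rust
use std::collections::HashMap;

/// Hypothetical prefix table: the map doubles as the cache for known prefixes.
#[derive(Default)]
pub struct PrefixTable {
    ids: HashMap<String, u32>,
    prefixes: Vec<String>,
}

impl PrefixTable {
    /// Returns the ID for `prefix`, allocating a new one on first sight.
    pub fn intern(&mut self, prefix: &str) -> u32 {
        if let Some(&id) = self.ids.get(prefix) {
            return id; // cache hit: no new storage needed
        }
        let id = self.prefixes.len() as u32;
        self.prefixes.push(prefix.to_string());
        self.ids.insert(prefix.to_string(), id);
        id
    }

    /// Splits a relative path into (prefix ID, file name).
    pub fn normalize(&mut self, path: &str) -> (u32, String) {
        let (prefix, name) = match path.rfind('/') {
            Some(i) => (&path[..i], &path[i + 1..]),
            None => ("", path),
        };
        (self.intern(prefix), name.to_string())
    }

    /// Rebuilds the full relative path from a prefix ID and a file name.
    pub fn restore(&self, prefix_id: u32, name: &str) -> Option<String> {
        let prefix = self.prefixes.get(prefix_id as usize)?;
        Some(if prefix.is_empty() {
            name.to_string()
        } else {
            format!("{}/{}", prefix, name)
        })
    }
}
```

With many files per directory, each row then carries a small integer and a file name instead of repeating the full directory path.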

Not Implemented

Deep Indexing Features

Thumbnail generation - Image/video preview generation
Text extraction - Full-text search support
Media metadata - EXIF, ID3, video metadata
MIME type detection - Accurate file type identification
Content analysis - File format validation
Archive inspection - Look inside zip/tar files

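Directory Management (listed here, but already provided by the Processing and Aggregation phases; see Architecture and Notes)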

Size aggregation - Calculate directory sizes
Parent-child relationships - Track directory hierarchy with parent_id
Directory statistics - File count, child count tracking
Efficient hierarchical queries - Indexed parent_id for fast lookups
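
Size aggregation over explicit `parent_id` links can be sketched by walking each file's ancestor chain. This is an illustrative simplification with hypothetical types (`Entry`, `DirStats`, `aggregate`); the indexer itself works against the database rather than an in-memory slice, and the entries are assumed to form a tree.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory entry with an explicit parent reference.
#[derive(Debug, Clone)]
pub struct Entry {
    pub id: u32,
    pub parent_id: Option<u32>,
    pub is_dir: bool,
    pub size: u64,
}

#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct DirStats {
    /// Sum of the sizes of all files anywhere below the directory.
    pub total_size: u64,
    /// Number of direct children (files and directories).
    pub child_count: u32,
}

pub fn aggregate(entries: &[Entry]) -> HashMap<u32, DirStats> {
    let parent_of: HashMap<u32, Option<u32>> =
        entries.iter().map(|e| (e.id, e.parent_id)).collect();
    let mut stats: HashMap<u32, DirStats> = entries
        .iter()
        .filter(|e| e.is_dir)
        .map(|e| (e.id, DirStats::default()))
        .collect();

    for entry in entries {
        // Every entry counts as one direct child of its parent directory.
        if let Some(parent) = entry.parent_id {
            if let Some(s) = stats.get_mut(&parent) {
                s.child_count += 1;
            }
        }
        if entry.is_dir {
            continue;
        }
        // Add the file's size to every ancestor. The step limit guards
        // against malformed data in which parent links form a cycle.
        let mut ancestor = entry.parent_id;
        let mut steps = 0;
        while let Some(dir) = ancestor {
            if steps >= entries.len() {
                break;
            }
            steps += 1;
            if let Some(s) = stats.get_mut(&dir) {
                s.total_size += entry.size;
            }
            ancestor = parent_of.get(&dir).copied().flatten();
        }
    }
    stats
}
```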

Rules System

Database-backed rules - User-configurable indexing rules
Per-location rules - Different rules for different locations
Glob pattern matching - Include/exclude by pattern
Git ignore integration - Respect .gitignore files
Rule compilation - Efficient rule evaluation
UI for rule management - User interface for configuration
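
Since the rules system is not implemented yet, the following is only a sketch of what include/exclude evaluation by pattern might look like. Everything here is hypothetical: the `Rule` and `RuleKind` types, the "exclude wins" semantics, and the simplified matcher, in which `*` matches any run of characters (including `/`) and `?` matches exactly one. Real glob and .gitignore semantics are richer than this.

```rust
/// Simplified wildcard match: `*` matches any run of characters, `?` matches one.
pub fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position of the last `*` in the pattern and the text index it was tried at.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any trailing `*` can match the empty string.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuleKind {
    Include,
    Exclude,
}

#[derive(Debug, Clone)]
pub struct Rule {
    pub kind: RuleKind,
    pub pattern: String,
}

/// A path is indexed unless an exclude rule matches it. If any include rules
/// exist, the path must also match at least one of them.
pub fn should_index(path: &str, rules: &[Rule]) -> bool {
    let matches = |kind: RuleKind| {
        rules
            .iter()
            .filter(|r| r.kind == kind)
            .any(|r| wildcard_match(&r.pattern, path))
    };
    if matches(RuleKind::Exclude) {
        return false;
    }
    let has_includes = rules.iter().any(|r| r.kind == RuleKind::Include);
    !has_includes || matches(RuleKind::Include)
}
```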

Advanced Features

Network file support - Full SMB/NFS handling
Cloud storage integration - Index cloud providers
Indexing priorities - User-defined indexing order
Partial indexing - Index specific subdirectories only

Partially Implemented

Memory Management

Structure exists in metrics (done)
Actual memory tracking (pending)
Memory limit enforcement (pending)
Adaptive batch sizing (pending)

Location Integration

Basic location support
Multiple location coordination
Location-specific settings
Cross-location deduplication

Implementation Comparison

| Feature | Old Indexer | New Indexer | Status |
| --- | --- | --- | --- |
| Architecture | Task-based with 7 stages | Phase-based with 5 phases | Simplified |
| State Management | Complex serialization | Direct JSON/MessagePack | Improved |
| Change Detection | Full implementation | Full implementation | Complete |
| Rules System | Database-backed, complex | Hardcoded filters only | Missing |
| Performance | Parallel tasks, streaming | Batch processing, metrics | Different approach |
| Content Identity | Basic CAS support | Full deduplication system | Enhanced |
| Error Handling | Critical/non-critical | Categorized collection | Improved |
| Directory Sizes | Materialized paths | Parent ID + aggregation | Enhanced |
| Deep Indexing | Not implemented | Framework exists | In progress |
| Sync Support | Full CRDT integration | Not planned yet | Deferred |

Priority TODOs

1. Implement Rules System - Critical for user control
   - Design rule storage schema
   - Implement rule evaluation engine
   - Add git ignore support
   - Create UI for rule management
2. Deep Indexing Phase - Enhanced functionality
   - Integrate thumbnail generation
   - Add text extraction
   - Implement media metadata extraction
3. Memory Management - Production readiness
   - Implement actual memory tracking
   - Add adaptive batch sizing
   - Enforce memory limits
4. Testing & Documentation
   - Add comprehensive test coverage
   - Document public APIs
   - Create integration examples

Notes

The new indexer prioritizes correctness and maintainability over complex optimizations
CRDT sync support is intentionally deferred to a later phase
The phase-based architecture makes it easier to add new processing steps
Real-time file system monitoring is handled by the separate location_watcher service (see /core/src/services/location_watcher/ and /core/docs/design/WATCHER_VDFS_INTEGRATION.md)
Directory sizes are calculated in a dedicated aggregation phase, making them more accurate and efficient than the old materialized path approach
Parent-child relationships use explicit parent_id references instead of materialized paths, enabling more flexible hierarchical queries
The current implementation provides a solid foundation for future enhancements
