spacedrive/docs/core/indexing.mdx
2025-11-14 21:31:21 -08:00

---
title: Indexing
sidebarTitle: Indexing
---
The indexing system discovers and analyzes your files through a four-phase process. Built on Spacedrive's job system, it provides resumable operations and real-time progress tracking, and supports both persistent library indexing and ephemeral browsing of external drives.
## Architecture Overview
The indexing system consists of several key components working together:
**IndexerJob** orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.
**IndexerState** preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, and accumulated statistics.
**EntryProcessor** handles the complex task of creating and updating database records while maintaining referential integrity through materialized paths.
**FileTypeRegistry** identifies files through a combination of extensions, magic bytes, and content analysis to provide accurate type detection.
The system integrates deeply with Spacedrive's job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.
<Note>
Indexing jobs can run for hours on large directories. The resumable
architecture ensures no work is lost if interrupted.
</Note>
## Indexing Phases
The indexer operates through four distinct phases, each designed to be interruptible and resumable:
### Phase 1: Discovery
The discovery phase walks your filesystem to build a list of all files and directories. This phase is optimized for speed, collecting just enough information to plan the work ahead:
```rust
use std::collections::{HashSet, VecDeque};
use std::path::PathBuf;

// Discovery maintains a queue of directories to process
pub struct DiscoveryPhase {
    dirs_to_walk: VecDeque<PathBuf>,
    seen_paths: HashSet<PathBuf>, // Cycle detection
}
```
The phase uses breadth-first traversal so that shallow directories are processed first, which surfaces initial results sooner. Progress is reported as directories discovered against the estimated total.
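To make the traversal order concrete, here is a minimal sketch of a breadth-first walk with cycle detection. It is not Spacedrive's implementation: `discover_dirs` and the `list_subdirs` callback are hypothetical names introduced for illustration, and the callback stands in for whatever lists a directory's children.

```rust
use std::collections::{HashSet, VecDeque};
use std::path::{Path, PathBuf};

/// Walks a tree breadth-first and returns directories in visit order.
/// `list_subdirs` is a hypothetical callback returning the child
/// directories of a path; it is an assumption of this sketch.
pub fn discover_dirs<F>(root: &Path, list_subdirs: F) -> Vec<PathBuf>
where
    F: Fn(&Path) -> Vec<PathBuf>,
{
    let mut dirs_to_walk: VecDeque<PathBuf> = VecDeque::new();
    let mut seen_paths: HashSet<PathBuf> = HashSet::new();
    let mut visited = Vec::new();

    dirs_to_walk.push_back(root.to_path_buf());
    seen_paths.insert(root.to_path_buf());

    // Popping from the front gives FIFO order, so every directory at
    // depth N is visited before any directory at depth N + 1.
    while let Some(dir) = dirs_to_walk.pop_front() {
        for child in list_subdirs(&dir) {
            // `insert` returns false for a path that was already queued,
            // which keeps a link back to an ancestor from looping forever.
            if seen_paths.insert(child.clone()) {
                dirs_to_walk.push_back(child);
            }
        }
        visited.push(dir);
    }
    visited
}
```

In a real walker the callback could wrap `std::fs::read_dir`. Taking it as a parameter keeps the sketch testable without touching the disk.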
### Phase 2: Processing
Processing creates or updates database entries for each discovered item. This is where Spacedrive builds its understanding of your file structure:
```rust
// Batch processing for efficiency
const BATCH_SIZE: usize = 1000;
// Process entries in parent-first order
let sorted_batch = batch.sort_by_depth();
persistence.process_batch(sorted_batch, &mut entry_cache)?;
```
The system uses materialized paths instead of parent IDs, which makes queries faster and eliminates complex recursive lookups. Each entry stores its full path prefix, so listing a directory is a lookup on that prefix rather than a walk up or down the tree.
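The parent-first ordering and batching described above can be sketched as follows. This is an illustration under assumptions, not the actual `sort_by_depth` implementation: `parent_first_batches` is a hypothetical helper, and depth is taken to be the number of path components.

```rust
use std::path::PathBuf;

/// Sorts entries so every parent precedes its children, then splits
/// the result into batches of at most `batch_size` entries.
/// Hypothetical helper for illustration only.
pub fn parent_first_batches(mut paths: Vec<PathBuf>, batch_size: usize) -> Vec<Vec<PathBuf>> {
    // A parent always has fewer components than its descendants, so
    // ordering by depth guarantees a parent is written before any of
    // its children need to reference it.
    paths.sort_by_key(|p| p.components().count());

    // `chunks` panics on a size of zero, so clamp to at least one.
    paths
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

With the documented batch size this would be called as `parent_first_batches(paths, 1000)`. Because the sort is stable, entries at the same depth keep their discovery order.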
### Phase 3: Aggregation
Aggregation calculates sizes and counts for directories by traversing the tree bottom-up. This phase provides the statistics you see in the UI:
- Total size including subdirectories
- Direct child count
- Recursive file count
- Aggregate content types
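As an illustration of the bottom-up pass, the sketch below computes total directory sizes from a flat list of entries. The `Entry` struct and `aggregate_sizes` function are hypothetical and much simpler than real database records. The sketch assumes sizes are summed deepest directory first.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical, simplified entry: a path plus its own size in bytes.
pub struct Entry {
    pub path: PathBuf,
    pub size: u64,
    pub is_dir: bool,
}

/// Returns the total size of every directory, including subdirectories.
pub fn aggregate_sizes(entries: &[Entry]) -> HashMap<PathBuf, u64> {
    let mut totals: HashMap<PathBuf, u64> = HashMap::new();

    // Seed every directory with 0 and add each file to its direct parent.
    for e in entries {
        if e.is_dir {
            totals.entry(e.path.clone()).or_insert(0);
        } else if let Some(parent) = e.path.parent() {
            *totals.entry(parent.to_path_buf()).or_insert(0) += e.size;
        }
    }

    // Visit directories deepest-first so a child's total is final
    // before it is folded into its parent.
    let mut dirs: Vec<PathBuf> = totals.keys().cloned().collect();
    dirs.sort_by_key(|d| std::cmp::Reverse(d.components().count()));
    for dir in dirs {
        let total = totals[&dir];
        if let Some(parent) = dir.parent() {
            if let Some(parent_total) = totals.get_mut(parent) {
                *parent_total += total;
            }
        }
    }
    totals
}
```

Child counts and recursive file counts could be accumulated in the same deepest-first loop by carrying a small struct instead of a single `u64`.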
### Phase 4: Content Identification
The final phase generates content-addressed storage (CAS) identifiers and performs deep file analysis:
```rust
// Sampled hashing for large files
let cas_id = cas_generator
    .generate_cas_id(path, file_size)
    .await?;
// Link to content identity for deduplication
content_processor.link_or_create(entry_id, cas_id)?;
```
This phase enables deduplication, content-based search, and file tracking across renames.
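The idea behind sampled hashing can be sketched as below. This is not the algorithm behind `generate_cas_id`: the sample size, the three sample positions, the small-file threshold, and the `sampled_id` name are all assumptions. `DefaultHasher` is only a stand-in from the standard library and is not cryptographic.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};

const SAMPLE_SIZE: u64 = 4096; // bytes per sample (assumed value)
const SMALL_FILE_LIMIT: u64 = 3 * SAMPLE_SIZE; // at or below this, hash everything

/// Derives an identifier from the file size plus samples taken at the
/// start, middle, and end of the data. Hypothetical sketch only.
pub fn sampled_id<R: Read + Seek>(reader: &mut R, file_size: u64) -> io::Result<String> {
    let mut hasher = DefaultHasher::new();
    // Mixing in the size separates files that share sampled regions
    // but differ in length.
    hasher.write_u64(file_size);

    if file_size <= SMALL_FILE_LIMIT {
        // Small input: read and hash the whole content.
        let mut buf = Vec::new();
        reader.seek(SeekFrom::Start(0))?;
        reader.read_to_end(&mut buf)?;
        hasher.write(&buf);
    } else {
        // Large input: hash three fixed-size windows only.
        let offsets = [0, file_size / 2 - SAMPLE_SIZE / 2, file_size - SAMPLE_SIZE];
        let mut buf = vec![0u8; SAMPLE_SIZE as usize];
        for offset in offsets {
            reader.seek(SeekFrom::Start(offset))?;
            reader.read_exact(&mut buf)?;
            hasher.write(&buf);
        }
    }
    Ok(format!("{:016x}", hasher.finish()))
}
```

The trade-off in this sketch is speed against sensitivity: hashing cost stays constant for large files, but a change confined to an unsampled region leaves the identifier unchanged.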
## Indexing Modes and Scopes
The system provides flexible configuration through modes and scopes:
### Index Modes
**Shallow Mode** extracts only filesystem metadata (name, size, dates). Completes in under 500ms for typical directories. Perfect for responsive UI navigation.
**Content Mode** adds cryptographic hashing to identify files by content. Enables deduplication and content tracking. Moderate performance impact.
**Deep Mode** performs full analysis including thumbnails and media metadata extraction. Best for photo and video libraries.
### Index Scopes
**Current Scope** indexes only the immediate directory contents:
```rust
IndexerJobConfig::ui_navigation(location_id, path)
```
**Recursive Scope** indexes the entire directory tree:
```rust
IndexerJobConfig::new(location_id, path, IndexMode::Deep)
```
## Persistence and Ephemeral Indexing
Spacedrive supports both persistent and ephemeral indexing, so the same indexer can back a permanent library location or a one-off browsing session.
### Persistent Indexing
Persistent indexing stores all data in the database permanently. This is the default for library locations:
- Full change detection and history
- Syncs across devices
- Survives application restarts
- Enables offline search
### Ephemeral Indexing
Ephemeral indexing keeps data in memory only, perfect for browsing external drives:
```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Current
);
```
The ephemeral index uses an LRU cache with automatic cleanup:
- No database writes
- Session-based lifetime
- Memory-efficient storage
- Automatic expiration
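For readers unfamiliar with LRU eviction, a minimal sketch using only the standard library is shown below. The `LruCache` type here is hypothetical and unrelated to Spacedrive's actual cache. It uses a linear-time recency update, which is fine for illustration but not for large caches.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Minimal LRU cache: a map for lookups plus a queue recording
/// recency (front = least recently used). Illustrative only.
pub struct LruCache<K, V> {
    capacity: usize,
    map: HashMap<K, V>,
    order: VecDeque<K>,
}

impl<K: Clone + Eq + Hash, V> LruCache<K, V> {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1),
            map: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Moves `key` to the most-recently-used position.
    fn touch(&mut self, key: &K) {
        self.order.retain(|k| k != key); // O(n), acceptable for a sketch
        self.order.push_back(key.clone());
    }

    /// Looks up a value and marks it as recently used.
    pub fn get(&mut self, key: &K) -> Option<&V> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    /// Inserts a value, evicting the least recently used entries
    /// until the cache fits its capacity again.
    pub fn insert(&mut self, key: K, value: V) {
        self.touch(&key);
        self.map.insert(key, value);
        while self.map.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.map.remove(&oldest);
                }
                None => break,
            }
        }
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

Keyed by directory path, a structure like this keeps recently browsed directories in memory and drops the ones that have not been visited for the longest time.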
<Info>
Ephemeral mode lets you explore USB drives or network shares without
permanently adding them to your library.
</Info>
## Job System Integration
The indexing system leverages Spacedrive's job infrastructure for reliability and monitoring.
### State Persistence
When interrupted, the entire job state is serialized:
```rust
#[derive(Serialize, Deserialize)]
pub struct IndexerState {
    phase: Phase,
    dirs_to_walk: VecDeque<PathBuf>,
    entry_batches: Vec<Vec<DirEntry>>,
    entry_id_cache: HashMap<PathBuf, i32>,
    stats: IndexerStats,
    // ... checkpoint data
}
```
This state is stored in the jobs database, separate from your library data. On resume, the job picks up exactly where it left off.
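Resuming from a saved phase can be pictured as a small state machine. The sketch below is an assumption-laden illustration: the variant names of `Phase` are inferred from the four phases described earlier, and `resume_from` with its `run_phase` callback is hypothetical.

```rust
/// The four indexing phases plus a terminal state. Variant names are
/// assumed for this sketch and may differ from the real `Phase` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentIdentification,
    Complete,
}

impl Phase {
    /// The phase that follows this one.
    pub fn next(self) -> Phase {
        match self {
            Phase::Discovery => Phase::Processing,
            Phase::Processing => Phase::Aggregation,
            Phase::Aggregation => Phase::ContentIdentification,
            Phase::ContentIdentification | Phase::Complete => Phase::Complete,
        }
    }
}

/// Runs every phase from the saved one onward and returns the phases
/// that were executed. `run_phase` is a hypothetical callback.
pub fn resume_from<F: FnMut(Phase)>(saved: Phase, mut run_phase: F) -> Vec<Phase> {
    let mut executed = Vec::new();
    let mut phase = saved;
    while phase != Phase::Complete {
        run_phase(phase);
        executed.push(phase);
        phase = phase.next();
    }
    executed
}
```

A job interrupted during aggregation would restart with `resume_from(Phase::Aggregation, ...)` and skip discovery and processing entirely. The rest of the saved state, such as `dirs_to_walk`, lets a phase continue from the middle rather than from its start.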
### Progress Tracking
Real-time progress flows through multiple channels:
```rust
pub struct IndexerProgress {
    phase: String,
    items_done: u64,
    total_items: u64,
    bytes_per_second: f64,
    eta_seconds: Option<u32>,
}
```
Progress updates are:
- Sent to UI via channels
- Persisted to database
- Available through job queries
- Used for time estimates
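One simple way a time estimate could be derived from these counters is the average rate so far. The sketch below is an assumption, not the estimator Spacedrive uses, and `estimate_eta` is a hypothetical function name.

```rust
use std::time::Duration;

/// Estimates remaining seconds from the average item rate so far.
/// Returns `None` when no rate can be derived yet, mirroring the
/// `Option<u32>` type of `eta_seconds`.
pub fn estimate_eta(items_done: u64, total_items: u64, elapsed: Duration) -> Option<u32> {
    let secs = elapsed.as_secs_f64();
    if items_done == 0 || secs <= 0.0 {
        return None; // no throughput measured yet
    }
    let rate = items_done as f64 / secs; // items per second
    let remaining = total_items.saturating_sub(items_done) as f64;
    Some((remaining / rate).ceil() as u32)
}
```

For example, 250 of 1,000 items done after 10 seconds gives a rate of 25 items per second and an estimate of 30 seconds for the remaining 750 items.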
### Error Handling
The job system provides structured error handling:
**Non-critical errors** are accumulated but don't stop indexing:
- Permission denied on individual files
- Corrupted metadata
- Unsupported file types
**Critical errors** halt the job with state preserved:
- Database connection lost
- Filesystem unmounted
- Out of disk space
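The split between accumulated and fatal errors can be sketched with an enum and a classification method. The `IndexError` type and `run_items` function below are hypothetical. Only the six error cases themselves come from the lists above.

```rust
/// Hypothetical error type covering the cases listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum IndexError {
    PermissionDenied(String),
    CorruptedMetadata(String),
    UnsupportedFileType(String),
    DatabaseConnectionLost,
    FilesystemUnmounted,
    OutOfDiskSpace,
}

impl IndexError {
    /// Critical errors halt the job; everything else is recorded.
    pub fn is_critical(&self) -> bool {
        matches!(
            self,
            IndexError::DatabaseConnectionLost
                | IndexError::FilesystemUnmounted
                | IndexError::OutOfDiskSpace
        )
    }
}

/// Consumes per-item results in order. Non-critical errors are
/// collected and returned; the first critical error stops processing.
pub fn run_items(results: Vec<Result<(), IndexError>>) -> Result<Vec<IndexError>, IndexError> {
    let mut warnings = Vec::new();
    for result in results {
        match result {
            Ok(()) => {}
            Err(e) if e.is_critical() => return Err(e),
            Err(e) => warnings.push(e),
        }
    }
    Ok(warnings)
}
```

Keeping the classification in one method means the set of job-halting errors is decided in a single place.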
## Database Schema
The indexer populates several key tables designed for query performance.
### Entries Table
The core table uses materialized paths for efficient queries:
```sql
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    uuid UUID UNIQUE,
    location_id INTEGER,
    relative_path TEXT,  -- Parent path (materialized)
    name TEXT,           -- Without extension
    extension TEXT,
    kind INTEGER,        -- 0=File, 1=Directory
    size BIGINT,
    inode BIGINT,        -- Change detection
    content_id INTEGER
);
-- Key indexes for performance
CREATE INDEX idx_entries_location_path
    ON entries(location_id, relative_path);
```
### Content Identities Table
Enables deduplication across your library:
```sql
CREATE TABLE content_identities (
    id INTEGER PRIMARY KEY,
    cas_id TEXT UNIQUE,
    kind_id INTEGER,
    total_size BIGINT,
    entry_count INTEGER
);
```
## Performance Characteristics
Indexing performance varies by mode and scope:
| Configuration | Performance | Use Case |
| ------------------- | -------------- | --------------- |
| Current + Shallow | `<500ms` | UI navigation |
| Recursive + Shallow | ~10K files/sec | Quick scan |
| Recursive + Content | ~1K files/sec | Normal indexing |
| Recursive + Deep | ~100 files/sec | Media libraries |
### Optimization Techniques
**Batch Processing**: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.
**Parallel I/O**: Content identification runs on multiple threads, saturating disk bandwidth on fast storage.
**Smart Caching**: The entry ID cache eliminates redundant parent lookups, critical for deep directory trees.
**Checkpoint Strategy**: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
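The "every 5,000 items or 30 seconds" rule amounts to a two-condition check. A minimal sketch follows, with `CheckpointPolicy` as a hypothetical type. Only the two thresholds come from the text above.

```rust
use std::time::Duration;

/// Hypothetical policy object holding the two checkpoint thresholds.
pub struct CheckpointPolicy {
    pub max_items: u64,
    pub max_interval: Duration,
}

impl Default for CheckpointPolicy {
    fn default() -> Self {
        Self {
            max_items: 5_000,
            max_interval: Duration::from_secs(30),
        }
    }
}

impl CheckpointPolicy {
    /// True once either threshold has been reached since the last
    /// checkpoint, whichever comes first.
    pub fn should_checkpoint(&self, items_since_last: u64, since_last: Duration) -> bool {
        items_since_last >= self.max_items || since_last >= self.max_interval
    }
}
```

The item threshold bounds how much work is redone after a crash on fast storage, while the time threshold does the same on slow storage where 5,000 items might take minutes.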
## Change Detection
The indexer efficiently detects changes without full rescans:
```rust
// Platform-specific change detection
#[cfg(unix)]
use std::os::unix::fs::MetadataExt; // brings `ino()` into scope

#[cfg(unix)]
let file_id = metadata.ino(); // inode
#[cfg(windows)]
let file_id = get_file_index(path)?; // File index
```
Detection capabilities:
- New files: Appear with unknown inodes
- Modified files: Same inode, different size/mtime
- Moved files: Known inode at new path
- Deleted files: Missing from filesystem walk
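The four cases above map onto a lookup keyed by file identifier. The sketch below is illustrative only: `Known`, `Change`, `classify`, and `deleted` are hypothetical names, and it ignores the fact that a filesystem can reuse an inode number after a file is deleted.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// What the index last recorded for a file (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Known {
    pub path: PathBuf,
    pub size: u64,
    pub mtime: i64,
}

#[derive(Debug, PartialEq)]
pub enum Change {
    New,
    Modified,
    Moved,
    Unchanged,
}

/// Classifies a file seen during the walk against the indexed state.
pub fn classify(index: &HashMap<u64, Known>, inode: u64, seen: &Known) -> Change {
    match index.get(&inode) {
        None => Change::New, // unknown inode
        Some(k) if k.path != seen.path => Change::Moved, // known inode, new path
        Some(k) if k.size != seen.size || k.mtime != seen.mtime => Change::Modified,
        Some(_) => Change::Unchanged,
    }
}

/// Indexed inodes that the walk never encountered are deletions.
pub fn deleted(index: &HashMap<u64, Known>, walked: &[u64]) -> Vec<u64> {
    let mut gone: Vec<u64> = index
        .keys()
        .copied()
        .filter(|inode| !walked.contains(inode))
        .collect();
    gone.sort_unstable();
    gone
}
```

In this sketch a file that was both moved and edited is reported as `Moved`. A fuller version would report both facts.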
## Usage Examples
### Quick UI Navigation
For responsive directory browsing:
```rust
let config = IndexerJobConfig::ui_navigation(location_id, path);
let handle = library.jobs().dispatch(IndexerJob::new(config)).await?;
```
### External Drive Browsing
Explore without permanent storage:
```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Recursive
);
let job = IndexerJob::new(config);
```
### Full Library Location
Comprehensive indexing with all features:
```rust
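// Chain the builder calls so the configured value is the one bound
// to `config`; calling them on a separate statement discards the result.
let config = IndexerJobConfig::new(
    location_id,
    path,
    IndexMode::Deep
)
.with_checkpointing(true)
.with_filters(indexer_rules);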
```
## CLI Commands
The indexer is fully accessible through the CLI:
```bash
# Quick current directory scan
spacedrive index quick-scan ~/Documents
# Browse external drive
spacedrive index browse /media/usb --ephemeral
# Full location with progress monitoring
spacedrive index location ~/Pictures --mode deep
spacedrive job monitor # Watch progress
```
## Troubleshooting
### Common Issues
**Slow Indexing**: Check for large `node_modules` or build directories. Use `.spacedriveignore` files to exclude them.
**High Memory Usage**: Reduce batch size or avoid ephemeral mode for very large directories.
**Resume Not Working**: Ensure the jobs database isn't corrupted. Check logs for serialization errors.
### Debug Tools
Enable detailed logging:
```bash
RUST_LOG=sd_core::ops::indexing=debug spacedrive start
```
Inspect job state:
```bash
spacedrive job info <job-id> --detailed
```
## Platform Notes
**Windows**: Uses file indices for change detection. Supports long paths transparently. Network drives may require polling.
**macOS**: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.
**Linux**: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.
## Best Practices
1. **Start shallow** for new locations to verify configuration
2. **Use filters** to exclude build artifacts and caches
3. **Monitor progress** through the job system instead of polling
4. **Schedule deep scans** during low-usage periods
5. **Enable checkpointing** for locations over 100K files
<Warning>
Always let indexing jobs complete or pause them properly. Force-killing can
corrupt the job state.
</Warning>
## Related Documentation
- [Jobs](/docs/core/jobs) - Job system architecture
- [Locations](/docs/core/locations) - Directory management
- [Search](/docs/core/search) - Querying indexed data
- [Performance](/docs/core/performance) - Optimization guide