---
title: Indexing
sidebarTitle: Indexing
---

The indexing system discovers and analyzes your files through a multi-phase process. Built on Spacedrive's job system, it provides resumable operations and real-time progress tracking, and supports both persistent library indexing and ephemeral browsing of external drives.

## Architecture Overview

The indexing system consists of several key components working together:

**IndexerJob** orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.

**IndexerState** preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, and accumulated statistics.

**EntryProcessor** handles the complex task of creating and updating database records while maintaining referential integrity through materialized paths.

**FileTypeRegistry** identifies files through a combination of extensions, magic bytes, and content analysis to provide accurate type detection.

The system integrates deeply with Spacedrive's job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.

<Note>
  Indexing jobs can run for hours on large directories. The resumable
  architecture ensures no work is lost if interrupted.
</Note>

## Indexing Phases

The indexer operates through four distinct phases, each designed to be interruptible and resumable:

### Phase 1: Discovery

The discovery phase walks your filesystem to build a list of all files and directories. This phase is optimized for speed, collecting just enough information to plan the work ahead:

```rust
// Discovery maintains a queue of directories to process
pub struct DiscoveryPhase {
    dirs_to_walk: VecDeque<PathBuf>,
    seen_paths: HashSet<PathBuf>, // Cycle detection
}
```

The phase uses a breadth-first traversal to ensure shallow directories are processed first, providing quicker initial results. Progress is measured by directories discovered versus total estimated.
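
For illustration, a breadth-first walk with cycle detection can be sketched as follows. This is a simplified standalone example, not the actual `DiscoveryPhase` implementation; the canonicalization-based cycle check is an assumption.

```rust
use std::collections::{HashSet, VecDeque};
use std::path::PathBuf;

/// Minimal breadth-first directory walk with cycle detection.
fn discover(root: PathBuf) -> std::io::Result<Vec<PathBuf>> {
    let mut dirs_to_walk = VecDeque::from([root]);
    let mut seen_paths: HashSet<PathBuf> = HashSet::new();
    let mut discovered = Vec::new();

    while let Some(dir) = dirs_to_walk.pop_front() {
        // Canonicalize so symlink cycles resolve to an already-seen path
        if !seen_paths.insert(dir.canonicalize()?) {
            continue;
        }
        for entry in std::fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            if entry.file_type()?.is_dir() {
                // Queue subdirectories; shallow levels are processed first
                dirs_to_walk.push_back(path.clone());
            }
            discovered.push(path);
        }
    }
    Ok(discovered)
}
```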

### Phase 2: Processing

Processing creates or updates database entries for each discovered item. This is where Spacedrive builds its understanding of your file structure:

```rust
// Batch processing for efficiency
const BATCH_SIZE: usize = 1000;

// Process entries in parent-first order
let sorted_batch = batch.sort_by_depth();
persistence.process_batch(sorted_batch, &mut entry_cache)?;
```

The system uses materialized paths instead of parent IDs, making queries faster and eliminating complex recursive lookups. Each entry stores its full path prefix, enabling instant directory listings.
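
To make this concrete, the queries below show how a materialized-path layout answers directory listings against the `entries` table described later on this page. They are illustrative statements (shown here as Rust constants), not the exact SQL the indexer issues.

```rust
/// Direct children: one indexed equality match on the parent path.
const DIRECT_CHILDREN: &str =
    "SELECT name, kind, size FROM entries \
     WHERE location_id = ?1 AND relative_path = ?2";

/// Entire subtree: a prefix match on the same column, so no recursive
/// parent-ID lookups are needed.
const SUBTREE: &str =
    "SELECT name, kind, size FROM entries \
     WHERE location_id = ?1 AND relative_path LIKE ?2 || '%'";
```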

### Phase 3: Aggregation

Aggregation calculates sizes and counts for directories by traversing the tree bottom-up (see the sketch after the list below). This phase provides the statistics you see in the UI:

- Total size including subdirectories
- Direct child count
- Recursive file count
- Aggregate content types
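
A minimal sketch of the size calculation, assuming in-memory maps rather than the real database-backed pass: crediting each file to every ancestor directory yields the same totals as a bottom-up traversal.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Illustrative recursive-size aggregation: each file's size is added to
/// every ancestor directory, so a directory total includes all subtrees.
fn aggregate_sizes(file_sizes: &HashMap<PathBuf, u64>) -> HashMap<PathBuf, u64> {
    let mut totals: HashMap<PathBuf, u64> = HashMap::new();
    for (path, size) in file_sizes {
        let mut current = path.parent();
        while let Some(dir) = current {
            *totals.entry(dir.to_path_buf()).or_default() += *size;
            current = dir.parent();
        }
    }
    totals
}
```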

### Phase 4: Content Identification

The final phase generates content-addressed storage (CAS) identifiers and performs deep file analysis:

```rust
// Sampled hashing for large files
let cas_id = cas_generator
    .generate_cas_id(path, file_size)
    .await?;

// Link to content identity for deduplication
content_processor.link_or_create(entry_id, cas_id)?;
```

This phase enables deduplication, content-based search, and file tracking across renames.
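
The idea behind sampled hashing is to read only a few fixed-size chunks plus the file size rather than the whole file. The sketch below is an assumption for illustration only: the chunk positions, chunk size, and hash function are not Spacedrive's actual CAS scheme, which would use a cryptographic hash such as BLAKE3 rather than the standard-library hasher used here to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{Read, Seek, SeekFrom};

/// Illustrative sampled hash: combine the file size with chunks taken from
/// the start, middle, and end of the file.
fn sampled_hash(path: &std::path::Path, file_size: u64) -> std::io::Result<u64> {
    const CHUNK: usize = 64 * 1024;
    let mut file = std::fs::File::open(path)?;
    let mut hasher = DefaultHasher::new();
    file_size.hash(&mut hasher);

    for offset in [0, file_size / 2, file_size.saturating_sub(CHUNK as u64)] {
        let mut buf = vec![0u8; CHUNK];
        file.seek(SeekFrom::Start(offset))?;
        let n = file.read(&mut buf)?;
        buf[..n].hash(&mut hasher);
    }
    Ok(hasher.finish())
}
```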

## Indexing Modes and Scopes

The system provides flexible configuration through modes and scopes:

### Index Modes

**Shallow Mode** extracts only filesystem metadata (name, size, dates). It completes in under 500ms for typical directories, making it ideal for responsive UI navigation.

**Content Mode** adds cryptographic hashing to identify files by content, enabling deduplication and content tracking at a moderate performance cost.

**Deep Mode** performs full analysis, including thumbnail generation and media metadata extraction. Best for photo and video libraries.

### Index Scopes

**Current Scope** indexes only the immediate directory contents:

```rust
IndexerJobConfig::ui_navigation(location_id, path)
```

**Recursive Scope** indexes the entire directory tree:

```rust
IndexerJobConfig::new(location_id, path, IndexMode::Deep)
```

## Persistence and Ephemeral Indexing

One of Spacedrive's key innovations is supporting both persistent and ephemeral indexing modes.

### Persistent Indexing

Persistent indexing stores all data in the database permanently. This is the default for library locations:

- Full change detection and history
- Syncs across devices
- Survives application restarts
- Enables offline search

### Ephemeral Indexing

Ephemeral indexing keeps data in memory only, perfect for browsing external drives:

```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Current
);
```

The ephemeral index uses an LRU cache with automatic cleanup (a simplified sketch follows the list below):

- No database writes
- Session-based lifetime
- Memory-efficient storage
- Automatic expiration
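
As a rough illustration of the expiration behavior, the sketch below keeps entries in memory with a time-to-live and drops them lazily on access. It is an assumption-level example: the real ephemeral index also enforces an LRU size bound, and none of the names here come from the Spacedrive codebase.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Minimal session cache with time-based expiration. Entries older than the
/// time-to-live are dropped on access; nothing is ever written to disk.
struct EphemeralCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V> EphemeralCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, key: String, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }

    fn get(&mut self, key: &str) -> Option<&V> {
        // Expire stale entries lazily instead of running a background task
        let expired = match self.entries.get(key) {
            Some((created, _)) => created.elapsed() > self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, v)| v)
    }
}
```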

<Info>
  Ephemeral mode lets you explore USB drives or network shares without
  permanently adding them to your library.
</Info>

## Job System Integration

The indexing system leverages Spacedrive's job infrastructure for reliability and monitoring.

### State Persistence

When interrupted, the entire job state is serialized:

```rust
#[derive(Serialize, Deserialize)]
pub struct IndexerState {
    phase: Phase,
    dirs_to_walk: VecDeque<PathBuf>,
    entry_batches: Vec<Vec<DirEntry>>,
    entry_id_cache: HashMap<PathBuf, i32>,
    stats: IndexerStats,
    // ... checkpoint data
}
```

This state is stored in the jobs database, separate from your library data. On resume, the job picks up exactly where it left off.
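
Conceptually, pausing and resuming is a serialize/deserialize round trip. The sketch below assumes the `serde` and `rmp-serde` crates for the MessagePack encoding and uses invented `save_state`/`load_state` helpers; it is not the job system's actual API.

```rust
use serde::{Deserialize, Serialize};

/// Toy stand-in for the real IndexerState.
#[derive(Serialize, Deserialize)]
struct CheckpointedState {
    phase: u8,
    dirs_remaining: Vec<String>,
    items_done: u64,
}

/// On pause: serialize the state to MessagePack bytes for the jobs database.
fn save_state(state: &CheckpointedState) -> Result<Vec<u8>, rmp_serde::encode::Error> {
    rmp_serde::to_vec(state)
}

/// On resume: restore the state and continue from the recorded position.
fn load_state(bytes: &[u8]) -> Result<CheckpointedState, rmp_serde::decode::Error> {
    rmp_serde::from_slice(bytes)
}
```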

### Progress Tracking

Real-time progress flows through multiple channels:

```rust
pub struct IndexerProgress {
    phase: String,
    items_done: u64,
    total_items: u64,
    bytes_per_second: f64,
    eta_seconds: Option<u32>,
}
```

Progress updates are (see the example after this list):

- Sent to UI via channels
- Persisted to database
- Available through job queries
- Used for time estimates
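
For example, forwarding progress to a consumer over a channel could look like the following sketch. The channel setup and the progress values are illustrative; the actual wiring between the job system and the frontend is more involved.

```rust
use std::sync::mpsc;
use std::thread;

/// Simplified progress message; mirrors a subset of the fields shown above.
struct Progress {
    phase: String,
    items_done: u64,
    total_items: u64,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Progress>();

    // Worker side: emit an update after each processed batch
    let worker = thread::spawn(move || {
        for done in (0..=5_000u64).step_by(1_000) {
            tx.send(Progress {
                phase: "processing".into(),
                items_done: done,
                total_items: 5_000,
            })
            .expect("UI receiver dropped");
        }
    });

    // UI side: render whatever arrives on the channel
    for update in rx {
        println!(
            "[{}] {}/{} items",
            update.phase, update.items_done, update.total_items
        );
    }
    worker.join().unwrap();
}
```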

### Error Handling

The job system provides structured error handling that distinguishes two severities (see the sketch after the lists below):

**Non-critical errors** are accumulated but don't stop indexing:

- Permission denied on individual files
- Corrupted metadata
- Unsupported file types

**Critical errors** halt the job with state preserved:

- Database connection lost
- Filesystem unmounted
- Out of disk space
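
A minimal way to model this split, with invented names rather than the actual error types:

```rust
/// Illustrative severity split for indexing errors.
enum IndexerError {
    /// Recorded and reported, but the walk continues.
    NonCritical(String),
    /// Stops the job; the checkpointed state is kept for resumption.
    Critical(String),
}

/// Accumulate non-critical errors and bail out only on critical ones.
fn handle(error: IndexerError, accumulated: &mut Vec<String>) -> Result<(), String> {
    match error {
        IndexerError::NonCritical(msg) => {
            accumulated.push(msg);
            Ok(())
        }
        IndexerError::Critical(msg) => Err(msg),
    }
}
```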

## Database Schema

The indexer populates several key tables designed for query performance.

### Entries Table

The core table uses materialized paths for efficient queries:

```sql
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    uuid UUID UNIQUE,
    location_id INTEGER,
    relative_path TEXT,  -- Parent path (materialized)
    name TEXT,           -- Without extension
    extension TEXT,
    kind INTEGER,        -- 0=File, 1=Directory
    size BIGINT,
    inode BIGINT,        -- Change detection
    content_id INTEGER
);

-- Key indexes for performance
CREATE INDEX idx_entries_location_path
    ON entries(location_id, relative_path);
```

### Content Identities Table

Enables deduplication across your library (see the example query after the schema):

```sql
CREATE TABLE content_identities (
    id INTEGER PRIMARY KEY,
    cas_id TEXT UNIQUE,
    kind_id INTEGER,
    total_size BIGINT,
    entry_count INTEGER
);
```
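
As an example of what this table enables, the query below (shown as a Rust constant, matching the other sketches on this page) lists content that appears more than once, largest first. It is illustrative, not an actual Spacedrive query.

```rust
/// Content that appears more than once in the library, largest first.
const DUPLICATE_CONTENT: &str =
    "SELECT cas_id, entry_count, total_size \
     FROM content_identities \
     WHERE entry_count > 1 \
     ORDER BY total_size DESC";
```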

## Performance Characteristics

Indexing performance varies by mode and scope:

| Configuration       | Performance    | Use Case        |
| ------------------- | -------------- | --------------- |
| Current + Shallow   | `<500ms`       | UI navigation   |
| Recursive + Shallow | ~10K files/sec | Quick scan      |
| Recursive + Content | ~1K files/sec  | Normal indexing |
| Recursive + Deep    | ~100 files/sec | Media libraries |

### Optimization Techniques

**Batch Processing**: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.

**Parallel I/O**: Content identification runs on multiple threads, saturating disk bandwidth on fast storage.

**Smart Caching**: The entry ID cache eliminates redundant parent lookups, critical for deep directory trees.

**Checkpoint Strategy**: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
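
The dual item-count/elapsed-time trigger can be captured in a few lines. The thresholds mirror the numbers above, while the struct and method names are invented for illustration.

```rust
use std::time::{Duration, Instant};

/// Decides when to persist a checkpoint: after N items or after T seconds,
/// whichever comes first.
struct CheckpointPolicy {
    items_since_last: u64,
    last_checkpoint: Instant,
}

impl CheckpointPolicy {
    const MAX_ITEMS: u64 = 5_000;
    const MAX_AGE: Duration = Duration::from_secs(30);

    /// Returns true when the caller should serialize the job state now.
    fn record_item(&mut self) -> bool {
        self.items_since_last += 1;
        let due = self.items_since_last >= Self::MAX_ITEMS
            || self.last_checkpoint.elapsed() >= Self::MAX_AGE;
        if due {
            // Caller serializes IndexerState here, then the counters reset
            self.items_since_last = 0;
            self.last_checkpoint = Instant::now();
        }
        due
    }
}
```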

## Change Detection

The indexer efficiently detects changes without full rescans:

```rust
// Platform-specific change detection
#[cfg(unix)]
let file_id = metadata.ino(); // inode

#[cfg(windows)]
let file_id = get_file_index(path)?; // File index
```

Detection capabilities (see the sketch after this list):

- New files: Appear with unknown inodes
- Modified files: Same inode, different size/mtime
- Moved files: Known inode at new path
- Deleted files: Missing from filesystem walk
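
The comparison itself reduces to matching what the walk found against what the database remembers, keyed by the platform file ID. The types below are invented for illustration.

```rust
use std::collections::HashMap;

/// Illustrative change classification against previously indexed entries.
enum Change {
    New,
    Modified,
    Moved { from: String },
    Unchanged,
    // Deletions are the inodes in `known` that the walk never reported.
}

struct KnownEntry {
    path: String,
    size: u64,
    mtime: i64,
}

fn classify(
    inode: u64,
    path: &str,
    size: u64,
    mtime: i64,
    known: &HashMap<u64, KnownEntry>, // keyed by inode / file index
) -> Change {
    match known.get(&inode) {
        None => Change::New,
        Some(prev) if prev.path != path => Change::Moved { from: prev.path.clone() },
        Some(prev) if prev.size != size || prev.mtime != mtime => Change::Modified,
        Some(_) => Change::Unchanged,
    }
}
```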

## Usage Examples

### Quick UI Navigation

For responsive directory browsing:

```rust
let config = IndexerJobConfig::ui_navigation(location_id, path);
let handle = library.jobs().dispatch(IndexerJob::new(config)).await?;
```

### External Drive Browsing

Explore without permanent storage:

```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Recursive
);
let job = IndexerJob::new(config);
```

### Full Library Location

Comprehensive indexing with all features:

```rust
let config = IndexerJobConfig::new(
    location_id,
    path,
    IndexMode::Deep
);
config.with_checkpointing(true)
    .with_filters(indexer_rules);
```

## CLI Commands

The indexer is fully accessible through the CLI:

```bash
# Quick current directory scan
spacedrive index quick-scan ~/Documents

# Browse external drive
spacedrive index browse /media/usb --ephemeral

# Full location with progress monitoring
spacedrive index location ~/Pictures --mode deep
spacedrive job monitor # Watch progress
```

## Troubleshooting

### Common Issues

**Slow Indexing**: Check for large node_modules or build directories. Use `.spacedriveignore` files to exclude them.

**High Memory Usage**: Reduce batch size or avoid ephemeral mode for very large directories.

**Resume Not Working**: Ensure the jobs database isn't corrupted. Check logs for serialization errors.

### Debug Tools

Enable detailed logging:

```bash
RUST_LOG=sd_core::ops::indexing=debug spacedrive start
```

Inspect job state:

```bash
spacedrive job info <job-id> --detailed
```

## Platform Notes

**Windows**: Uses file indices for change detection. Supports long paths transparently. Network drives may require polling.

**macOS**: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.

**Linux**: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.

## Best Practices

1. **Start shallow** for new locations to verify configuration
2. **Use filters** to exclude build artifacts and caches
3. **Monitor progress** through the job system instead of polling
4. **Schedule deep scans** during low-usage periods
5. **Enable checkpointing** for locations over 100K files

<Warning>
  Always let indexing jobs complete or pause them properly. Force-killing can
  corrupt the job state.
</Warning>

## Related Documentation

- [Jobs](/docs/core/jobs) - Job system architecture
- [Locations](/docs/core/locations) - Directory management
- [Search](/docs/core/search) - Querying indexed data
- [Performance](/docs/core/performance) - Optimization guide