Benchmarking Suite

This document explains how to use and extend the benchmarking suite that lives in benchmarks/. It covers concepts, CLI commands, recipe schema, data generation, scenarios, metrics, reporting, CI guidance, and troubleshooting.

Goals

  • Reliable, reproducible performance evaluation of core workflows (e.g., indexing discovery, content identification).
  • Modular architecture: add scenarios, reporters, and data generators without touching the core wiring.
  • CI-friendly: deterministic runs, structured outputs, small quick recipes for PR checks.

Overview

  • benchmarks/ is a standalone Rust crate that provides:

    • CLI binary: sd-bench
    • Dataset generator(s): benchmarks/src/generator/
    • Scenarios: benchmarks/src/scenarios/
    • Runner & metrics: benchmarks/src/runner/, benchmarks/src/metrics/
    • Reporting: benchmarks/src/reporting/
    • Recipes (YAML): benchmarks/recipes/
    • Results (JSON): benchmarks/results/
  • The CLI boots the core in an isolated data directory, enables job logging, creates/opens a dedicated benchmark library if needed, and orchestrates scenario execution.

Installation

  • Requirements: Rust toolchain, workspace builds.
  • Build the bench crate:
    • cargo build -p sd-bench --bin sd-bench

Quickstart

  • Generate the dataset for one recipe:

    • cargo run -p sd-bench -- mkdata --recipe benchmarks/recipes/shape_small.yaml
  • Generate datasets for all recipes in a directory (each dataset defaults to the recipe's locations[].path):

    • cargo run -p sd-bench -- mkdata-all --recipes-dir benchmarks/recipes
  • Generate datasets on an external disk without changing recipes (prefix relative recipe paths):

    • cargo run -p sd-bench -- mkdata-all --recipes-dir benchmarks/recipes --dataset-root /Volumes/YourHDD
  • Run one scenario with one recipe and write a JSON summary:

    • Discovery: cargo run -p sd-bench -- run --scenario indexing-discovery --recipe benchmarks/recipes/shape_small.yaml --out-json benchmarks/results/shape_small-indexing-discovery-nvme.json
    • Content identification: cargo run -p sd-bench -- run --scenario content-identification --recipe benchmarks/recipes/shape_small.yaml --out-json benchmarks/results/shape_small-content-identification-nvme.json
  • Run all scenarios on multiple locations with automatic hardware detection:

    # Run all scenarios (discovery, aggregation, content-id) on both NVMe and HDD
    cargo run -p sd-bench -- run-all --locations "/tmp/benchdata" "/Volumes/Seagate/benchdata"
    
    # Run specific scenarios on multiple locations
    cargo run -p sd-bench -- run-all \
      --scenarios indexing-discovery aggregation \
      --locations "/Users/me/benchdata" "/Volumes/HDD/benchdata" "/Volumes/SSD/benchdata"
    
    # Filter to only shape recipes
    cargo run -p sd-bench -- run-all \
      --locations "/tmp/benchdata" "/Volumes/Seagate/benchdata" \
      --recipe-filter "^shape_"
    
  • Generate CSV reports from JSON summaries:

    • cargo run -p sd-bench -- results-table --results-dir benchmarks/results --out benchmarks/results/whitepaper_metrics.csv --format csv

The CLI always prints a brief stdout summary and (if applicable) the path to the generated JSON. It also prints job log paths for later inspection.

Commands

  • mkdata --recipe <path> [--dataset-root <path>]
    • Generates a dataset based on a YAML recipe (see Recipe Schema below).
    • With --dataset-root, any relative locations[].path in the recipe is prefixed with this path (absolute paths are left unchanged). Useful for targeting an external HDD.
  • mkdata-all [--recipes-dir <dir>] [--dataset-root <path>] [--recipe-filter <regex>]
    • Scans a directory for .yaml / .yml and runs mkdata for each file.
    • --dataset-root prefixes relative locations[].path as above.
    • --recipe-filter filters recipe files by filename (regex applied to file stem), e.g. ^hdd_.
  • run --scenario <name> --recipe <path> [--out-json <path>] [--dataset-root <path>]
    • Boots an isolated core, ensures a benchmark library, adds recipe locations, waits for jobs to finish.
    • Summarizes metrics to stdout; optionally writes JSON summary at --out-json.
    • --dataset-root prefixes relative locations[].path at runtime (absolute paths untouched).
  • run-all [--scenarios <names...>] [--locations <paths...>] [--recipes-dir <dir>] [--out-dir <dir>] [--skip-generate] [--recipe-filter <regex>]
    • Runs every combination of scenarios × locations × recipes, automatically detecting the hardware type of each volume.
    • --scenarios: Optional list of scenarios to run. If not specified, runs all: indexing-discovery, aggregation, content-identification.
    • --locations: List of paths where datasets should be generated/benchmarked. Hardware type is automatically detected from the volume (e.g., NVMe, HDD, SSD).
    • Output files are automatically named: {recipe}-{scenario}-{hardware}.json (e.g., shape_small-indexing-discovery-nvme.json).
    • With --skip-generate, no datasets are generated; they must already exist at the given locations.
    • --recipe-filter selects a subset of recipes by regex on filename stem (e.g., ^shape_ for shape recipes only).
    • The system automatically handles the benchdata/ prefix in recipes, so you can specify /tmp/benchdata and it will create /tmp/benchdata/shape_small etc.

Architecture

  • Thin binary: benchmarks/src/bin/sd-bench-new.rs delegates to benchmarks/src/cli/commands.rs.
  • Core modules exported via benchmarks/src/mod_new.rs:
    • generator/ (dataset generation)
    • scenarios/ (Scenario trait implementations)
    • runner/ (orchestration & report emission)
    • metrics/ (result model and phase timings)
    • reporting/ (reporters like JSON)
    • core_boot/ (isolated core boot + job logging)
    • recipe/ (schema + validation)
    • util/ (helpers)

Recipe Schema

YAML schema (see benchmarks/recipes/*.yaml). Recipe names no longer need hardware prefixes; hardware is auto-detected from the volume. Example:

name: shape_small
seed: 12345
locations:
  - path: benchdata/shape_small  # Note: 'benchdata/' prefix is handled automatically
    structure:
      depth: 2
      fanout_per_dir: 8
    files:
      total: 5000
      size_buckets:
        small: { range: [4096, 131072], share: 0.6 }
        medium: { range: [1048576, 5242880], share: 0.3 }
        large: { range: [5242880, 10485760], share: 0.1 }
      extensions: [pdf, zip, jpg, txt]
      duplicate_ratio: 0.1
      content_gen:
        mode: partial # zeros | partial | full
        sample_block_size: 10240 # 10 KiB; aligns with content hashing sample size
        magic_headers: true # write registry-derived magic bytes
media:
  generate_thumbnails: false

Desktop-Scale Recipes

For testing realistic desktop scenarios, including job resumption and long-running indexing operations:

desktop_complex.yaml - Realistic desktop environment (500k files, 8 levels deep):

name: desktop_complex
seed: 42424242
locations:
  - path: benchdata/desktop_complex
    structure:
      depth: 8  # Deep nesting like real file systems
      fanout_per_dir: 25  # Many directories per level
    files:
      total: 500000  # Half million files - realistic desktop scale
      size_buckets:
        tiny: { range: [0, 4096], share: 0.25 }
        small: { range: [4096, 1048576], share: 0.35 }
        medium: { range: [1048576, 50000000], share: 0.25 }
        large: { range: [50000000, 500000000], share: 0.10 }
        huge: { range: [500000000, 4000000000], share: 0.05 }
      extensions: [txt, md, pdf, jpg, png, mp4, zip, py, js, rs]  # ... many more in the full recipe
      duplicate_ratio: 0.15
      content_gen:
        mode: partial
        sample_block_size: 10240
        magic_headers: true

desktop_extreme.yaml - Power user environment (1M files, 12 levels deep):

  • 1,000,000 files across 12 directory levels
  • Comprehensive file type coverage (100+ extensions)
  • Realistic size distribution including very large files (up to 8GB)
  • 20% duplicate ratio for realistic backup/copy scenarios

Fields

  • name: logical recipe name.
  • seed: RNG seed (deterministic runs). If omitted, one is derived from entropy.
  • locations[]:
    • path: base directory for generated files.
    • structure.depth: max nested subdirectory depth (randomized per file up to this depth).
    • structure.fanout_per_dir: number of subdirectory options at each level.
    • files.total: total files per location (before duplicates).
    • files.size_buckets: map of bucket name => { range: [min, max], share }; shares are normalized.
    • files.extensions: file extension sampling pool (e.g., [pdf, zip, jpg, txt]).
    • files.duplicate_ratio: fraction of files created as duplicates (hardlinked where possible, copied as a fallback).
    • files.content_gen:
      • mode:
        • zeros: sparse file; fast; not realistic for content identification.
        • partial: writes header + evenly spaced samples + footer; gaps remain sparse zeros; matches content hashing sampling points.
        • full: fills the entire file with deterministic bytes; slowest, most realistic.
      • sample_block_size: size of each inner sample block (default 10 KiB). Leave at 10 KiB to match the content hashing algorithm.
      • magic_headers: if true, writes file signature patterns based on the file_type registry for the chosen extension.
  • media: reserved for future synthetic media generation; currently optional and a no-op.
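
The fields above map naturally onto serde structs. The sketch below is illustrative only; the authoritative definitions live in benchmarks/src/recipe/schema.rs and may differ in naming and optionality.

use serde::Deserialize;
use std::{collections::HashMap, path::PathBuf};

// Illustrative mirror of the YAML schema; not the canonical definition.
#[derive(Debug, Deserialize)]
struct Recipe {
    name: String,
    seed: Option<u64>, // derived from entropy when omitted
    locations: Vec<LocationSpec>,
}

#[derive(Debug, Deserialize)]
struct LocationSpec {
    path: PathBuf, // relative paths honor --dataset-root
    structure: Structure,
    files: FilesSpec,
}

#[derive(Debug, Deserialize)]
struct Structure {
    depth: u32,          // max nesting depth, randomized per file
    fanout_per_dir: u32, // subdirectory options per level
}

#[derive(Debug, Deserialize)]
struct FilesSpec {
    total: u64,
    size_buckets: HashMap<String, SizeBucket>,
    extensions: Vec<String>,
    duplicate_ratio: f64,
    content_gen: Option<ContentGen>,
}

#[derive(Debug, Deserialize)]
struct SizeBucket {
    range: (u64, u64), // [min, max] in bytes
    share: f64,        // normalized across buckets at load time
}

#[derive(Debug, Deserialize)]
struct ContentGen {
    mode: String, // "zeros" | "partial" | "full"
    sample_block_size: Option<u64>,
    magic_headers: Option<bool>,
}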

Content Generation Details

  • The generator can write content that aligns with the content hash sampling algorithm in src/domain/content_identity.rs:
    • For large files (> 100 KiB):
      • Includes file size (handled by the hash function).
      • Hashes a header (8 KiB), 4 evenly spaced inner samples (default 10 KiB each), and a footer (8 KiB).
    • For small files: full-content hashing.
  • partial mode writes the header/samples/footer only (deterministic pseudo-random bytes), leaving gaps as sparse zeros. This yields realistic, stable hashes without full writes.
  • full mode writes deterministic content for the entire file for maximum realism.
  • magic_headers: true uses sd_core::file_type::FileTypeRegistry to write magic byte signatures for the chosen extension when available.
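
To make the alignment concrete, the sketch below computes the byte spans that partial mode needs to fill with real data, using the figures above (8 KiB header/footer, four inner samples, 100 KiB small-file cutoff). The even spacing of the inner samples is an assumption; src/domain/content_identity.rs is the source of truth.

// Spans that must contain real bytes so the sampled content hash is
// stable; constants follow the description above.
const HEADER: u64 = 8 * 1024;
const FOOTER: u64 = 8 * 1024;
const SMALL_FILE_LIMIT: u64 = 100 * 1024;

/// Returns (offset, len) pairs for a file of `size` bytes, given the
/// recipe's sample_block_size (default 10 KiB).
fn sampled_spans(size: u64, sample_block: u64) -> Vec<(u64, u64)> {
    if size <= SMALL_FILE_LIMIT {
        return vec![(0, size)]; // small files are hashed in full
    }
    let mut spans = vec![(0, HEADER)];
    // Four inner samples spread through the body (assumed spacing).
    let body = size.saturating_sub(HEADER + FOOTER);
    let step = body / 5;
    for i in 1..=4u64 {
        spans.push((HEADER + i * step, sample_block));
    }
    spans.push((size - FOOTER, FOOTER));
    spans
}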

Scenarios

  • Implement Scenario in benchmarks/src/scenarios/ and register in scenarios/registry.rs.
  • Built-in:
    • indexing-discovery: Adds locations (shallow indexing) and waits for indexing jobs to complete; collects metrics.
    • content-identification: Runs content mode and reports content-only throughput using phase timings (excludes discovery).

Adding a scenario

  • Create benchmarks/src/scenarios/<your_scenario>.rs implementing:
    • name(&self) -> &'static str
    • describe(&self) -> &'static str
    • prepare(&mut self, boot: &CoreBoot, recipe: &Recipe)
    • run(&mut self, boot: &CoreBoot, recipe: &Recipe)
  • Register it in benchmarks/src/scenarios/registry.rs.
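
A minimal sketch of a new scenario, following the signatures listed above. The module paths, return types, and error type are assumptions; copy an existing scenario in benchmarks/src/scenarios/ for the exact trait shape.

use crate::core_boot::CoreBoot;   // assumed module paths
use crate::recipe::Recipe;
use crate::scenarios::Scenario;

// Hypothetical scenario used only for illustration.
pub struct RescanScenario;

impl Scenario for RescanScenario {
    fn name(&self) -> &'static str {
        "rescan"
    }

    fn describe(&self) -> &'static str {
        "Re-indexes existing locations to measure change detection"
    }

    fn prepare(&mut self, _boot: &CoreBoot, _recipe: &Recipe) -> anyhow::Result<()> {
        // e.g. ensure the recipe's locations were indexed once already
        Ok(())
    }

    fn run(&mut self, _boot: &CoreBoot, _recipe: &Recipe) -> anyhow::Result<()> {
        // trigger the work under test, then wait for jobs to finish
        Ok(())
    }
}

Once registered in benchmarks/src/scenarios/registry.rs, the hypothetical scenario above would be runnable via run --scenario rescan.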

Metrics and Phase Timing

  • The indexer logs a formatted summary including phase timings (discovery, processing, content). The bench runner parses these logs (a temporary approach) and produces a ScenarioResult with:
    • duration_s: total duration
    • discovery_duration_s, processing_duration_s, content_duration_s: optional phase timings
    • throughput and counts (files, dirs, total size, errors)
    • raw_artifacts: paths to job logs
  • For content-only benchmarking, use content_duration_s to compute throughput and exclude discovery time.
  • Future: event-driven or structured metrics ingestion to avoid log parsing.
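
For orientation, the result model roughly looks like the sketch below. Field names follow the list above; the identifying fields (recipe, scenario, hardware) are assumed from the CSV columns described later; the canonical type lives in benchmarks/src/metrics/.

#[derive(Debug, serde::Serialize)]
pub struct ScenarioResult {
    pub recipe: String,   // identifying fields assumed from the
    pub scenario: String, // Recipe/Phase/Hardware CSV columns
    pub hardware: String,
    pub duration_s: f64,
    pub discovery_duration_s: Option<f64>,
    pub processing_duration_s: Option<f64>,
    pub content_duration_s: Option<f64>,
    pub files: u64,
    pub dirs: u64,
    pub total_size_bytes: u64,
    pub errors: u64,
    pub files_per_s: f64,
    pub raw_artifacts: Vec<std::path::PathBuf>, // job log paths
}

/// Content-only throughput, excluding discovery time.
pub fn content_throughput(r: &ScenarioResult) -> Option<f64> {
    r.content_duration_s.map(|secs| r.files as f64 / secs)
}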

Reporting

  • JSON reporter writes summaries into a single JSON:
    • benchmarks/src/reporting/json_summary.rs writes { "runs": [ ...ScenarioResult... ] }.
  • Register additional reporters in benchmarks/src/reporting/registry.rs.
  • Planned: Markdown, CSV, HTML.
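
As a taste of the planned reporters, here is a hedged sketch of a Markdown reporter. The actual Reporter trait may take different arguments; the ScenarioResult fields used below follow the sketch in the metrics section.

use crate::{metrics::ScenarioResult, reporting::Reporter}; // assumed paths

pub struct MarkdownReporter;

impl Reporter for MarkdownReporter {
    fn name(&self) -> &'static str {
        "markdown"
    }

    fn write(&self, runs: &[ScenarioResult], out: &std::path::Path) -> anyhow::Result<()> {
        let mut md = String::from("| recipe | scenario | files/s | duration (s) |\n|---|---|---|---|\n");
        for r in runs {
            md.push_str(&format!(
                "| {} | {} | {:.1} | {:.2} |\n",
                r.recipe, r.scenario, r.files_per_s, r.duration_s
            ));
        }
        std::fs::write(out, md)?;
        Ok(())
    }
}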

CSV Reports

  • After producing JSON results (e.g., via run or run-all), generate CSV reports:

    • cargo run -p sd-bench -- results-table --results-dir benchmarks/results --out benchmarks/results/whitepaper_metrics.csv --format csv
  • The CSV format shows all individual benchmark runs with automatic hardware detection:

    • Header: Phase,Hardware,Files_per_s,GB_per_s,Files,Dirs,GB,Errors,Recipe
    • Each row represents one benchmark run
    • Phase names: "Discovery" (indexing-discovery), "Processing" (aggregation), "Content Identification" (content-identification)
    • Hardware labels are automatically detected from the volume where the benchmark was run (e.g., "Internal NVMe SSD", "External HDD (Seagate)")
    • Results are sorted by phase, then hardware, then recipe name
    • The LaTeX document reads ../benchmarks/results/whitepaper_metrics.csv
  • Other supported formats:

    • --format json: Export as JSON (default)
    • --format markdown: Generate a markdown table (useful for documentation)

Core Boot (Isolated)

  • The bench boot uses its own data dir, e.g. ~/Library/Application Support/spacedrive-bench/<scenario> or the system temp dir fallback.
  • Job logging is enabled and sized for benchmarks. Job logs are printed after each run and are included as artifacts in results.
  • A dedicated library is created/used for benchmark runs.

Key Features & Improvements

Automatic Hardware Detection

  • The benchmark suite automatically detects the hardware type of the volume where benchmarks run
  • No need for hardware-specific recipe names or manual tagging
  • Detects: Internal/External NVMe SSD, HDD, SSD, Network Attached Storage
  • Hardware information is included in output filenames and benchmark results
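
A purely illustrative sketch of the detection output: the real implementation reads volume metadata through the core, and the variant names and slugs below (other than nvme, which appears in the example filenames) are assumptions.

// Assumed classification; the core's volume layer supplies the inputs.
pub enum Hardware {
    NvmeSsd { external: bool },
    Ssd { external: bool },
    Hdd { vendor: Option<String> }, // e.g. "Seagate" in the label
    Network,
}

/// Short slug used in output filenames, e.g. {recipe}-{scenario}-nvme.json.
pub fn slug(hw: &Hardware) -> &'static str {
    match hw {
        Hardware::NvmeSsd { .. } => "nvme",
        Hardware::Ssd { .. } => "ssd",
        Hardware::Hdd { .. } => "hdd",
        Hardware::Network => "nas", // assumed slug
    }
}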

Multi-Location, Multi-Scenario Execution

  • Run all benchmark combinations with a single command
  • Automatically generates datasets at each location if needed
  • Output files are named systematically: {recipe}-{scenario}-{hardware}.json
  • Example: shape_small-indexing-discovery-nvme.json

Smart Path Handling

  • The benchdata/ prefix in recipes is handled intelligently
  • Specify /tmp/benchdata as location, and it creates /tmp/benchdata/shape_small (not /tmp/benchdata/benchdata/shape_small)
  • Works seamlessly with external drives and network volumes
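
A sketch of the de-duplicating join this implies (assumed logic: if the location root already ends with the recipe path's leading component, that component is not repeated):

use std::path::{Path, PathBuf};

fn join_dataset_path(location_root: &Path, recipe_path: &Path) -> PathBuf {
    let mut rel = recipe_path.components();
    let first = recipe_path.components().next();
    let last = location_root.components().last();
    if let (Some(f), Some(l)) = (first, last) {
        if f.as_os_str() == l.as_os_str() {
            rel.next(); // drop the duplicated "benchdata" component
        }
    }
    location_root.join(rel.as_path())
}

// join_dataset_path(Path::new("/tmp/benchdata"), Path::new("benchdata/shape_small"))
//   => /tmp/benchdata/shape_small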

Enhanced Reporting

  • CSV reporter shows all individual runs (not aggregated)
  • Results are sorted by phase → hardware → recipe for easy comparison
  • Hardware labels are human-readable (e.g., "External HDD (Seagate)")

Best Practices

  • For comprehensive benchmarking across hardware:
    cargo run -p sd-bench -- run-all \
      --locations "/path/to/nvme" "/Volumes/HDD" "/Volumes/SSD" \
      --recipe-filter "^shape_"
    
  • For fast iteration, use smaller recipes (shape_small.yaml) and content_gen.mode: partial.
  • For realistic content identification, set magic_headers: true and content_gen.mode: partial or full for a subset of files.
  • Keep seeds fixed in CI to avoid result variance.

CI Integration

  • Add a job that runs a tiny recipe end-to-end and uploads the JSON summary artifacts (and optionally logs) for inspection.
  • Suggested command:
    • cargo run -p sd-bench -- run --scenario indexing-discovery --recipe benchmarks/recipes/nvme_tiny.yaml --out-json benchmarks/results/ci-indexing-discovery.json

Troubleshooting

  • “Files look empty / zeros”: ensure your recipe has files.content_gen defined with mode: partial or full, and consider magic_headers: true.
  • “Unknown scenario”: run with --scenario indexing-discovery or add your scenario to scenarios/registry.rs.
  • “No recipes found”: check --recipes-dir path and that files end with .yaml or .yml.

Extending the Suite

  • Add a generator: implement DatasetGenerator in benchmarks/src/generator/, register in generator/registry.rs.
  • Add a reporter: implement Reporter in benchmarks/src/reporting/, register in reporting/registry.rs.
  • Add a scenario: see the Scenarios section above.
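
As with scenarios, a generator is a small type plus a registry entry. A hedged sketch follows; the DatasetGenerator signature here is an assumption, so mirror benchmarks/src/generator/filesystem.rs for the real one.

use crate::{generator::DatasetGenerator, recipe::Recipe}; // assumed paths

// Hypothetical generator that would emit archive-heavy trees.
pub struct ArchiveGenerator;

impl DatasetGenerator for ArchiveGenerator {
    fn name(&self) -> &'static str {
        "archives"
    }

    fn generate(&self, recipe: &Recipe) -> anyhow::Result<()> {
        for loc in &recipe.locations {
            // write synthetic ZIP/TAR files under loc.path ...
            let _ = &loc.path;
        }
        Ok(())
    }
}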

References

  • CLI entrypoint and commands: benchmarks/src/bin/sd-bench-new.rs, benchmarks/src/cli/commands.rs
  • Dataset generation: benchmarks/src/generator/filesystem.rs
  • Recipe schema: benchmarks/src/recipe/schema.rs
  • Scenarios: benchmarks/src/scenarios/
  • Runner: benchmarks/src/runner/mod.rs
  • Metrics: benchmarks/src/metrics/mod.rs
  • Reporting: benchmarks/src/reporting/
  • Isolated core boot: benchmarks/src/core_boot/mod.rs

Future Benchmarks & Roadmap

The suite is designed to grow into a comprehensive performance harness that reflects the whitepaper and system goals.

  • Indexing pipeline

    • Content identification (done): measure content-only throughput using phase timings.
    • Deep indexing: include thumbnail generation and metadata extraction; track throughput and error rates.
    • Rescan/change detection: cold vs warm cache; latency from change to consistency.
  • File operations

    • Copy throughput: large vs small files, overlap detection, progressive copy correctness; bytes/s and resource usage.
    • Delete/cleanup: large tree deletion, DB cleanup cost, vacuum.
    • Validation/integrity: CAS verification throughput; corruption handling.
  • Duplicates & de-duplication

    • Duplicate detection: time to detect N duplicates; content-identity correctness; DB write pressure.
  • Search & querying

    • (If applicable) index build time and query latency (P50/P95); warm vs cold cache comparisons.
  • Media pipeline

    • Thumbnail generation: per-kind throughput; GPU/CPU offload if available.
    • Metadata extraction: EXIF/FFprobe across formats.
  • Networking & transfer

    • Pairing: time-to-pair and success rate under various conditions.
    • Cross-device transfer: LAN/WAN throughput and latency; concurrency sweeps.
  • Volume & system

    • Volume detection and tracking: discovery latency; multi-volume scaling.
    • Disk type profiling: HDD vs NVMe vs network FS; impact on indexing and copy.

Data generation enhancements

  • Media synthesis: small valid PNG/JPG/WebP; short MP4/AAC clips.
  • Rich content sets: archives (ZIP/TAR), PDFs, docs, code, text; symlinks/permissions; nested trees.
  • Change-set support: scripted add/modify/delete to exercise rescan.
  • Ground-truth manifests: emitted metadata (size, hash) to validate correctness.

Metrics & telemetry

  • Structured metrics export from jobs (avoid log parsing).
  • System snapshot per run: CPU/RAM, disk model/FS, OS; thermal state if available.
  • Resource usage: CPU%, RSS/peak, IO bytes/IOPS.

Reporting & analysis

  • Markdown/CSV reporters; baseline-diff mode for regression detection.
  • HTML dashboard for trend charts over time/history.

CLI ergonomics

  • --list-scenarios, --list-reporters; recipe filters; scenario parameters (mode, scope, concurrency).
  • --timeout, --retries, --clean/--reuse; max parallelism; sharding.

CI integration

  • PR smoke tests: tiny recipes for key scenarios; upload JSON/logs.
  • Nightly heavy runs on tagged hardware; publish time-series metrics.
  • Regression gates: fail PRs on significant metric regressions.