Benchmarking Suite

This document explains how to use and extend the benchmarking suite that lives in benchmarks/. It covers concepts, CLI commands, recipe schema, data generation, scenarios, metrics, reporting, CI guidance, and troubleshooting.

Goals

  • Reliable, reproducible performance evaluation of core workflows (e.g., indexing discovery, content identification).
  • Modular architecture: add scenarios, reporters, and data generators without touching the core wiring.
  • CI-friendly: deterministic runs, structured outputs, small quick recipes for PR checks.

Overview

  • benchmarks/ is a standalone Rust crate that provides:

    • CLI binary: sd-bench
    • Dataset generator(s): benchmarks/src/generator/
    • Scenarios: benchmarks/src/scenarios/
    • Runner & metrics: benchmarks/src/runner/, benchmarks/src/metrics/
    • Reporting: benchmarks/src/reporting/
    • Recipes (YAML): benchmarks/recipes/
    • Results (JSON): benchmarks/results/
  • The CLI boots the core in an isolated data directory, enables job logging, creates/opens a dedicated benchmark library if needed, and orchestrates scenario execution.

Installation

  • Requirements: Rust toolchain, workspace builds.
  • Build the bench crate:
    • cargo build -p sd-bench --bin sd-bench

Quickstart

  • Generate the dataset for one recipe:

    • cargo run -p sd-bench -- mkdata --recipe benchmarks/recipes/shape_small.yaml
  • Generate datasets for all recipes in a directory (each dataset defaults to the recipe's locations[].path):

    • cargo run -p sd-bench -- mkdata-all --recipes-dir benchmarks/recipes
  • Generate datasets on an external disk without changing recipes (prefix relative recipe paths):

    • cargo run -p sd-bench -- mkdata-all --recipes-dir benchmarks/recipes --dataset-root /Volumes/YourHDD
  • Run one scenario with one recipe and write a JSON summary:

    • Discovery: cargo run -p sd-bench -- run --scenario indexing-discovery --recipe benchmarks/recipes/shape_small.yaml --out-json benchmarks/results/shape_small-indexing-discovery-nvme.json
    • Content identification: cargo run -p sd-bench -- run --scenario content-identification --recipe benchmarks/recipes/shape_small.yaml --out-json benchmarks/results/shape_small-content-identification-nvme.json
  • Run all scenarios on multiple locations with automatic hardware detection:

    # Run all scenarios (discovery, aggregation, content-id) on both NVMe and HDD
    cargo run -p sd-bench -- run-all --locations "/tmp/benchdata" "/Volumes/Seagate/benchdata"
    
    # Run specific scenarios on multiple locations
    cargo run -p sd-bench -- run-all \
      --scenarios indexing-discovery aggregation \
      --locations "/Users/me/benchdata" "/Volumes/HDD/benchdata" "/Volumes/SSD/benchdata"
    
    # Filter to only shape recipes
    cargo run -p sd-bench -- run-all \
      --locations "/tmp/benchdata" "/Volumes/Seagate/benchdata" \
      --recipe-filter "^shape_"
    
  • Generate CSV reports from JSON summaries:

    • cargo run -p sd-bench -- results-table --results-dir benchmarks/results --out benchmarks/results/whitepaper_metrics.csv --format csv

The CLI always prints a brief stdout summary and (if applicable) the path to the generated JSON. It also prints job log paths for later inspection.

Commands

  • mkdata --recipe <path> [--dataset-root <path>]
    • Generates a dataset based on a YAML recipe (see Recipe Schema below).
    • With --dataset-root, any relative locations[].path in the recipe is prefixed with this path (absolute paths are left unchanged). Useful for targeting an external HDD.
  • mkdata-all [--recipes-dir <dir>] [--dataset-root <path>] [--recipe-filter <regex>]
    • Scans a directory for .yaml / .yml and runs mkdata for each file.
    • --dataset-root prefixes relative locations[].path as above.
    • --recipe-filter filters recipe files by filename (regex applied to file stem), e.g. ^hdd_.
  • run --scenario <name> --recipe <path> [--out-json <path>] [--dataset-root <path>]
    • Boots an isolated core, ensures a benchmark library, adds recipe locations, waits for jobs to finish.
    • Summarizes metrics to stdout; optionally writes JSON summary at --out-json.
    • --dataset-root prefixes relative locations[].path at runtime (absolute paths untouched).
  • run-all [--scenarios <names...>] [--locations <paths...>] [--recipes-dir <dir>] [--out-dir <dir>] [--skip-generate] [--recipe-filter <regex>]
    • Runs every combination of scenarios × locations × recipes, automatically detecting the hardware type of each volume.
    • --scenarios: Optional list of scenarios to run. If not specified, runs all: indexing-discovery, aggregation, content-identification.
    • --locations: List of paths where datasets should be generated/benchmarked. Hardware type is automatically detected from the volume (e.g., NVMe, HDD, SSD).
    • Output files are automatically named: {recipe}-{scenario}-{hardware}.json (e.g., shape_small-indexing-discovery-nvme.json).
    • With --skip-generate, no datasets are generated; they must already exist at the given locations.
    • --recipe-filter selects a subset of recipes by regex on filename stem (e.g., ^shape_ for shape recipes only).
    • The system automatically handles the benchdata/ prefix in recipes, so you can specify /tmp/benchdata and it will create /tmp/benchdata/shape_small etc.

Architecture

  • Thin binary: benchmarks/src/bin/sd-bench-new.rs delegates to benchmarks/src/cli/commands.rs.
  • Core modules exported via benchmarks/src/mod_new.rs:
    • generator/ (dataset generation)
    • scenarios/ (Scenario trait implementations)
    • runner/ (orchestration & report emission)
    • metrics/ (result model and phase timings)
    • reporting/ (reporters like JSON)
    • core_boot/ (isolated core boot + job logging)
    • recipe/ (schema + validation)
    • util/ (helpers)

Recipe Schema

YAML schema (see benchmarks/recipes/*.yaml). Recipe names no longer need hardware prefixes; hardware is auto-detected from the volume. Example:

name: shape_small
seed: 12345
locations:
  - path: benchdata/shape_small  # Note: 'benchdata/' prefix is handled automatically
    structure:
      depth: 2
      fanout_per_dir: 8
    files:
      total: 5000
      size_buckets:
        small: { range: [4096, 131072], share: 0.6 }
        medium: { range: [1048576, 5242880], share: 0.3 }
        large: { range: [5242880, 10485760], share: 0.1 }
      extensions: [pdf, zip, jpg, txt]
      duplicate_ratio: 0.1
      content_gen:
        mode: partial # zeros | partial | full
        sample_block_size: 10240 # 10 KiB; aligns with content hashing sample size
        magic_headers: true # write registry-derived magic bytes
media:
  generate_thumbnails: false

Desktop-Scale Recipes

For testing realistic desktop scenarios, including job resumption and long-running indexing operations:

desktop_complex.yaml - Realistic desktop environment (500k files, 8 levels deep):

name: desktop_complex
seed: 42424242
locations:
  - path: benchdata/desktop_complex
    structure:
      depth: 8  # Deep nesting like real file systems
      fanout_per_dir: 25  # Many directories per level
    files:
      total: 500000  # Half million files - realistic desktop scale
      size_buckets:
        tiny: { range: [0, 4096], share: 0.25 }
        small: { range: [4096, 1048576], share: 0.35 }
        medium: { range: [1048576, 50000000], share: 0.25 }
        large: { range: [50000000, 500000000], share: 0.10 }
        huge: { range: [500000000, 4000000000], share: 0.05 }
      extensions: [txt, md, pdf, jpg, png, mp4, zip, py, js, rs]  # ... many more in the full recipe
      duplicate_ratio: 0.15
      content_gen:
        mode: partial
        sample_block_size: 10240
        magic_headers: true

desktop_extreme.yaml - Power user environment (1M files, 12 levels deep):

  • 1,000,000 files across 12 directory levels
  • Comprehensive file type coverage (100+ extensions)
  • Realistic size distribution including very large files (up to 8GB)
  • 20% duplicate ratio for realistic backup/copy scenarios

Fields

  • name: logical recipe name.
  • seed: RNG seed (deterministic runs). If omitted, one is derived from entropy.
  • locations[]:
    • path: base directory for generated files.
    • structure.depth: max nested subdirectory depth (randomized per file up to this depth).
    • structure.fanout_per_dir: number of subdirectory options at each level.
    • files.total: total files per location (before duplicates).
    • files.size_buckets: map of bucket name => { range: [min, max], share }; shares are normalized.
    • files.extensions: file extension sampling pool (e.g., [pdf, zip, jpg, txt]).
    • files.duplicate_ratio: fraction of files created as duplicates (hardlinked where possible, copied as a fallback).
    • files.content_gen:
      • mode:
        • zeros: sparse file; fast; not realistic for content identification.
        • partial: writes header + evenly spaced samples + footer; gaps remain sparse zeros; matches content hashing sampling points.
        • full: fills the entire file with deterministic bytes; slowest, most realistic.
      • sample_block_size: size of each inner sample block (default 10 KiB). Leave at 10 KiB to match the content hashing algorithm.
      • magic_headers: if true, writes file signature patterns based on the file_type registry for the chosen extension.
  • media: reserved for future synthetic media generation; currently optional and a no-op.
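
The fields above map naturally onto serde structs. The sketch below is illustrative only; the authoritative definitions live in benchmarks/src/recipe/schema.rs and may differ in naming and optionality.

use serde::Deserialize;
use std::{collections::HashMap, path::PathBuf};

// Illustrative mirror of the YAML schema; not the canonical definition.
#[derive(Debug, Deserialize)]
struct Recipe {
    name: String,
    seed: Option<u64>, // derived from entropy when omitted
    locations: Vec<LocationSpec>,
}

#[derive(Debug, Deserialize)]
struct LocationSpec {
    path: PathBuf, // relative paths honor --dataset-root
    structure: Structure,
    files: FilesSpec,
}

#[derive(Debug, Deserialize)]
struct Structure {
    depth: u32,          // max nesting depth, randomized per file
    fanout_per_dir: u32, // subdirectory options per level
}

#[derive(Debug, Deserialize)]
struct FilesSpec {
    total: u64,
    size_buckets: HashMap<String, SizeBucket>,
    extensions: Vec<String>,
    duplicate_ratio: f64,
    content_gen: Option<ContentGen>,
}

#[derive(Debug, Deserialize)]
struct SizeBucket {
    range: (u64, u64), // [min, max] in bytes
    share: f64,        // normalized across buckets at load time
}

#[derive(Debug, Deserialize)]
struct ContentGen {
    mode: String, // "zeros" | "partial" | "full"
    sample_block_size: Option<u64>,
    magic_headers: Option<bool>,
}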

Content Generation Details

  • The generator can write content that aligns with the content hash sampling algorithm in src/domain/content_identity.rs:
    • For large files (> 100 KiB):
      • Includes file size (handled by the hash function).
      • Hashes a header (8 KiB), 4 evenly spaced inner samples (default 10 KiB each), and a footer (8 KiB).
    • For small files: full-content hashing.
  • partial mode writes the header/samples/footer only (deterministic pseudo-random bytes), leaving gaps as sparse zeros. This yields realistic, stable hashes without full writes.
  • full mode writes deterministic content for the entire file for maximum realism.
  • magic_headers: true uses sd_core::file_type::FileTypeRegistry to write magic byte signatures for the chosen extension when available.
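
To make the alignment concrete, the sketch below computes the byte spans that partial mode needs to fill with real data, using the figures above (8 KiB header/footer, four inner samples, 100 KiB small-file cutoff). The even spacing of the inner samples is an assumption; src/domain/content_identity.rs is the source of truth.

// Spans that must contain real bytes so the sampled content hash is
// stable; constants follow the description above.
const HEADER: u64 = 8 * 1024;
const FOOTER: u64 = 8 * 1024;
const SMALL_FILE_LIMIT: u64 = 100 * 1024;

/// Returns (offset, len) pairs for a file of `size` bytes, given the
/// recipe's sample_block_size (default 10 KiB).
fn sampled_spans(size: u64, sample_block: u64) -> Vec<(u64, u64)> {
    if size <= SMALL_FILE_LIMIT {
        return vec![(0, size)]; // small files are hashed in full
    }
    let mut spans = vec![(0, HEADER)];
    // Four inner samples spread through the body (assumed spacing).
    let body = size.saturating_sub(HEADER + FOOTER);
    let step = body / 5;
    for i in 1..=4u64 {
        spans.push((HEADER + i * step, sample_block));
    }
    spans.push((size - FOOTER, FOOTER));
    spans
}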

Scenarios

  • Implement Scenario in benchmarks/src/scenarios/ and register in scenarios/registry.rs.
  • Built-in:
    • indexing-discovery: Adds locations (shallow indexing) and waits for indexing jobs to complete; collects metrics.
    • content-identification: Runs content mode and reports content-only throughput using phase timings (excludes discovery).

Adding a scenario

  • Create benchmarks/src/scenarios/<your_scenario>.rs implementing:
    • name(&self) -> &'static str
    • describe(&self) -> &'static str
    • prepare(&mut self, boot: &CoreBoot, recipe: &Recipe)
    • run(&mut self, boot: &CoreBoot, recipe: &Recipe)
  • Register it in benchmarks/src/scenarios/registry.rs.
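
A minimal sketch of a new scenario, following the signatures listed above. The module paths, return types, and error type are assumptions; copy an existing scenario in benchmarks/src/scenarios/ for the exact trait shape.

use crate::core_boot::CoreBoot;   // assumed module paths
use crate::recipe::Recipe;
use crate::scenarios::Scenario;

// Hypothetical scenario used only for illustration.
pub struct RescanScenario;

impl Scenario for RescanScenario {
    fn name(&self) -> &'static str {
        "rescan"
    }

    fn describe(&self) -> &'static str {
        "Re-indexes existing locations to measure change detection"
    }

    fn prepare(&mut self, _boot: &CoreBoot, _recipe: &Recipe) -> anyhow::Result<()> {
        // e.g. ensure the recipe's locations were indexed once already
        Ok(())
    }

    fn run(&mut self, _boot: &CoreBoot, _recipe: &Recipe) -> anyhow::Result<()> {
        // trigger the work under test, then wait for jobs to finish
        Ok(())
    }
}

Once registered in benchmarks/src/scenarios/registry.rs, the hypothetical scenario above would be runnable via run --scenario rescan.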

Metrics and Phase Timing

  • The indexer logs a formatted summary including phase timings (discovery, processing, content). The bench runner parses these logs (a temporary approach) and produces a ScenarioResult with:
    • duration_s: total duration
    • discovery_duration_s, processing_duration_s, content_duration_s: optional phase timings
    • throughput and counts (files, dirs, total size, errors)
    • raw_artifacts: paths to job logs
  • For content-only benchmarking, use content_duration_s to compute throughput and exclude discovery time.
  • Future: event-driven or structured metrics ingestion to avoid log parsing.
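
For orientation, the result model roughly looks like the sketch below. Field names follow the list above; the identifying fields (recipe, scenario, hardware) are assumed from the CSV columns described later; the canonical type lives in benchmarks/src/metrics/.

#[derive(Debug, serde::Serialize)]
pub struct ScenarioResult {
    pub recipe: String,   // identifying fields assumed from the
    pub scenario: String, // Recipe/Phase/Hardware CSV columns
    pub hardware: String,
    pub duration_s: f64,
    pub discovery_duration_s: Option<f64>,
    pub processing_duration_s: Option<f64>,
    pub content_duration_s: Option<f64>,
    pub files: u64,
    pub dirs: u64,
    pub total_size_bytes: u64,
    pub errors: u64,
    pub files_per_s: f64,
    pub raw_artifacts: Vec<std::path::PathBuf>, // job log paths
}

/// Content-only throughput, excluding discovery time.
pub fn content_throughput(r: &ScenarioResult) -> Option<f64> {
    r.content_duration_s.map(|secs| r.files as f64 / secs)
}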

Reporting

  • JSON reporter writes summaries into a single JSON:
    • benchmarks/src/reporting/json_summary.rs writes { "runs": [ ...ScenarioResult... ] }.
  • Register additional reporters in benchmarks/src/reporting/registry.rs.
  • Planned: Markdown, CSV, HTML.
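
As a taste of the planned reporters, here is a hedged sketch of a Markdown reporter. The actual Reporter trait may take different arguments; the ScenarioResult fields used below follow the sketch in the metrics section.

use crate::{metrics::ScenarioResult, reporting::Reporter}; // assumed paths

pub struct MarkdownReporter;

impl Reporter for MarkdownReporter {
    fn name(&self) -> &'static str {
        "markdown"
    }

    fn write(&self, runs: &[ScenarioResult], out: &std::path::Path) -> anyhow::Result<()> {
        let mut md = String::from("| recipe | scenario | files/s | duration (s) |\n|---|---|---|---|\n");
        for r in runs {
            md.push_str(&format!(
                "| {} | {} | {:.1} | {:.2} |\n",
                r.recipe, r.scenario, r.files_per_s, r.duration_s
            ));
        }
        std::fs::write(out, md)?;
        Ok(())
    }
}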

CSV Reports

  • After producing JSON results (e.g., via run or run-all), generate CSV reports:

    • cargo run -p sd-bench -- results-table --results-dir benchmarks/results --out benchmarks/results/whitepaper_metrics.csv --format csv
  • The CSV format shows all individual benchmark runs with automatic hardware detection:

    • Header: Phase,Hardware,Files_per_s,GB_per_s,Files,Dirs,GB,Errors,Recipe
    • Each row represents one benchmark run
    • Phase names: "Discovery" (indexing-discovery), "Processing" (aggregation), "Content Identification" (content-identification)
    • Hardware labels are automatically detected from the volume where the benchmark was run (e.g., "Internal NVMe SSD", "External HDD (Seagate)")
    • Results are sorted by phase, then hardware, then recipe name
    • The LaTeX document reads ../benchmarks/results/whitepaper_metrics.csv
  • Other supported formats:

    • --format json: Export as JSON (default)
    • --format markdown: Generate a markdown table (useful for documentation)

Core Boot (Isolated)

  • The bench boot uses its own data dir, e.g. ~/Library/Application Support/spacedrive-bench/<scenario> or the system temp dir fallback.
  • Job logging is enabled and sized for benchmarks. Job logs are printed after each run and are included as artifacts in results.
  • A dedicated library is created/used for benchmark runs.

Key Features & Improvements

Automatic Hardware Detection

  • The benchmark suite automatically detects the hardware type of the volume where benchmarks run
  • No need for hardware-specific recipe names or manual tagging
  • Detects: Internal/External NVMe SSD, HDD, SSD, Network Attached Storage
  • Hardware information is included in output filenames and benchmark results
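
A purely illustrative sketch of the detection output: the real implementation reads volume metadata through the core, and the variant names and slugs below (other than nvme, which appears in the example filenames) are assumptions.

// Assumed classification; the core's volume layer supplies the inputs.
pub enum Hardware {
    NvmeSsd { external: bool },
    Ssd { external: bool },
    Hdd { vendor: Option<String> }, // e.g. "Seagate" in the label
    Network,
}

/// Short slug used in output filenames, e.g. {recipe}-{scenario}-nvme.json.
pub fn slug(hw: &Hardware) -> &'static str {
    match hw {
        Hardware::NvmeSsd { .. } => "nvme",
        Hardware::Ssd { .. } => "ssd",
        Hardware::Hdd { .. } => "hdd",
        Hardware::Network => "nas", // assumed slug
    }
}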

Multi-Location, Multi-Scenario Execution

  • Run all benchmark combinations with a single command
  • Automatically generates datasets at each location if needed
  • Output files are named systematically: {recipe}-{scenario}-{hardware}.json
  • Example: shape_small-indexing-discovery-nvme.json

Smart Path Handling

  • The benchdata/ prefix in recipes is handled intelligently
  • Specify /tmp/benchdata as location, and it creates /tmp/benchdata/shape_small (not /tmp/benchdata/benchdata/shape_small)
  • Works seamlessly with external drives and network volumes
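
A sketch of the de-duplicating join this implies (assumed logic: if the location root already ends with the recipe path's leading component, that component is not repeated):

use std::path::{Path, PathBuf};

fn join_dataset_path(location_root: &Path, recipe_path: &Path) -> PathBuf {
    let mut rel = recipe_path.components();
    let first = recipe_path.components().next();
    let last = location_root.components().last();
    if let (Some(f), Some(l)) = (first, last) {
        if f.as_os_str() == l.as_os_str() {
            rel.next(); // drop the duplicated "benchdata" component
        }
    }
    location_root.join(rel.as_path())
}

// join_dataset_path(Path::new("/tmp/benchdata"), Path::new("benchdata/shape_small"))
//   => /tmp/benchdata/shape_small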

Enhanced Reporting

  • CSV reporter shows all individual runs (not aggregated)
  • Results are sorted by phase → hardware → recipe for easy comparison
  • Hardware labels are human-readable (e.g., "External HDD (Seagate)")

Best Practices

  • For comprehensive benchmarking across hardware:
    cargo run -p sd-bench -- run-all \
      --locations "/path/to/nvme" "/Volumes/HDD" "/Volumes/SSD" \
      --recipe-filter "^shape_"
    
  • For fast iteration, use smaller recipes (shape_small.yaml) and content_gen.mode: partial.
  • For realistic content identification, set magic_headers: true and content_gen.mode: partial or full for a subset of files.
  • Keep seeds fixed in CI to avoid result variance.

CI Integration

  • Add a job that runs a tiny recipe end-to-end and uploads the JSON summary artifacts (and optionally logs) for inspection.
  • Suggested command:
    • cargo run -p sd-bench -- run --scenario indexing-discovery --recipe benchmarks/recipes/nvme_tiny.yaml --out-json benchmarks/results/ci-indexing-discovery.json

Troubleshooting

  • “Files look empty / zeros”: ensure your recipe has files.content_gen defined with mode: partial or full, and consider magic_headers: true.
  • “Unknown scenario”: run with --scenario indexing-discovery or add your scenario to scenarios/registry.rs.
  • “No recipes found”: check --recipes-dir path and that files end with .yaml or .yml.

Extending the Suite

  • Add a generator: implement DatasetGenerator in benchmarks/src/generator/, register in generator/registry.rs.
  • Add a reporter: implement Reporter in benchmarks/src/reporting/, register in reporting/registry.rs.
  • Add a scenario: see the Scenarios section above.
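
As with scenarios, a generator is a small type plus a registry entry. A hedged sketch follows; the DatasetGenerator signature here is an assumption, so mirror benchmarks/src/generator/filesystem.rs for the real one.

use crate::{generator::DatasetGenerator, recipe::Recipe}; // assumed paths

// Hypothetical generator that would emit archive-heavy trees.
pub struct ArchiveGenerator;

impl DatasetGenerator for ArchiveGenerator {
    fn name(&self) -> &'static str {
        "archives"
    }

    fn generate(&self, recipe: &Recipe) -> anyhow::Result<()> {
        for loc in &recipe.locations {
            // write synthetic ZIP/TAR files under loc.path ...
            let _ = &loc.path;
        }
        Ok(())
    }
}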

References

  • CLI entrypoint and commands: benchmarks/src/bin/sd-bench-new.rs, benchmarks/src/cli/commands.rs
  • Dataset generation: benchmarks/src/generator/filesystem.rs
  • Recipe schema: benchmarks/src/recipe/schema.rs
  • Scenarios: benchmarks/src/scenarios/
  • Runner: benchmarks/src/runner/mod.rs
  • Metrics: benchmarks/src/metrics/mod.rs
  • Reporting: benchmarks/src/reporting/
  • Isolated core boot: benchmarks/src/core_boot/mod.rs

Future Benchmarks & Roadmap

The suite is designed to grow into a comprehensive performance harness that reflects the whitepaper and system goals.

  • Indexing pipeline

    • Content identification (done): measure content-only throughput using phase timings.
    • Deep indexing: include thumbnail generation and metadata extraction; track throughput and error rates.
    • Rescan/change detection: cold vs warm cache; latency from change to consistency.
  • File operations

    • Copy throughput: large vs small files, overlap detection, progressive copy correctness; bytes/s and resource usage.
    • Delete/cleanup: large tree deletion, DB cleanup cost, vacuum.
    • Validation/integrity: CAS verification throughput; corruption handling.
  • Duplicates & de-duplication

    • Duplicate detection: time to detect N duplicates; content-identity correctness; DB write pressure.
  • Search & querying

    • (If applicable) index build time and query latency (P50/P95); warm vs cold cache comparisons.
  • Media pipeline

    • Thumbnail generation: per-kind throughput; GPU/CPU offload if available.
    • Metadata extraction: EXIF/FFprobe across formats.
  • Networking & transfer

    • Pairing: time-to-pair and success rate under various conditions.
    • Cross-device transfer: LAN/WAN throughput and latency; concurrency sweeps.
  • Volume & system

    • Volume detection and tracking: discovery latency; multi-volume scaling.
    • Disk type profiling: HDD vs NVMe vs network FS; impact on indexing and copy.

Data generation enhancements

  • Media synthesis: small valid PNG/JPG/WebP; short MP4/AAC clips.
  • Rich content sets: archives (ZIP/TAR), PDFs, docs, code, text; symlinks/permissions; nested trees.
  • Change-set support: scripted add/modify/delete to exercise rescan.
  • Ground-truth manifests: emitted metadata (size, hash) to validate correctness.

Metrics & telemetry

  • Structured metrics export from jobs (avoid log parsing).
  • System snapshot per run: CPU/RAM, disk model/FS, OS; thermal state if available.
  • Resource usage: CPU%, RSS/peak, IO bytes/IOPS.

Reporting & analysis

  • Markdown/CSV reporters; baseline-diff mode for regression detection.
  • HTML dashboard for trend charts over time/history.

CLI ergonomics

  • --list-scenarios, --list-reporters; recipe filters; scenario parameters (mode, scope, concurrency).
  • --timeout, --retries, --clean/--reuse; max parallelism; sharding.

CI integration

  • PR smoke tests: tiny recipes for key scenarios; upload JSON/logs.
  • Nightly heavy runs on tagged hardware; publish time-series metrics.
  • Regression gates: fail PRs on significant metric regressions.