docs: Remove Agent Manager Design Document and Update Whitepaper

- Deleted the Agent Manager Design document to streamline documentation and focus on the new extension-based agent architecture.
- Updated the whitepaper to reflect the transition to an extension-based agent architecture, detailing the capabilities of specialized AI agents implemented as WASM extensions.
- Revised sections to emphasize the event-driven processing, memory systems, and safety mechanisms of the new agent architecture.
- Enhanced clarity on the integration of agents within the VDFS and their roles in proactive file management and user assistance.
This commit is contained in:
Jamie Pine 2025-10-11 01:21:36 -07:00
parent eb06c0a5cb
commit 37772b7e2c
5 changed files with 67 additions and 296 deletions

View File

@@ -1,235 +0,0 @@
# Agent Manager Design
## Overview
The **Agent Manager** is a core component of Spacedrive's AI-native architecture. It is responsible for creating, managing, and orchestrating AI agents that can perform intelligent file management tasks. These agents translate natural language commands into verifiable operations and can proactively suggest optimizations based on user behavior.
Agents execute tasks by generating commands for Spacedrive's native **Command Line Interface (CLI)**, which is bundled with the application and communicates with the running core via a secure IPC channel. This ensures all AI-driven operations are subject to the same safety and permission constraints as those performed by a human user.
## Core Components
- **Agent Manager:** The central singleton responsible for the agent lifecycle. It initializes agents, assigns them tasks, and manages their access to system resources.
- **Agent Trait:** A common Rust `trait` that defines the capabilities of any agent, such as receiving a task, processing it, and reporting the outcome (a sketch follows this list).
- **LLM Provider Interface:** A pluggable interface for communicating with different Large Language Models, supporting both local providers (like Ollama) and cloud-based services.
- **Execution Coordinator:** This component is responsible for spawning the `sd` CLI sidecar process with the commands formulated by an agent. It manages the process lifecycle and captures its output (stdout/stderr) for the agent to observe and parse.
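The design does not fix an exact trait signature; a minimal sketch of the `Agent` trait, assuming the `async_trait` crate and the `AgentTask`/`TaskStatus` types defined under Key Data Structures below, might look like:

```rust
use async_trait::async_trait;
use uuid::Uuid;

/// A hedged sketch only; the real trait may expose finer-grained hooks.
#[async_trait]
pub trait Agent: Send + Sync {
    /// Stable identifier, used for the agent's memory directory and audit entries.
    fn id(&self) -> Uuid;

    /// Accept a task from the Agent Manager, drive it through the agentic
    /// loop, and report the final outcome.
    async fn run(&mut self, task: AgentTask) -> TaskStatus;
}
```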
## Agentic Loop (Observe-Orient-Decide-Act)
Each agent operates on a continuous loop, enabling it to handle complex, multi-step tasks.
1. **Observe:** The agent analyzes the current state. This includes the initial user query, the file system state (via the VDFS index), and the results of previous actions. A key part of observation is **parsing the `stdout` from previously executed CLI commands** and, if necessary, querying the audit log via `sd log` to confirm the outcome.
2. **Orient:** The agent uses an LLM to interpret the observed state and understand the context of its task.
3. **Decide:** The agent formulates a plan, which may consist of one or more steps. For example, to "archive old projects," the plan would be to first _find_ the projects and then _move_ them.
4. **Act:** The agent executes the next step in its plan. In this architecture, "acting" means **generating a specific, validated command for the `sd` CLI and instructing the Execution Coordinator to run it.**
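Putting the four steps together, a hedged sketch of the loop driver (`Decision`, `AgentState`, and the `plan_next_step` helper are hypothetical names, not the shipped API):

```rust
/// Hypothetical outcome of one Orient/Decide pass.
pub enum Decision {
    RunCommand(String),           // a validated `sd` CLI invocation
    NeedsApproval(ActionPreview), // pause for user confirmation
    Finished,
}

async fn run_agentic_loop(agent: &mut AgentState, coordinator: &ExecutionCoordinator) -> TaskStatus {
    loop {
        // Observe: parse stdout of prior commands; consult `sd log` if needed.
        let observation = agent.observe();
        // Orient + Decide: the LLM interprets the observation and plans a step.
        match agent.plan_next_step(&observation).await {
            Decision::RunCommand(cmd) => {
                // Act: the coordinator spawns the `sd` CLI sidecar with `cmd`.
                let output = coordinator.run(&cmd).await;
                agent.record(cmd, output);
            }
            Decision::NeedsApproval(preview) => return TaskStatus::AwaitingConfirmation(preview),
            Decision::Finished => return TaskStatus::Completed,
        }
    }
}
```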
## LLM Integration and Tooling
### CLI as the Primary Tool Interface
The integration between the AI layer and the Spacedrive core is intentionally decoupled for security and modularity. Instead of direct function calls, agents use the `sd` CLI as their sole tool.
- **Bundling and Availability:** The `sd` CLI is packaged as a **Tauri sidecar** and bundled with every Spacedrive installation. This ensures that the Rust core can always locate and execute a compatible version of the CLI.
- **IPC Communication:** The architecture uses a daemon-client model. The main Spacedrive core runs a **JSON-RPC server** on a local socket. When the Execution Coordinator spawns the `sd` CLI process, the CLI detects the running daemon, connects as a client, and transmits its command for execution.
- **Schema-Aware Prompting:** The Agent Manager automatically provides the LLM with the CLI's schema, help text, and usage examples within the context prompt. This **few-shot learning** approach enables the LLM to correctly format commands for the available tools without requiring specialized fine-tuning (a prompt-assembly sketch follows this list).
- **Safety and Sandboxing:** This model creates a robust security boundary. The LLM's capabilities are strictly limited to what the CLI exposes, which in turn is governed by the safe, verifiable, and auditable **Transactional Action System**. The AI has no direct access to internal state or the database.
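As one illustration of schema-aware prompting, a self-contained sketch of how the context prompt could be assembled (the function name and prompt wording are assumptions, not the shipped implementation):

```rust
/// Build the LLM context prompt from the CLI's schema/help text plus a few
/// worked examples, so the model can emit well-formed `sd` commands.
fn build_system_prompt(cli_help: &str, usage_examples: &[&str], user_request: &str) -> String {
    let mut prompt = String::from(
        "You are a file-management agent. Your ONLY tool is the `sd` CLI.\n\
         Reply with exactly one `sd` command per turn.\n\n## CLI schema and help\n",
    );
    prompt.push_str(cli_help);
    prompt.push_str("\n\n## Examples\n");
    for example in usage_examples {
        prompt.push_str(example);
        prompt.push('\n');
    }
    prompt.push_str("\n## Task\n");
    prompt.push_str(user_request);
    prompt
}
```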
## Key Data Structures
```rust
use uuid::Uuid;

// `ActionPreview` is defined by the Transactional Action System.

/// Represents a task assigned to an agent by the manager.
pub struct AgentTask {
    pub id: Uuid,
    pub user_prompt: String,            // e.g., "organize my tax documents"
    pub status: TaskStatus,
    pub commands_executed: Vec<String>, // Log of CLI commands run
    pub results: String,                // Final summary for the user
}

/// The current status of an agent's task.
pub enum TaskStatus {
    Pending,
    Running,
    AwaitingConfirmation(ActionPreview), // Paused for user approval
    Completed,
    Failed(String),
}
```
## Security and Sandboxing
Security is a primary design consideration. By forcing all AI-driven operations through the CLI, we ensure the LLM operates within the same permission and validation boundaries as a human user. It has no direct access to internal state or database connections, significantly reducing the attack surface. Every action taken by an agent is auditable through the standard system log, just like any other action.
### Agent Long-Term Memory
To enable learning and context retention across tasks, each agent instance is provided with its own private, persistent memory space managed directly by Spacedrive.
- **Memory as a Virtual Directory:** When an agent is first initialized, the Agent Manager creates a dedicated, hidden directory for it within the `.sdlibrary` package (e.g., `.sdlibrary/agents/<agent-id>/`). This directory is the agent's exclusive long-term memory store (an initialization sketch follows this list).
- **Structured Memory Files:** Within this directory, an agent maintains a set of files to store different types of information:
- `scratchpad.md`: For short-term thoughts and planning during a multi-step task.
- `conversation_history.json`: A log of past interactions and outcomes, helping it understand user intent over time.
- `learned_preferences.toml`: A file to store inferred user preferences (e.g., `default_export_format = "png"` or `project_archive_location = "sd://nas/archive/"`).
- **Access via CLI:** The agent interacts with its own memory using the standard `sd` CLI, but with special permissions that restrict file operations to its own sandboxed directory. This allows it to "remember" past actions and user feedback to improve its performance on future tasks.
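A minimal sketch of first-run memory initialization, assuming `library_root` points at the `.sdlibrary` package; the function name and placeholder contents are illustrative:

```rust
use std::fs;
use std::path::{Path, PathBuf};
use uuid::Uuid;

/// Create the agent's private memory directory and its structured files.
fn init_agent_memory(library_root: &Path, agent_id: Uuid) -> std::io::Result<PathBuf> {
    let memory_dir = library_root.join("agents").join(agent_id.to_string());
    fs::create_dir_all(&memory_dir)?;
    for file in ["scratchpad.md", "conversation_history.json", "learned_preferences.toml"] {
        let path = memory_dir.join(file);
        if !path.exists() {
            fs::write(&path, "")?; // empty placeholder until the agent writes
        }
    }
    Ok(memory_dir)
}
```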
### Agent Permissions and Scopes
To ensure security and user control, agents do not have unrestricted access to the entire VDFS. Each agent operates within a clearly defined **permission scope** for the duration of its task.
- **Scoped Access by Default:** Before an agent is activated, the Agent Manager, in conjunction with the user's request, defines its access rights. The user is prompted for consent if the required permissions are sensitive. For example: _"This agent needs read/write access to your 'Photos' Location to proceed. Allow?"_
- **Types of Permission Scopes** (see the sketch after this list):
- **Location-Based Scope:** The agent can only read or write within specified Locations (e.g., an "Ingestion Sorter" agent may only access `~/Downloads` and `~/Documents`).
- **Tag-Based Scope:** The agent's operations are restricted to files matching a certain tag (e.g., an agent can only modify files tagged `#ProjectX`).
- **Ephemeral Scope:** The agent is only granted permission to operate on a specific list of files returned from an initial search query.
- **Enforcement by the Core:** These permissions are not merely advisory. When the daemon receives a command from an agent-initiated CLI process, the **Transactional Action System** first verifies that the target of the action (e.g., the source and destination paths) falls within the agent's authorized scope. If it does not, the action is rejected before execution.
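One way the scope model and the enforcement check could be expressed; the names below are illustrative, and the tag- and ephemeral-scoped checks require index lookups that are elided:

```rust
use std::path::{Path, PathBuf};
use uuid::Uuid;

pub enum PermissionScope {
    /// Read/write restricted to the given Locations (e.g., ~/Downloads).
    Locations(Vec<PathBuf>),
    /// Operations restricted to files carrying a specific tag (e.g., "ProjectX").
    Tag(String),
    /// Only the specific Entry IDs returned by an initial search query.
    Ephemeral(Vec<Uuid>),
}

impl PermissionScope {
    /// Called by the Transactional Action System before executing any
    /// agent-initiated action; out-of-scope targets are rejected up front.
    pub fn permits_path(&self, target: &Path) -> bool {
        match self {
            PermissionScope::Locations(roots) => roots.iter().any(|root| target.starts_with(root)),
            // Tag- and Ephemeral-scoped checks resolve the target against the
            // VDFS index rather than the path alone (elided here).
            _ => false,
        }
    }
}
```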
### The Agentic Loop with User-in-the-Loop Approval
The agent's primary strength is its ability to perform complex discovery and planning, culminating in a single, large-scale action plan that the user can approve. This leverages the **Transactional Action System's** "preview-before-commit" philosophy, ensuring the user is always in control.
**User Prompt:** _"Organize all my photos tagged 'Hawaii Vacation' into folders by year."_
1. **Discovery and Planning (Observe & Orient):** The agent first runs a series of non-destructive "read" operations to gather all necessary information.
- It finds all relevant files: `sd search --tag "Hawaii Vacation" --type image`
- It then iterates through the results, fetching metadata for each: `sd meta get <path> --select exif.date_time`
- During this phase, the agent builds a complete plan in its internal memory (its "scratchpad"). It determines that 50 files need to be moved into three new directories (`2022`, `2023`, `2024`).
2. **Formulate a Batch Action (Decide):** Instead of executing 50 individual moves, the agent constructs a **single batch `FileMoveAction`**. This action encapsulates the entire plan: moving all 50 source files to their calculated final destinations.
3. **Generate a Preview for Approval (Act & Await):** This is the key step. The agent's first "Act" is not to execute the move, but to ask the system to **simulate** it.
- It submits the batch action to the **Action System** with a preview flag.
- The system returns a detailed `ActionPreview`, showing exactly what will happen: which folders will be created, which files will be moved, and that no files will be overwritten.
- The agent's task status now changes to `AwaitingConfirmation`, and it presents this preview to the user in the UI.
4. **Execute on User Approval (Final Act):** Once the user reviews the plan and clicks "Confirm," the agent is notified. Only then does it commit the batch action to the durable job queue for execution. The user's approval is the explicit trigger for the final, decisive action.
This workflow is directly supported by the system's data structures, particularly the `TaskStatus` enum, which includes a state specifically for this purpose:
```rust
pub enum TaskStatus {
    Pending,
    Running,
    AwaitingConfirmation(ActionPreview), // Paused for user approval
    Completed,
    Failed(String),
}
```
This ensures that while the agent provides powerful automation, the user always has the final say before any significant changes are made to their files, perfectly aligning with Spacedrive's core principles of safety and user control.
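In code, the preview-before-commit handoff might look like the following sketch (`ActionSystem`, `FileMoveAction`, and the method names are stand-ins for the real Action System API):

```rust
/// Submit a batch move for preview, park the task until the user approves,
/// and only then commit to the durable job queue. A sketch, not the real API.
async fn propose_batch_move(actions: &ActionSystem, task: &mut AgentTask, batch: FileMoveAction) {
    // First "Act": simulate only; nothing touches the filesystem yet.
    let preview = actions.preview(&batch).await;
    task.status = TaskStatus::AwaitingConfirmation(preview);

    // Later, triggered by the user's explicit approval in the UI:
    // actions.commit(batch).await; // enqueued as a durable job
}
```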
### Agent Activity Events and UI "Follow" Mode
To make the AI feel like a first-class citizen within the application, the UI needs to be able to observe and react to an agent's activity in real-time. This is accomplished by tagging agent-initiated commands and emitting specific events that the frontend can subscribe to.
- **Agent-Aware CLI Invocation:**
When the **Execution Coordinator** spawns the `sd` CLI sidecar for an agent, it injects a unique **Task ID** into the new process's environment. When the CLI connects to the core daemon via IPC, it passes this Task ID along with its command. This allows the core to differentiate between commands initiated by a human user and those initiated by a specific agent task.
- **Dedicated Agent Event Stream:**
Once the core identifies a command as part of an agent task, it emits structured events to the main **EventBus**. The frontend can listen for these specific events to power a "follow mode."
```rust
use uuid::Uuid;

// Example of events the frontend can subscribe to. `ActionPreview` comes
// from the Transactional Action System.
pub enum AgentActivityEvent {
    /// Emitted when an agent's plan is ready for user review.
    AwaitingApproval {
        task_id: Uuid,
        preview: ActionPreview,
    },
    /// Emitted when an agent runs a new command.
    CommandExecuted {
        task_id: Uuid,
        command_string: String, // e.g., "sd search --type image"
    },
    /// Emitted when the agent receives a result from a command.
    CommandResultReceived {
        task_id: Uuid,
        stdout: Vec<String>,
    },
    /// Emitted when the agent provides a final summary.
    TaskCompleted {
        task_id: Uuid,
        summary: String,
    },
}
```
- **Frontend "Follow Mode":**
By subscribing to this event stream, the frontend can create a rich, interactive experience:
- **Live Console:** A dedicated panel can show a real-time log of the commands the agent is executing and the results it's receiving.
- **UI Highlighting:** The main file browser can visually highlight the files and folders the agent is currently processing.
- **Interactive Prompts:** When an `AwaitingApproval` event is received, the UI can display the `ActionPreview` in a modal, allowing the user to approve or deny the agent's plan.
- **Progress Indicators:** For long-running batch operations, the UI can display progress bars and status messages, giving the user clear insight into the agent's work.
### 5. Asynchronous, Agent-Driven Search
To enhance the user experience and create a more scalable, consistent, and responsive system, all search functionality will be unified into a single asynchronous workflow. This design applies to searches initiated by both the AI agent and directly by the user in the UI. It moves away from a direct query-response model and instead leverages the Job System to provide a more robust and interactive search experience.
#### 5.1. The Agent's Role: Initiating and Monitoring
The agent's primary role in the search process is to _initiate_ and _monitor_ the search, rather than directly consuming the results. This creates a clean separation of concerns and allows the agent to continue performing other tasks while the search is in progress.
1. **New `SearchAction`**: A new `Search(String)` variant will be added to the `Action` enum. The `String` payload will contain the user's natural language search query.
2. **Dispatching the Action**: To initiate a search, the agent will generate a command for the `sd` CLI, such as:
```bash
sd --action search "find all my photos from my last vacation to Japan"
```
3. **Monitoring Job Status**: The agent will monitor the progress of the search by subscribing to events from the Job System. The agent will receive events like `JobCreated`, `JobInProgress`, and `JobCompleted`, allowing it to track the search without being blocked.
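The monitoring half might look like this sketch, assuming a `futures` event stream and a hypothetical `JobEvent` mirror of the events named above:

```rust
use futures::{Stream, StreamExt};
use uuid::Uuid;

/// Hypothetical mirror of the Job System's event notifications.
pub enum JobEvent {
    JobCreated { id: Uuid },
    JobInProgress { id: Uuid },
    JobCompleted { id: Uuid },
}

/// Wait for a specific search job to finish without blocking other agent work.
async fn await_search_completion(mut events: impl Stream<Item = JobEvent> + Unpin, query_id: Uuid) {
    while let Some(event) = events.next().await {
        if let JobEvent::JobCompleted { id } = event {
            if id == query_id {
                break; // results are cached backend-side, never returned here
            }
        }
    }
}
```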
#### 5.2. The Job System: Making Search Asynchronous
The Job System is the core component that enables the asynchronous nature of our search functionality.
1. **`SearchJob`**: When the `ActionManager` receives a `SearchAction`, it will not execute the search directly. Instead, it will register a new `SearchJob` with the `JobManager`.
2. **`query_id` Generation**: The `JobManager` will assign a unique `job_id` to the new `SearchJob`. This `job_id` will also serve as the `query_id` for the search, allowing us to track the search and its results throughout the system.
#### 5.3. Caching and Frontend Interaction
This is where the user-facing part of the design comes into play. The frontend is responsible for fetching and rendering the search results, guided by events from the backend.
1. **Results Caching**: Upon completion, the `SearchJob` will not return the results to the agent. Instead, it will store the results in a cache (e.g., Redis or a database table), using the `query_id` as the key.
2. **`SearchResultsReady` Event**: Once the `SearchJob` is complete, the `JobManager` will emit a `SearchResultsReady(query_id)` event. This event will be broadcast to all subscribed clients, including the frontend.
3. **New GraphQL Endpoint**: The frontend will have a new GraphQL query, `getSearchResults(queryId: ID!)`.
4. **Rendering the Results**: When the frontend receives the `SearchResultsReady` event (likely via a WebSocket connection), it will use the `query_id` from the event to call the `getSearchResults` GraphQL endpoint. This endpoint will fetch the cached results and the frontend will then render them for the user.
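Tying 5.2 and 5.3 together, a backend sketch under stated assumptions (`JobManager`, `ResultCache`, `EventBus`, and `execute_search` are hypothetical stand-ins; the `job_id` doubles as the `query_id`):

```rust
use uuid::Uuid;

/// 1. The ActionManager handles a SearchAction by registering a job.
fn handle_search_action(jobs: &JobManager, query: String) -> Uuid {
    jobs.register(SearchJob { query }) // returned job_id == query_id
}

/// 2. The Job System later executes the job; results never flow to the agent.
async fn run_search_job(job: SearchJob, query_id: Uuid, cache: &ResultCache, bus: &EventBus) {
    let entry_ids = execute_search(&job.query).await; // the actual two-stage search
    cache.put(query_id, entry_ids);                   // keyed by query_id
    bus.emit(Event::SearchResultsReady(query_id));    // frontend fetches via GraphQL
}
```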
#### 5.4. Required Changes
To support this new asynchronous search architecture, the following changes will be required:
- **Action System (`src/infrastructure/actions/mod.rs`)**:
- Add a new `Search(String)` variant to the `Action` enum.
- **Job System (`src/infrastructure/jobs/mod.rs`)**:
- Create a new `SearchJob` type that encapsulates the logic for performing a search and caching the results.
- The `JobManager` must be able to handle and dispatch `SearchJob`s.
- A new `SearchResultsReady(query_id)` event must be added to the event system.
- **Caching Layer**:
- A new caching mechanism will be needed to store search results. The specific implementation (e.g., Redis, in-memory cache) will need to be determined.
- **GraphQL API (`src/interfaces/graphql/mod.rs`)**:
- A new `getSearchResults(queryId: ID!)` query must be added to the GraphQL schema and resolver. This endpoint will be responsible for fetching the search results from the cache.
- **Frontend**:
- The frontend must be able to subscribe to backend events (e.g., via WebSockets).
- The frontend must be updated to handle the `SearchResultsReady` event and call the new `getSearchResults` GraphQL endpoint to render the results.

View File

@@ -59,11 +59,11 @@ Action: Build the two-stage search architecture. First, implement fast temporal
Reasoning: This hybrid approach provides the speed of keyword search with the power of semantic understanding, achieving sub-100ms queries on consumer hardware as specified in the whitepaper.
2. Develop the AI Agent Manager:
2. Implement Extension-Based Agent System:
Action: Build the core Agent Manager as described in the design document. This manager will be responsible for orchestrating AI tasks and implementing the agentic loop (Observe, Orient, Decide, Act).
Action: Build the WASM extension runtime and SDK that enables specialized AI agents. This includes the agent context, memory systems (Temporal, Associative, Working), event subscription mechanism, and integration with the job system.
Reasoning: This provides the central "brain" for the AI system, capable of translating natural language queries and proactive observations into structured, verifiable Actions.
Reasoning: This provides the foundation for domain-specific intelligence through secure, sandboxed extensions. Each agent (Photos, Finance, Storage, etc.) can maintain its own knowledge base and react to VDFS events while using the same safe, transactional primitives as human users.
3. Implement the Virtual Sidecar System:

View File

@@ -8,7 +8,7 @@ The architecture is founded on four key principles:
1. **A Unified Data Model**: A virtual layer that treats all files as content-addressable objects with rich metadata[cite: 17].
2. **Safe & Predictable Operations**: A transactional system where all file operations can be simulated and previewed before execution[cite: 18, 19].
3. **Resilient Synchronization**: A leaderless, hybrid P2P model that avoids distributed consensus by separating data into different ownership domains[cite: 20, 21, 22].
4. **An AI-Native Agent Architecture**: A design where an AI agent can observe the file system index and propose organizational actions using the same safe, transactional model as human users[cite: 23, 24].
4. **Extension-Based Agent Architecture**: A design where specialized AI agents are implemented as WASM extensions with domain-specific knowledge, persistent memory systems, and event-driven processing capabilities. These agents observe the file system index and propose actions using the same safe, transactional primitives as human users.
The reference implementation is written in Rust and demonstrates high performance, with sub-100ms semantic search latency and an approximate memory footprint of 150 MB for libraries containing over one million files[cite: 26].
@@ -62,14 +62,16 @@ The indexer is a multi-phase pipeline that builds and maintains the VDFS.
#### 6. AI-Native Architecture
Spacedrive is designed with AI as a core component, not an add-on[cite: 434]. The VDFS index serves as a "world model" that an AI agent can reason about[cite: 435].
Spacedrive is designed with AI as a core component through its extension architecture. Rather than a single monolithic assistant, specialized AI agents are implemented as secure WASM extensions, each with domain-specific knowledge and persistent memory systems.
* **Agentic Loop (Observe, Orient, Act)**:
1. **Observe**: The AI observes the file index, which is enriched with semantic information (OCR text, image tags, video transcripts) generated by background analysis jobs[cite: 444, 445].
2. **Orient**: It analyzes this data and user history (from an audit log) to understand context and identify organizational patterns[cite: 448].
3. **Act**: The AI proposes actions (e.g., `FileCopyAction`, `BatchTagAction`) to the user[cite: 449]. Crucially, it uses the same safe, previewable Transactional Action System, ensuring the user always has final approval[cite: 450, 456].
* **Natural Language & Proactive Assistance**: This allows users to issue commands like "organize my tax documents from last year"[cite: 453]. The AI can also proactively suggest actions, such as offering to back up newly imported photos that lack redundancy or automatically sorting downloaded invoices into the correct folder[cite: 458, 462, 472].
* **Privacy-First AI**: The architecture is model-agnostic and supports local AI models via services like **Ollama**, ensuring user data never has to leave their device for analysis[cite: 438, 496].
* **Extension-Based Agents**: Agents are sandboxed WASM extensions with explicit permissions. A Photos agent, for example, performs face detection and clustering. A Finance agent processes receipts and invoices. Each agent has its own knowledge base and operates independently within a secure sandbox.
* **Memory Systems**: Extension agents maintain three types of persistent memory:
1. **Temporal Memory**: Time-ordered event logs enabling queries like "photos analyzed last week with beach scenes"
2. **Associative Memory**: Semantic knowledge graphs supporting similarity searches and cross-domain queries
3. **Working Memory**: Transactional state tracking for pending work and current operations
* **Event-Driven Processing**: Agents react to VDFS events through lifecycle hooks (`on_startup`, `on_event`, `scheduled`), enabling real-time processing as files are indexed without polling.
* **Safe Actions**: All agent-initiated operations use the same Transactional Action System as human users, with previews, validation, and audit trails. When a Photos agent identifies a face cluster, it generates a `BatchTagAction` complete with a preview showing exactly which files will be tagged.
* **Privacy and Security**: All agent processing occurs locally within WASM sandboxes. Face detection, scene classification, and other AI capabilities run on the user's device with explicit permission controls scoped to specific library locations.
#### 7. Temporal-Semantic Search

Binary file not shown.

View File

@@ -219,7 +219,7 @@ The architecture is built on four foundational principles that solve traditional
\item \textbf{Resilient Synchronization: A Leaderless Hybrid Model} (Section~\ref{sec:library-sync}): A synchronization model that eliminates leader-based bottlenecks by separating data into distinct domains. The filesystem index, being device-authoritative, is replicated using a simple state-based model where each device is the source of truth for its own files. Concurrently modified user metadata is merged deterministically using a lightweight, per-device change log ordered by a Hybrid Logical Clock (HLC).
\item \textbf{AI Agent Architecture} (Section~\ref{sec:ai-native}): An architecture in which an agent can observe the VDFS index and propose actions using the same transactional model as human-initiated operations. The agent follows an "Observe, Orient, Act" loop and generates previewable Actions for user approval.
\item \textbf{AI Agent Architecture} (Section~\ref{sec:ai-native}): An architecture where specialized AI agents are implemented as WASM extensions, each with domain-specific knowledge and persistent memory systems. These agents observe the VDFS index, react to events, and propose actions using the same transactional model as human-initiated operations.
\end{itemize}
We report a reference implementation in Rust. On consumer hardware, we observe sub-100 ms semantic search latency, synchronization via the Iroh networking stack, and an approximate memory footprint of 150 MB for libraries exceeding one million files.
@@ -1681,60 +1681,62 @@ This transparent, user-controlled approach to conflict resolution ensures that u
\begin{keytakeaways}
\begin{itemize}[noitemsep, topsep=0pt]
\item \textbf{AI-Native Design}: Your file index becomes a complete "world model" that AI agents can understand and reason about
\item \textbf{Natural Language}: Say "organize my tax documents" and the AI converts your intent into structured actions via the Action System
\item \textbf{Privacy-First AI}: Run models locally with Ollama for complete privacy, or use cloud AI with transparent controls
\item \textbf{Extension-Based Agents}: AI agents are implemented as sandboxed WASM extensions with domain-specific knowledge and persistent memory
\item \textbf{Event-Driven Intelligence}: Agents react to filesystem events, scheduled triggers, and user requests through a unified event system
\item \textbf{Privacy-First AI}: All agent processing runs locally within secure WASM sandboxes, with fine-grained permission controls
\end{itemize}
\end{keytakeaways}
While many systems treat AI as an additive feature, Spacedrive is architected as an \textbf{AI-native data space}. The complete, continuously updated index of the user's files serves as a practical "world model" for an AI agent to reason about. This enables a shift from reactive file management (issuing manual commands) to a collaborative model where both the user and an AI agent can manage the data space, with the human in the loop. This builds upon research in semantic file systems~\cite{gifford_sfs_1991}, information retrieval~\cite{dumais_stuff_2003}, and ubiquitous computing~\cite{weiser_ubiquitous_1991}.
While many systems treat AI as an additive feature, Spacedrive is architected as an \textbf{AI-native data space} where specialized intelligence is deeply integrated through its extension architecture. Rather than a single monolithic AI assistant, Spacedrive enables domain-specific agents—such as a Photos agent for face recognition and place clustering, or a Finance agent for receipt processing—each implemented as a secure WASM extension with its own knowledge base and capabilities. This modular approach builds upon research in semantic file systems~\cite{gifford_sfs_1991}, information retrieval~\cite{dumais_stuff_2003}, and ubiquitous computing~\cite{weiser_ubiquitous_1991}.
This is achieved through a flexible, privacy-first architecture that is model-agnostic, supporting both cloud services and local models running on user hardware via interfaces like Ollama.
Extensions provide a secure, sandboxed environment where agents can observe the continuously updated VDFS index, react to events, and propose actions through the same safe, transactional primitives available to human users. This architecture enables a shift from reactive file management to a collaborative model where specialized agents assist with domain-specific tasks while keeping the user in control.
\subsubsection{The AI-Native Advantage in Practice}
\subsubsection{Extension-Based Agents in Practice}
To illustrate, consider a designer, Alice. She asks Spacedrive: ``Find my untagged design projects from last fall.'' The AI agent observes that her project files from that period lack organization tags, a pattern it learned from her past actions stored in the audit log. Instead of just listing files, it decides on an action, proposing a pre-visualized \texttt{BatchTagAction} to organize these files with appropriate project tags. Later, observing her manually moving screenshots from Downloads to project folders, it proactively suggests a \texttt{FileCopyAction} automation rule. This is assistive automation that adapts while keeping Alice in control.
Spacedrive's agent architecture is demonstrated through specialized extensions like the Photos extension. When a user enables the Photos extension on their library, an intelligent agent begins operating in the background. As new photos are indexed, the agent receives \texttt{EntryCreated} events and adds them to an analysis queue. When the queue reaches a threshold (e.g., 50 photos), the agent dispatches an \texttt{AnalyzePhotosBatchJob} that runs face detection, scene classification, and place identification. The agent maintains persistent memory of face clusters, place associations, and scene patterns, enabling sophisticated queries like ``show me photos of Alice at the beach.''
\subsubsection{The Agentic Loop: Observe, Orient, Act}
Each extension agent is sandboxed within a WASM runtime with explicit permissions. The Photos extension, for example, requires permissions to read image files, write sidecars for face detection results, add tags, and use local face detection models. Users grant these permissions on a per-location basis, ensuring the agent only processes photos in directories they've explicitly authorized.
Spacedrive's AI capabilities are built on a classic agentic loop, where each stage is powered by a core component of the VDFS architecture:
\subsubsection{The Agent Architecture: Memory, Events, and Actions}
\textbf{Observe}: The Indexing System is the sensory input. During the \textbf{deep indexing phase}, it goes beyond basic metadata to perform AI-powered analysis, extracting rich context like image content, video transcripts, and document summaries. This enriches the Spacedrive index, providing the AI with a deep understanding of the user's data.
Extension-based agents operate through three core primitives:
\textbf{Orient}: With a complete "world model" in its index, the AI can orient itself. It analyzes file content, user-applied metadata (tags, ratings), and historical user actions (from the audit\_log table) to understand context, identify patterns, and recognize organizational inconsistencies.
\textbf{Memory Systems}: Each agent maintains persistent knowledge through specialized memory types. \textit{Temporal Memory} stores time-ordered events (e.g., "photo analyzed, 2 faces detected, scene: beach"), enabling queries like "photos analyzed last week with beach scenes." \textit{Associative Memory} maintains a semantic knowledge graph (e.g., face clusters, place associations) that supports similarity searches and cross-domain queries like "places where Alice frequently appears." \textit{Working Memory} tracks current state and pending work transactionally.
\textbf{Decide \& Act}: The AI formulates a plan and proposes it as a structured Action. This is a critical safety and control mechanism; the AI does not execute arbitrary commands but is constrained to the same safe, verifiable primitives available to the user. A user command like "Archive my old projects from last year that are over 1GB" is translated directly into a FileCopyAction.
\textbf{Event-Driven Processing}: Agents react to the VDFS through lifecycle hooks: \texttt{on\_startup} initializes the agent, \texttt{on\_event} handlers respond to filesystem events like \texttt{EntryCreated}, and \texttt{scheduled} tasks run on cron-like schedules. This reactive architecture enables agents to maintain up-to-date knowledge without polling, processing new data as it arrives.
\subsubsection{Natural Language Management}
\textbf{Safe Actions}: Agents propose changes through the same Action System used by human users. When the Photos agent identifies a face cluster, it generates a \texttt{BatchTagAction} to tag relevant photos, complete with a preview showing exactly which files will be modified. All agent-initiated actions are auditable, reversible, and require the same validation as user commands.
The Action System serves as a stable, well-defined API that can be used to fine-tune language models. This allows Spacedrive to translate complex user requests from natural language into a series of verifiable actions.
\subsubsection{Domain-Specific Intelligence}
As we saw with Alice's request to ``find design assets from last fall that I never exported,'' the system translates natural language into precise operations. Similarly, a command like ``Move my last 3 screen recordings from the desktop to the 'Clips' folder on my NAS'' is processed through semantic search to identify the relevant files, then translated into a structured \texttt{FileCopyAction} with appropriate source paths, destination, and move semantics.
The extension architecture enables specialized agents tailored to specific domains rather than a single general-purpose assistant. A Finance extension can implement intelligent receipt processing: when a PDF invoice appears in an Ingest location, the Finance agent's \texttt{on\_event} handler triggers OCR extraction, uses a language model to parse merchant details and amounts, then proposes a \texttt{FileMoveAction} to organize it into the appropriate \texttt{/Finances/Vendors/[Merchant]} directory structure.
The generated action is processed through the Action System (Section~\ref{sec:action-system}), inheriting its safety guarantees including preview and durability. The AI serves as an interpreter rather than an opaque automaton.
Similarly, a Photos extension implements sophisticated computer vision workflows: face detection with clustering, scene classification, place identification from GPS data, and automatic moment generation. Each domain-specific agent brings specialized models and processing logic while operating within the secure, permission-controlled WASM sandbox.
\subsubsection{Proactive Assistance and Optimization}
This modular approach provides several advantages over monolithic AI assistants. Agents can be independently developed, tested, and updated. Users install only the agents relevant to their workflows. Each agent's permissions are explicitly scoped, enhancing security. And domain-specific knowledge enables more accurate and efficient processing than general-purpose models.
Beyond executing commands, the AI agent can proactively identify opportunities to help the user. By observing patterns, it can suggest helpful actions.
\subsubsection{Proactive Pattern Recognition}
\textbf{Organizational Suggestions}: As demonstrated in Alice's workflow, when the AI observed her repeatedly moving screenshots from the Desktop to project folders, it proactively offered to automate this pattern. The architecture enables such capabilities---if the indexer identifies a screen recording on the Desktop and the agent observes from historical actions that the user consistently moves such files to a \texttt{\textasciitilde/Videos/Screen Recordings} folder, it could generate a suggested \texttt{FileCopyAction} for the user to approve with a single click.
Extension agents leverage their persistent memory systems to identify patterns and suggest optimizations. An Organization agent can observe repeated user actions in the \texttt{audit\_log}—such as consistently moving screenshots from Desktop to project-specific folders—and proactively offer automation rules. When a new screenshot appears, the agent proposes a \texttt{FileCopyAction} with a confidence score based on historical patterns: "I noticed you typically move screenshots like this to \texttt{/Projects/DesignWork/Screenshots}. Would you like me to move this one?"
\textbf{Deduplication Opportunities}: The agent can periodically scan for duplicated content across devices and suggest a "cleanup" action that consolidates files and frees up space, with the Action System showing exactly what will be removed.
\textbf{Deduplication Intelligence}: A Storage agent can periodically scan for duplicated content across devices and suggest cleanup actions that consolidate files and free space, with the Action System's preview showing exactly which files will be removed and where backups exist.
\textbf{Data Guardian Mode}: The AI leverages the Data Guardian capability (Section~\ref{sec:content-identity}) to monitor file redundancy. When Alice imports her daughter's graduation photos, the system detects these as single-copy files. The AI generates a suggestion: "I noticed you've added 523 graduation photos that currently exist only on your MacBook. These precious memories could be lost if your laptop fails. Would you like me to create backups on your Home NAS and Cloud Storage?"
The AI system analyzes user behavior patterns from the \texttt{audit\_log} table to identify organizational preferences, then suggests actions when files violate established patterns. Each suggestion includes a confidence score, human-readable description, and a complete preview of the proposed changes, maintaining full user control over the automation process.
\textbf{Data Guardian Monitoring}: A Backup agent leverages the Content Identity system (Section~\ref{sec:content-identity}) to monitor file redundancy. When new photos are imported with zero redundancy, the agent generates a suggestion: "I noticed you've added 523 photos that currently exist only on your MacBook. These files could be lost if your device fails. Would you like me to create backups on your NAS and Cloud Storage?" The agent maintains knowledge of backup preferences and device availability in its associative memory, enabling intelligent, context-aware suggestions.
\paragraph{Ingestion Sorting}
\paragraph{Ingestion Processing}
\label{sec:ai-native-ingestion}
The AI agent's proactive capabilities are particularly powerful when applied to the file ingestion workflow (Section~\ref{sec:ingestion-workflow}). When new files arrive in the user's designated Ingest Location, the AI's agentic loop is triggered:
Extension agents are particularly powerful when monitoring Ingest Locations (Section~\ref{sec:ingestion-workflow}). A Finance agent, for example, can register an \texttt{on\_event} handler for \texttt{EntryCreated} events in designated ingest directories. When a new PDF arrives, the agent:
\begin{itemize}[noitemsep, topsep=0pt]
\item \textbf{Observe}: The AI detects new \texttt{Entry} records in the Ingest Location.
\item \textbf{Orient}: It performs content analysis on the new files (e.g., identifying a PDF as a receipt, a PNG as a screenshot of a specific application) and cross-references this with the user's historical organization patterns from the \texttt{audit\_log} table. For instance, it may notice that PDFs with the word "Invoice" are consistently moved to a "\texttt{/Finances/Invoices}" directory.
\item \textbf{Decide \& Act}: Based on this analysis, the AI formulates and proposes a \texttt{FileCopyAction} or \texttt{FileMoveAction} to the user. The user is presented with a clear suggestion: "I noticed a new invoice landed in your Inbox. Would you like me to move it to your 'Invoices' folder?" This suggestion is a standard, pre-visualized Action that the user can approve with a single click, ensuring human-in-the-loop control over all automated organization.
\item Receives the \texttt{EntryCreated} event with the new Entry
\item Dispatches an OCR job to extract text content
\item Analyzes the extracted text to identify the document type (invoice, receipt, statement)
\item Queries its associative memory for similar documents and their historical organization patterns
\item Proposes a \texttt{FileMoveAction} to the appropriate destination directory
\end{itemize}
The user sees a notification: "New invoice from Acme Corp detected. Move to \texttt{/Finances/Invoices/2024/Acme}?" This suggestion is a standard, pre-visualized Action that can be approved with a single click, maintaining human-in-the-loop control while automating repetitive organization tasks.
\subsubsection{File Intelligence via Virtual Sidecars}
The AI agent's ability to "Observe" the user's data space is powered by the Virtual Sidecar System. The background intelligence jobs use purpose-specific models to enrich the VDFS with structured information:
Extension agents enrich the VDFS through the Virtual Sidecar System. Background jobs dispatched by agents use purpose-specific models to extract structured information:
\begin{itemize}
\item \textbf{Text Embeddings}: Lightweight models like all-MiniLM-L6-v2 for semantic search
\item \textbf{OCR}: Tesseract or EasyOCR for text extraction
@@ -1745,29 +1747,31 @@ These specialized models are far more efficient than general-purpose LLMs while
\textbf{Image Object Extraction}: An \texttt{ImageAnalysisJob} processes image files. Using a multimodal model, it identifies objects and concepts within the image (e.g., "dog," "beach," "sunset"). These results are not stored in a sidecar, but are instead applied directly as Tags to the Entry's \texttt{UserMetadata} record. This integrates AI analysis into the user's organizational structure and makes images searchable via existing tag filters.
\textbf{OCR and Transcription}: For images and PDF documents, an \texttt{OcrJob} is triggered. It extracts all textual content and saves it to a structured sidecar file (e.g., \texttt{ocr.json}). Similarly, a \texttt{TranscriptionJob} uses a speech-to-text model on audio and video files to produce a \texttt{transcript.json} sidecar. The text content from these sidecars is then ingested into the Temporal-Semantic Search FTS5 index, making the content of non-text files fully searchable. A user can now find a photo of a receipt by searching for the vendor's name, or find a video by searching for a phrase spoken within it.
\textbf{OCR and Transcription}: For images and PDF documents, an \texttt{OcrJob} is triggered by relevant agents. It extracts all textual content and saves it to a structured sidecar file (e.g., \texttt{ocr.json}). Similarly, a \texttt{TranscriptionJob} uses a speech-to-text model on audio and video files to produce a \texttt{transcript.json} sidecar. The text content from these sidecars is then ingested into the Temporal-Semantic Search FTS5 index, making the content of non-text files fully searchable. A user can now find a photo of a receipt by searching for the vendor's name, or find a video by searching for a phrase spoken within it.
This system transforms a simple collection of files into a rich, interconnected knowledge base that the AI agent can reason about, all while maintaining a local-first, privacy-preserving architecture.
This system transforms a simple collection of files into a rich, interconnected knowledge base that extension agents can reason about, all while maintaining a local-first, privacy-preserving architecture within secure WASM sandboxes.
\subsubsection{\plannedSection{AI-Driven Tiering Suggestions}}
\subsubsection{\plannedSection{Intelligent Storage Tiering}}
The VDFS's native understanding of \texttt{StorageClass} provides a foundation for AI assistance. Instead of managing storage in an opaque way, the AI agent's role is to analyze access patterns and suggest changes to a Location's core \texttt{StorageClass} property.
The VDFS's native understanding of \texttt{StorageClass} enables Storage extension agents to analyze access patterns and suggest tiering changes. A Storage agent can monitor file access metadata and maintain temporal memory of usage patterns across the library.
Consider Bob, a photographer: Spacedrive's AI notices that RAW photo shoots from 2023, tagged as "delivered," haven't been accessed in months.
Consider Bob, a photographer with a Storage agent installed: The agent's scheduled task runs weekly, querying its temporal memory for files with low access frequency. It identifies RAW photo shoots from 2023, tagged as "delivered," that haven't been opened in six months.
\textbf{Action Proposal}: "I can re-classify 8 completed photo shoots (1.2TB) as \textbf{Cold Storage}, moving them to your NAS archive. This will free up space on your main SSD. These files will remain fully searchable, but access will take longer. Do you approve?"
\textbf{Agent Proposal}: "I can re-classify 8 completed photo shoots (1.2TB) as \textbf{Cold Storage}, moving them to your NAS archive. This will free up space on your main SSD. These files will remain fully searchable, but access will take longer. Do you approve?"
When Bob approves, a standard \texttt{FileCopyAction} is generated and committed to the durable job queue. The AI acts as an advisor, but the operation itself uses the safe, transparent, and verifiable primitives of the VDFS and Action System.
When Bob approves, the agent dispatches a standard \texttt{FileCopyAction} to the job queue. The agent acts as an advisor, but the operation uses the same safe, transparent, and verifiable primitives as manual user commands.
The storage tiering system analyzes access patterns and storage costs to suggest \texttt{StorageClass} assignments. When the AI detects that files in a \texttt{Hot} location haven't been accessed for extended periods, it can propose reclassification to \texttt{Cold} or \texttt{Deep} storage. Similarly, if files in \texttt{Cold} storage see increased access, the AI can suggest promoting them back to \texttt{Hot} storage. This human-in-the-loop approach ensures users maintain control while benefiting from automation.
Storage agents can also detect the inverse pattern: when files in \texttt{Cold} storage see increased access, the agent can propose promoting them back to \texttt{Hot} storage. This human-in-the-loop approach ensures users maintain control while benefiting from intelligent automation.
\subsubsection{Privacy-First AI Architecture}
\subsubsection{Privacy and Security Through Sandboxing}
This AI framework clearly separates concerns between search (lightweight embeddings) and intelligent assistance (LLMs). For the AI agent functionality—natural language understanding, action generation, and proactive suggestions—users can choose:
The extension-based agent architecture provides strong security guarantees through WASM sandboxing and explicit permission controls. Each extension agent operates in an isolated WASM runtime with access limited to explicitly granted permissions. When a user installs a Photos extension, they must grant specific permissions: read access to image files, write access to face detection sidecars, tag creation privileges, and the right to use local face detection models. These permissions can be further scoped to specific library locations, ensuring the agent only processes files in authorized directories.
The AI provider interface supports multiple deployment models: local processing via Ollama for complete privacy, cloud-based services for enhanced capabilities, and enterprise self-hosted solutions for organizational control. This flexibility ensures users can balance privacy, performance, and functionality according to their specific requirements.
All agent processing occurs locally within the WASM sandbox. Face detection models, scene classifiers, and other AI capabilities run on the user's device, ensuring sensitive data never leaves their control. The extension SDK provides access to local model inference through a secure host function interface that enforces rate limits and resource constraints.
This architecture aims for a personal, private data space—one where AI enhances capability without compromising control or privacy.
This architecture separates concerns between lightweight embedding models for search (which operate on the core VDFS) and specialized agent intelligence (which runs in isolated extensions). The AI provider interface supports multiple deployment models: local processing via Ollama for complete privacy, cloud-based services for enhanced capabilities, and enterprise self-hosted solutions for organizational control.
This approach creates a personal, private data space where AI capabilities enhance functionality without compromising control, privacy, or security.
\paragraph{Ethical Considerations}
While model-agnostic, Spacedrive prioritizes ethical AI use. Local models mitigate bias by training on user data only, but users are notified of potential limitations (e.g., underrepresented demographics in embeddings). Cloud options include opt-out for sensitive files, ensuring compliance with regulations like GDPR.
@@ -1779,19 +1783,19 @@ While model-agnostic, Spacedrive prioritizes ethical AI use. Local models mitiga
\begin{keytakeaways}
\begin{itemize}[noitemsep, topsep=0pt]
\item \textbf{Asynchronous by Design}: All searches are durable jobs, preventing UI blocking and enabling agents to initiate searches without waiting for results.
\item \textbf{Asynchronous by Design}: All searches are durable jobs, preventing UI blocking and enabling extension agents to initiate searches without waiting for results.
\item \textbf{Hybrid Architecture}: A high-speed FTS5 keyword filter pre-processes candidates before lightweight semantic re-ranking.
\item \textbf{Decoupled Results}: The backend caches results against a query ID; the frontend is notified to fetch and render them, decoupling the agent from the UI.
\item \textbf{Decoupled Results}: The backend caches results against a query ID; the frontend is notified to fetch and render them, decoupling agents from the UI.
\end{itemize}
\end{keytakeaways}
Spacedrive's search architecture is engineered for a responsive, agent-driven environment. It treats search not as a synchronous request-response loop, but as a durable, asynchronous job that is initiated, executed, and cached by the backend, with the frontend being notified upon completion. This model applies uniformly to searches originating from both direct user input and AI agent directives.
Spacedrive's search architecture is engineered for a responsive, agent-driven environment. It treats search not as a synchronous request-response loop, but as a durable, asynchronous job that is initiated, executed, and cached by the backend, with the frontend being notified upon completion. This model applies uniformly to searches originating from both direct user input and extension agent directives.
\subsubsection{The Search Lifecycle: An Action-Driven Workflow}
A search operation follows a formal lifecycle managed by core VDFS components:
\begin{enumerate}[noitemsep, topsep=0pt]
\item \textbf{Action Dispatch}: A user or agent initiates a search by dispatching a \texttt{SearchAction} containing the query parameters.
\item \textbf{Action Dispatch}: A user or extension agent initiates a search by dispatching a \texttt{SearchAction} containing the query parameters.
\item \textbf{Job Registration}: The \texttt{ActionManager} receives this action and registers a new \texttt{SearchJob} with the Durable Job System, returning a unique \texttt{query\_id}. This step is near-instantaneous, providing immediate feedback.
\item \textbf{Asynchronous Execution}: The \texttt{SearchJob} executes in the background, performing the two-stage search process without blocking the user or agent.
\item \textbf{Result Caching}: Upon completion, the job caches the ordered list of resulting \texttt{Entry} IDs in a temporary store, keyed by the \texttt{query\_id}.
@@ -1799,7 +1803,7 @@ A search operation follows a formal lifecycle managed by core VDFS components:
\item \textbf{Result Rendering}: The frontend, subscribed to these events, receives the notification and uses the \texttt{query\_id} to fetch the cached results via the GraphQL API for rendering.
\end{enumerate}
This architecture ensures the agent's role is simply to initiate the search; it never needs to handle the results directly, maintaining a clean separation of concerns.
This architecture ensures an agent's role is simply to initiate the search; it never needs to handle the results directly, maintaining a clean separation of concerns.
\subsubsection{The Two-Stage Search Process}
The \texttt{SearchJob} executes a hybrid temporal-semantic query designed for performance on consumer hardware. This \textbf{Temporal-First, Vector-Enhanced} approach operates in two stages:
@@ -1913,8 +1917,8 @@ This architecture provides several key advantages:
The unified format ensures all intelligence—routing and content vectors—travels with the data, while the adaptive creation strategy prevents overhead in sparse areas of the filesystem. This enables million-file semantic search on consumer hardware by transforming an O(n) problem into an O(log n) traversal guided by semantic routing.
\paragraph{Integration with AI Agents}
The Vector Repository system integrates with Spacedrive's AI agents, enabling them to:
\paragraph{Integration with Extension Agents}
The Vector Repository system integrates with Spacedrive's extension agents, enabling them to:
\begin{itemize}
\item Navigate large filesystems using routing hints
\item Understand folder purposes through aggregate embeddings
@@ -2608,7 +2612,7 @@ WASM provides critical security guarantees:
Through the exposed VDFS API, plugins can:
\begin{itemize}
\item Define custom semantic content types with parsing logic
\item Create specialized AI agents for workflow automation
\item Implement specialized AI agents with domain-specific knowledge and memory
\item Add new actions to the transactional action system
\item Implement custom search providers and filters
\item Generate specialized thumbnails and previews
@@ -2690,7 +2694,7 @@ The WASM plugin system provides flexible extensibility:
\textbf{Plugin Instances}\\[0.2cm]
\textbullet\ Content Type Handlers\\
\textbullet\ Search Providers\\
\textbullet\ AI Agents\\
\textbullet\ Intelligent Agents (WASM)\\
\textbullet\ Custom Actions
};