---
id: AI-002
title: "Create Fine-Tuning Dataset for AI Agent"
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags:
- ai
- llm
- training-data
- research
whitepaper: "Section 4.6: AI-Native VDFS"
---
## Description
To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for this purpose.
The dataset will enable the AI agent to answer user questions accurately and translate natural language commands into structured API calls.
## Implementation Notes
The dataset will be created as a JSONL file (`training_data.jsonl`), where each line is a single JSON object representing one training example. (The examples below are pretty-printed for readability; in the actual file each record occupies one line.) We will generate two primary types of examples:
### 1. Question-Answer (QA) Pairs
These pairs will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

**Example:**
```json
{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}
```
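A small helper can take drafted pairs and append them to `training_data.jsonl` in the shape shown above. The sketch below is a minimal illustration in Python (assumed here as the data-tooling language; `write_qa_examples` is a hypothetical helper, not existing code):
```python
import json

def write_qa_examples(pairs, out_path="training_data.jsonl"):
    """Append question/answer pairs to the dataset, one JSON object per line."""
    with open(out_path, "a", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"type": "qa", "question": question, "answer": answer}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative pair; real pairs would be drafted from the whitepaper and docs.
write_qa_examples([
    (
        "What is the dual purpose of the Content Identity system in Spacedrive?",
        "It eliminates storage waste through intelligent deduplication and tracks "
        "redundancy across all devices as part of a holistic data protection strategy.",
    ),
])
```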
### 2. Natural Language to GraphQL Pairs
These pairs will train the model to act as a "semantic parser," translating user requests into API calls for a hypothetical GraphQL endpoint.

**Example:**
```json
{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
```
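Since a single malformed query can degrade the fine-tune, it is worth syntax-checking each `graphql_query` before writing it out. A minimal sketch, assuming the `graphql-core` Python package is available (it checks syntax only; the hypothetical Spacedrive schema is not validated):
```python
import json

from graphql import GraphQLSyntaxError, parse  # graphql-core (assumed dependency)

def add_graphql_example(nl_query, graphql_query, out_path="training_data.jsonl"):
    """Append a text-to-GraphQL pair, skipping queries that do not even parse."""
    try:
        parse(graphql_query)  # syntax check only, no schema validation
    except GraphQLSyntaxError as err:
        print(f"Skipping malformed query: {err}")
        return
    record = {
        "type": "text-to-graphql",
        "natural_language_query": nl_query,
        "graphql_query": graphql_query,
    }
    with open(out_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```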
### Dataset Size Recommendations
- **Proof of Concept:** 100-500 examples to validate the approach.
- **Usable Prototype:** 1,000-5,000 examples for a reliable internal tool.
- **Production-Ready:** 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset.
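Before the dataset is handed off for a fine-tuning run, a quick sanity check can confirm that every line parses as JSON and that both example types are represented (mirroring the acceptance criteria below). A possible sketch:
```python
import json
from collections import Counter
from pathlib import Path

def dataset_stats(path="training_data.jsonl"):
    """Report per-type example counts and flag any lines that are not valid JSON."""
    counts, bad_lines = Counter(), 0
    if not Path(path).exists():
        print(f"{path} not found")
        return counts
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                counts[json.loads(line).get("type", "unknown")] += 1
            except json.JSONDecodeError:
                bad_lines += 1
                print(f"Line {line_no} is not valid JSON")
    print(dict(counts), f"({bad_lines} invalid lines)")
    return counts

if __name__ == "__main__":
    dataset_stats()
```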
## Acceptance Criteria
- A `training_data.jsonl` file is created in the project root.
- The file contains at least 300 high-quality examples (target range: 300-500).
- The dataset includes a mix of both QA and text-to-GraphQL pairs.
- The data is sufficient to begin an initial fine-tuning experiment.