---
id: AI-002
title: "Create Fine-Tuning Dataset for AI Agent"
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags:
- ai
- llm
- training-data
- research
whitepaper: "Section 4.6: AI-Native VDFS"
---
## Description
To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for this purpose.
The dataset will enable the AI agent to answer user questions accurately and translate natural language commands into structured API calls.
## Implementation Notes
The dataset will be created as a JSONL file (`training_data.jsonl`), where each line is a single JSON object representing one training example. (The examples below are pretty-printed for readability; in the actual file each record occupies one line.) We will generate two primary types of examples:
### 1. Question-Answer (QA) Pairs
These pairs will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

**Example:**
```json
{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}
```
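A small helper can take drafted pairs and append them to `training_data.jsonl` in the shape shown above. The sketch below is a minimal illustration in Python (assumed here as the data-tooling language; `write_qa_examples` is a hypothetical helper, not existing code):
```python
import json

def write_qa_examples(pairs, out_path="training_data.jsonl"):
    """Append question/answer pairs to the dataset, one JSON object per line."""
    with open(out_path, "a", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"type": "qa", "question": question, "answer": answer}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative pair; real pairs would be drafted from the whitepaper and docs.
write_qa_examples([
    (
        "What is the dual purpose of the Content Identity system in Spacedrive?",
        "It eliminates storage waste through intelligent deduplication and tracks "
        "redundancy across all devices as part of a holistic data protection strategy.",
    ),
])
```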
### 2. Natural Language to GraphQL Pairs
These pairs will train the model to act as a "semantic parser," translating user requests into API calls for a hypothetical GraphQL endpoint.

**Example:**
```json
{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
```
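Since a single malformed query can degrade the fine-tune, it is worth syntax-checking each `graphql_query` before writing it out. A minimal sketch, assuming the `graphql-core` Python package is available (it checks syntax only; the hypothetical Spacedrive schema is not validated):
```python
import json

from graphql import GraphQLSyntaxError, parse  # graphql-core (assumed dependency)

def add_graphql_example(nl_query, graphql_query, out_path="training_data.jsonl"):
    """Append a text-to-GraphQL pair, skipping queries that do not even parse."""
    try:
        parse(graphql_query)  # syntax check only, no schema validation
    except GraphQLSyntaxError as err:
        print(f"Skipping malformed query: {err}")
        return
    record = {
        "type": "text-to-graphql",
        "natural_language_query": nl_query,
        "graphql_query": graphql_query,
    }
    with open(out_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```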
### Dataset Size Recommendations
- **Proof of Concept:** 100-500 examples to validate the approach.
- **Usable Prototype:** 1,000-5,000 examples for a reliable internal tool.
- **Production-Ready:** 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset.
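Before the dataset is handed off for a fine-tuning run, a quick sanity check can confirm that every line parses as JSON and that both example types are represented (mirroring the acceptance criteria below). A possible sketch:
```python
import json
from collections import Counter
from pathlib import Path

def dataset_stats(path="training_data.jsonl"):
    """Report per-type example counts and flag any lines that are not valid JSON."""
    counts, bad_lines = Counter(), 0
    if not Path(path).exists():
        print(f"{path} not found")
        return counts
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                counts[json.loads(line).get("type", "unknown")] += 1
            except json.JSONDecodeError:
                bad_lines += 1
                print(f"Line {line_no} is not valid JSON")
    print(dict(counts), f"({bad_lines} invalid lines)")
    return counts

if __name__ == "__main__":
    dataset_stats()
```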
## Acceptance Criteria
- A `training_data.jsonl` file is created in the project root.
- The file contains at least 300 high-quality examples (target range: 300-500).
- The dataset includes a mix of both QA and text-to-GraphQL pairs.
- The data is sufficient to begin an initial fine-tuning experiment.