---
id: AI-002
title: "Create Fine-Tuning Dataset for AI Agent"
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags:
  - ai
  - llm
  - training-data
  - research
whitepaper: "Section 4.6: AI-Native VDFS"
---

## Description

To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task covers creating a high-quality dataset for that purpose. The dataset will enable the AI agent to answer user questions accurately and to translate natural language commands into structured API calls.

## Implementation Notes

The dataset will be created as a JSONL file (`training_data.jsonl`), where each line is a JSON object representing a single training example. We will generate two primary types of examples:

### 1. Question-Answer (QA) Pairs

These pairs teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

**Example:**

```json
{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}
```

### 2. Natural Language to GraphQL Pairs

These pairs train the model to act as a "semantic parser," translating user requests into API calls against a hypothetical GraphQL endpoint.

**Example:**

```json
{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
```

### Dataset Size Recommendations

- **Proof of Concept:** 100-500 examples to validate the approach.
- **Usable Prototype:** 1,000-5,000 examples for a reliable internal tool.
- **Production-Ready:** 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset; a small helper sketch for writing and validating the JSONL file is included at the end of this ticket.

## Acceptance Criteria

- A `training_data.jsonl` file is created in the project root.
- The file contains at least 300 (ideally 500) high-quality examples.
- The dataset includes a mix of both QA and text-to-GraphQL pairs.
- The data is sufficient to begin an initial fine-tuning experiment.
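
## Tooling Sketch

A minimal Python sketch of how examples could be appended to `training_data.jsonl` and sanity-checked against the two formats above. The helper names (`write_examples`, `validate_dataset`) and the required-field sets are illustrative assumptions for the Proof of Concept, not a finalized tool.

```python
"""Sketch: append examples to training_data.jsonl and validate their structure."""
import json
from pathlib import Path

# Required fields per example type, matching the formats described above.
REQUIRED_FIELDS = {
    "qa": {"type", "question", "answer"},
    "text-to-graphql": {"type", "natural_language_query", "graphql_query"},
}

DATASET_PATH = Path("training_data.jsonl")  # project root, per acceptance criteria


def write_examples(examples: list[dict], path: Path = DATASET_PATH) -> None:
    """Append examples to the JSONL file, one JSON object per line."""
    with path.open("a", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def validate_dataset(path: Path = DATASET_PATH) -> int:
    """Check that every line is valid JSON and has the fields its type requires."""
    count = 0
    with path.open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises if a line is not valid JSON
            required = REQUIRED_FIELDS.get(record.get("type"))
            if required is None:
                raise ValueError(f"line {line_no}: unknown type {record.get('type')!r}")
            missing = required - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            count += 1
    return count


if __name__ == "__main__":
    write_examples([
        {
            "type": "qa",
            "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
            "answer": "It eliminates storage waste through deduplication and tracks redundancy across all devices as part of a data protection strategy.",
        },
    ])
    print(f"{validate_dataset()} valid examples")
```

Keeping the validator alongside the generator means malformed lines are caught before a fine-tuning run rather than failing partway through training.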