spacedrive/.tasks/AI-002-create-finetuning-dataset.md
2025-09-06 21:00:37 -04:00


id: AI-002
title: Create Fine-Tuning Dataset for AI Agent
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags: ai, llm, training-data, research
whitepaper: Section 4.6: AI-Native VDFS

Description

To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for this purpose.

The dataset will enable the AI agent to answer user questions accurately and translate natural language commands into structured API calls.

Implementation Notes

The dataset will be created as a JSONL file (training_data.jsonl), where each line is a JSON object representing a training example. We will generate two primary types of examples:
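The JSON Lines layout itself is simple: one JSON object per line. A minimal writer might look like the sketch below (illustrative only; `write_jsonl` is a hypothetical helper, not existing project code):

```python
import json

def write_jsonl(examples, path="training_data.jsonl"):
    # One JSON object per line -- the JSON Lines format expected by most
    # fine-tuning pipelines. ensure_ascii=False keeps non-ASCII text readable.
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Example usage with a QA record shaped like the ones in this task:
write_jsonl([
    {"type": "qa",
     "question": "What is the dual purpose of the Content Identity system?",
     "answer": "It deduplicates storage and tracks redundancy across devices."},
])
```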

1. Question-Answer (QA) Pairs

These pairs will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

Example:

{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}

2. Natural Language to GraphQL Pairs

These pairs will train the model to act as a "semantic parser," translating user requests into API calls for a hypothetical GraphQL endpoint.

Example:

{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
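Since malformed lines can silently poison a fine-tuning run, a per-line validator is worth having from the start. This is a sketch, assuming exactly the two record shapes shown in this task:

```python
import json

# Required keys per example type, taken from the record shapes above.
REQUIRED_KEYS = {
    "qa": {"type", "question", "answer"},
    "text-to-graphql": {"type", "natural_language_query", "graphql_query"},
}

def validate_line(line):
    ex = json.loads(line)  # raises ValueError on malformed JSON
    keys = REQUIRED_KEYS.get(ex.get("type"))
    if keys is None:
        raise ValueError(f"unknown example type: {ex.get('type')!r}")
    missing = keys - ex.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return ex
```

Running this over every line of training_data.jsonl before an experiment catches truncated or mistyped records early.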

Dataset Size Recommendations

  • Proof of Concept: 100-500 examples to validate the approach.
  • Usable Prototype: 1,000-5,000 examples for a reliable internal tool.
  • Production-Ready: 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset.

Acceptance Criteria

  • A training_data.jsonl file is created in the project root.
  • The file contains at least 300 high-quality examples, targeting the 300-500 range.
  • The dataset includes a mix of both QA and text-to-GraphQL pairs.
  • The data is sufficient to begin an initial fine-tuning experiment.