---
id: AI-002
title: "Create Fine-Tuning Dataset for AI Agent"
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags:
  - ai
  - llm
  - training-data
  - research
whitepaper: "Section 4.6: AI-Native VDFS"
---
## Description

To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for that purpose.

The dataset will enable the AI agent to answer user questions accurately and to translate natural language commands into structured API calls.
## Implementation Notes

The dataset will be created as a JSONL file (`training_data.jsonl`), where each line is a JSON object representing a single training example. We will generate two primary types of examples:

### 1. Question-Answer (QA) Pairs

These pairs will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

**Example:**

```json
{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}
```
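One possible shape for the generation step is sketched below. The `docs/` directory, the `extract_qa_pairs` helper, and its heading-splitting heuristic are illustrative assumptions rather than existing tooling; in practice the answers would be curated or LLM-assisted instead of copied verbatim from the docs.

```python
import json
from pathlib import Path


def extract_qa_pairs(markdown_text: str) -> list[dict]:
    """Hypothetical helper: derive naive QA pairs from one documentation page.

    A real pipeline would likely use hand-written questions per whitepaper
    section or an LLM-assisted pass; this placeholder simply pairs each
    '## Heading' with its body text.
    """
    pairs = []
    for section in markdown_text.split("\n## ")[1:]:
        heading, _, body = section.partition("\n")
        body = body.strip()
        if body:
            pairs.append({
                "type": "qa",
                "question": f"What does the Spacedrive documentation say about {heading.strip()}?",
                "answer": body,
            })
    return pairs


def build_dataset(docs_dir: str = "docs", out_path: str = "training_data.jsonl") -> None:
    """Walk the docs tree and write one JSON object per line (JSONL)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for md_file in sorted(Path(docs_dir).rglob("*.md")):
            for example in extract_qa_pairs(md_file.read_text(encoding="utf-8")):
                out.write(json.dumps(example, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    build_dataset()
```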
### 2. Natural Language to GraphQL Pairs

These pairs will train the model to act as a "semantic parser," translating user requests into API calls for a hypothetical GraphQL endpoint.

**Example:**

```json
{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
```
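Hand-written query strings are easy to get subtly wrong, so each `graphql_query` should at least be syntax-checked before fine-tuning. A minimal sketch, assuming the `graphql-core` Python package is available; because the endpoint itself is hypothetical, only syntax (not schema conformance) can be verified:

```python
import json
import sys

from graphql import GraphQLSyntaxError, parse  # provided by the graphql-core package


def check_graphql_examples(path: str = "training_data.jsonl") -> int:
    """Return the number of text-to-graphql examples whose query fails to parse."""
    failures = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            example = json.loads(line)
            if example.get("type") != "text-to-graphql":
                continue
            try:
                # Syntax check only; there is no real schema to validate against yet.
                parse(example["graphql_query"])
            except GraphQLSyntaxError as err:
                failures += 1
                print(f"line {line_no}: invalid GraphQL: {err.message}", file=sys.stderr)
    return failures


if __name__ == "__main__":
    sys.exit(1 if check_graphql_examples() else 0)
```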
### Dataset Size Recommendations

- **Proof of Concept:** 100-500 examples to validate the approach.
- **Usable Prototype:** 1,000-5,000 examples for a reliable internal tool.
- **Production-Ready:** 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset.
## Acceptance Criteria

- A `training_data.jsonl` file is created in the project root.
- The file contains at least 300 high-quality examples (targeting 300-500 for the initial dataset).
- The dataset includes a mix of both QA and text-to-GraphQL pairs (a quick count check is sketched below).
- The data is sufficient to begin an initial fine-tuning experiment.
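As referenced above, a small consistency check can confirm the example count and type mix before the first fine-tuning run. This is a sketch only: the thresholds mirror the criteria, the `type` values match the examples in this document, and the script is not part of any existing tooling.

```python
import json
from collections import Counter


def summarize_dataset(path: str = "training_data.jsonl") -> Counter:
    """Count examples per `type` field, failing fast on malformed JSON lines."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError as err:
                raise SystemExit(f"line {line_no} is not valid JSON: {err}")
            counts[example.get("type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    counts = summarize_dataset()
    total = sum(counts.values())
    print(f"total examples: {total}", dict(counts))
    assert total >= 300, "expected at least 300 examples"
    assert counts["qa"] and counts["text-to-graphql"], "expected a mix of both example types"
```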