spacedrive/.tasks/AI-002-create-finetuning-dataset.md
2025-09-06 21:00:37 -04:00


id: AI-002
title: Create Fine-Tuning Dataset for AI Agent
status: To Do
assignee: james
parent: AI-000
priority: Medium
tags: ai, llm, training-data, research
whitepaper: Section 4.6: AI-Native VDFS

Description

To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for this purpose.

The dataset will enable the AI agent to answer user questions accurately and translate natural language commands into structured API calls.

Implementation Notes

The dataset will be created as a JSONL file (training_data.jsonl), where each line is a JSON object representing a training example. We will generate two primary types of examples:
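The JSON Lines layout itself is simple: one JSON object per line. A minimal writer might look like the sketch below (illustrative only; `write_jsonl` is a hypothetical helper, not existing project code):

```python
import json

def write_jsonl(examples, path="training_data.jsonl"):
    # One JSON object per line -- the JSON Lines format expected by most
    # fine-tuning pipelines. ensure_ascii=False keeps non-ASCII text readable.
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Example usage with a QA record shaped like the ones in this task:
write_jsonl([
    {"type": "qa",
     "question": "What is the dual purpose of the Content Identity system?",
     "answer": "It deduplicates storage and tracks redundancy across devices."},
])
```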

1. Question-Answer (QA) Pairs

These pairs will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and technical documentation.

Example:

{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}

2. Natural Language to GraphQL Pairs

These pairs will train the model to act as a "semantic parser," translating user requests into API calls for a hypothetical GraphQL endpoint.

Example:

{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
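Since malformed lines can silently poison a fine-tuning run, a per-line validator is worth having from the start. This is a sketch, assuming exactly the two record shapes shown in this task:

```python
import json

# Required keys per example type, taken from the record shapes above.
REQUIRED_KEYS = {
    "qa": {"type", "question", "answer"},
    "text-to-graphql": {"type", "natural_language_query", "graphql_query"},
}

def validate_line(line):
    ex = json.loads(line)  # raises ValueError on malformed JSON
    keys = REQUIRED_KEYS.get(ex.get("type"))
    if keys is None:
        raise ValueError(f"unknown example type: {ex.get('type')!r}")
    missing = keys - ex.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return ex
```

Running this over every line of training_data.jsonl before an experiment catches truncated or mistyped records early.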

Dataset Size Recommendations

  • Proof of Concept: 100-500 examples to validate the approach.
  • Usable Prototype: 1,000-5,000 examples for a reliable internal tool.
  • Production-Ready: 10,000+ examples for a robust user-facing feature.

We will start by creating a Proof of Concept dataset.

Acceptance Criteria

  • A training_data.jsonl file is created in the project root.
  • The file contains at least 300 high-quality examples, targeting the 300-500 range.
  • The dataset includes a mix of both QA and text-to-GraphQL pairs.
  • The data is sufficient to begin an initial fine-tuning experiment.