# Datasets

## Create a dataset from file upload, Hugging Face, or Kaggle

**post** `/api/v1/datasets`

Unified ingest endpoint, discriminated by `source.type`: `"file"` returns upload instructions for a pre-signed S3 PUT; `"huggingface"` and `"kaggle"` start an async import.

### Body Parameters

- `source: object { file_format, name, type } or object { files, type, url } or object { files, type, url }` Dataset source configuration. Discriminated by `type`: file, huggingface, or kaggle.
  - `FileSourceDto = object { file_format, name, type }`
    - `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the file being uploaded
      - `"csv"`
      - `"json"`
      - `"jsonl"`
      - `"parquet"`
    - `name: string` Human-readable name for the dataset
    - `type: "file"` Source type
      - `"file"`
  - `HuggingfaceSourceDto = object { files, type, url }`
    - `files: array of string` File paths to download from the repository
    - `type: "huggingface"` Source type
      - `"huggingface"`
    - `url: string` Hugging Face dataset repository URL
  - `KaggleSourceDto = object { files, type, url }`
    - `files: array of string` File paths to download from the dataset
    - `type: "kaggle"` Source type
      - `"kaggle"`
    - `url: string` Kaggle dataset URL

### Returns

- `dataset_id: string` ID of the newly created dataset
- `status: string` Current dataset status
- `upload_instructions: optional object { method, s3_key, url }` Upload instructions for file sources. PUT your file to the provided URL.
  - `method: string` HTTP method to use
  - `s3_key: string` S3 object key — pass this back in the complete request if needed for verification
  - `url: string` Pre-signed URL for uploading the file

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "source": {
      "file_format": "csv",
      "name": "my-training-data",
      "type": "file"
    }
  }'
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "status": "status",
  "upload_instructions": {
    "method": "PUT",
    "s3_key": "s3_key",
    "url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
  }
}
```

## Get a dataset by ID

**get** `/api/v1/datasets/{dataset_id}`

### Path Parameters

- `dataset_id: string`

### Returns

- `Dataset = object { configured_column_mapping, created_at, dataset_id, 8 more }`
  - `configured_column_mapping: object { chat, completion, context, prompt }` User-configured column mapping. Null if not yet configured.
    - `chat: string`
    - `completion: string`
    - `context: array of string`
    - `prompt: string`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: object { message }` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: object { grade_after, grade_before, improvement_percent, 2 more }` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string` Letter grade (A-E) after augmentation
    - `grade_before: string` Letter grade (A-E) before augmentation
    - `improvement_percent: number` Relative improvement percentage
    - `score_after: number` Quality score after augmentation
    - `score_before: number` Quality score before augmentation
  - `name: string` Human-readable name for the dataset
  - `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
    - `percent: number` Progress percentage (0-100)
    - `processed_rows: number` Number of rows processed so far
    - `total_rows: number` Total rows to process (samples_to_process or row_count)
  - `row_count: number` Total number of rows in the dataset
  - `run_id: string` ID of the currently active run
  - `status: "pending" or "running" or "succeeded" or "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "configured_column_mapping": {
    "chat": "chat",
    "completion": "completion",
    "context": [
      "string"
    ],
    "prompt": "prompt"
  },
  "created_at": "2019-12-27T18:11:19.117Z",
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "evaluation_summary": {
    "grade_after": "grade_after",
    "grade_before": "grade_before",
    "improvement_percent": 0,
    "score_after": 0,
    "score_before": 0
  },
  "name": "name",
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "run_id": "run_id",
  "status": "pending",
  "updated_at": "2019-12-27T18:11:19.117Z"
}
```

## List datasets

**get** `/api/v1/datasets`

### Query Parameters

- `created_after: optional string` ISO 8601 datetime — datasets created after this time.
- `created_before: optional string` ISO 8601 datetime — datasets created before this time.
- `cursor: optional string` Cursor from the previous response's `next_cursor` field.
- `limit: optional number` Number of results (max 100, default 20). Used with cursor pagination.
- `q: optional string` Search by dataset name (case-insensitive contains).
- `sort: optional string` Sort field: created_at | updated_at | name (default: created_at).
- `sort_direction: optional string` Sort direction: asc | desc (default: desc).
- `status: optional string` Filter by status: pending | running | succeeded | failed

### Returns

- `datasets: array of object { created_at, dataset_id, status, 4 more }` Page of datasets
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Dataset ID
  - `status: "pending" or "running" or "succeeded" or "failed"` Dataset status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Last updated timestamp
  - `description: optional string` Auto-generated description of the dataset contents
  - `name: optional string` Dataset name
  - `row_count: optional number` Total number of rows
- `next_cursor: optional string` Cursor for the next page. Null when no more results.

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "datasets": [
    {
      "created_at": "2019-12-27T18:11:19.117Z",
      "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "pending",
      "updated_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "name": "My training data",
      "row_count": 1000
    }
  ],
  "next_cursor": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Get the processing status of a dataset

**get** `/api/v1/datasets/{dataset_id}/status`

### Path Parameters

- `dataset_id: string`

### Returns

- `dataset_id: string` Dataset ID
- `error: object { message }` Error details if the dataset failed. Null otherwise.
  - `message: string` Error message
- `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
  - `percent: number` Progress percentage (0-100)
  - `processed_rows: number` Number of rows processed so far
  - `total_rows: number` Total rows to process (samples_to_process or row_count)
- `row_count: number` Number of rows in the dataset
- `status: "pending" or "running" or "succeeded" or "failed"` Current processing status
  - `"pending"`
  - `"running"`
  - `"succeeded"`
  - `"failed"`

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/status \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "status": "pending"
}
```

## Download the processed dataset

**get** `/api/v1/datasets/{dataset_id}/download`

### Path Parameters

- `dataset_id: string`

### Query Parameters

- `fileFormat: optional "csv" or "json" or "jsonl" or "parquet"` Output file format. Defaults to the original upload format if omitted.
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/download \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
"Example data"
```

## Publish a dataset to an external platform

**post** `/api/v1/datasets/{dataset_id}/publish`

Publishes the processed dataset to Hugging Face or Kaggle. Currently returns 501 — not yet implemented.

### Path Parameters

- `dataset_id: string`

### Body Parameters

- `target: "huggingface" or "kaggle"` Destination platform for publishing the dataset
  - `"huggingface"`
  - `"kaggle"`
- `target_spec: optional map[unknown]` Target-specific configuration (e.g. repo name for Hugging Face, slug for Kaggle)

### Returns

- `publish_id: string` Unique identifier for the publish job
- `status: string` Status of the publish job
- `message: optional string` Additional information about the publish request

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/publish \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "target": "huggingface",
    "target_spec": {
      "repo_name": "bar",
      "private": "bar"
    }
  }'
```

#### Response

```json
{
  "publish_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "message"
}
```

## Start an augmentation run (or estimate cost)

**post** `/api/v1/datasets/{dataset_id}/run`

Validates column mapping and recipe configuration, reserves credits, and starts the augmentation pipeline. Set `estimate=true` to validate and get a cost quote without starting a run.

### Path Parameters

- `dataset_id: string`

### Body Parameters

- `brand_controls: optional object { hallucination_mitigation, length, safety_categories }` Brand and quality controls for generated completions (length, safety, hallucination grounding).
  - `hallucination_mitigation: optional boolean` Enable web-search grounding to reduce hallucinations in generated completions
  - `length: optional "minimal" or "concise" or "detailed" or "extensive"` Target response length. Controls verbosity of generated completions.
    - `"minimal"`
    - `"concise"`
    - `"detailed"`
    - `"extensive"`
  - `safety_categories: optional array of string` Content safety categories to enforce. Completions violating these are filtered.
- `column_mapping: optional object { prompt, chat, completion, context }` Column role assignments for augmentation. Required for real runs, optional for estimate-only requests.
  - `prompt: string` Column to use as the prompt/instruction field
  - `chat: optional string` Column containing chat/conversation data (alternative to prompt+completion)
  - `completion: optional string` Column to use as the completion/response field
  - `context: optional array of string` Columns to include as context
- `estimate: optional boolean` When true, validates the request and returns the estimated credit cost without starting a run.
- `job_specification: optional object { idempotency_key, max_rows }` Job execution parameters
  - `idempotency_key: optional string` Client-generated idempotency key for safe retries. If a launch with the same key already exists, the original response is returned.
  - `max_rows: optional number` Maximum number of rows to process in this run
- `recipe_specification: optional object { recipes, version }` Augmentation recipe configuration. Omitted recipes use backend defaults.
  - `recipes: optional object { deduplication, preference_pairs, prompt_metadata_injection, 2 more }` Augmentation recipe toggles. Omitted recipes use backend defaults.
    - `deduplication: optional boolean` Remove near-duplicate rows
    - `preference_pairs: optional boolean` Generate DPO-style preference pairs (chosen/rejected) instead of instruction completions
    - `prompt_metadata_injection: optional boolean` Inject context and constraints into prompts
    - `prompt_rephrase: optional boolean` Rephrase prompts for variety and clarity
    - `reasoning_traces: optional boolean` Add reasoning traces (chain-of-thought) to completions
  - `version: optional string` Recipe schema version. Allows recipe options to evolve across releases.

### Returns

- `estimate: boolean` Whether this was an estimate-only request (no run started)
- `estimatedCreditsConsumed: number` Estimated number of credits that will be consumed by this run
- `estimatedMinutes: number` Estimated processing time in minutes
- `run_id: optional string` Unique identifier for this pipeline run. Null for estimate-only requests.
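As a sketch of how the body parameters above fit together, here is a small Python helper (hypothetical, not part of any official SDK) that assembles an estimate-first request body. The column names and recipe choices are illustrative assumptions:

```python
# Illustrative sketch: assemble a request body for
# POST /api/v1/datasets/{dataset_id}/run. The column names and
# recipe toggles below are hypothetical examples, not required values.

def build_run_body(prompt_col, completion_col=None, estimate=False,
                   max_rows=None, recipes=None):
    """Build the JSON body for a run (or estimate-only) request."""
    body = {
        "estimate": estimate,
        # column_mapping is required for real runs, optional for estimates.
        "column_mapping": {"prompt": prompt_col},
    }
    if completion_col is not None:
        body["column_mapping"]["completion"] = completion_col
    if max_rows is not None:
        body["job_specification"] = {"max_rows": max_rows}
    if recipes is not None:
        # Any recipe left out falls back to backend defaults.
        body["recipe_specification"] = {"recipes": recipes}
    return body

# Estimate-only body: no run is started and run_id comes back null.
quote_body = build_run_body("question", "answer", estimate=True,
                            recipes={"deduplication": True})
```

A typical flow sends the estimate-only body first, checks `estimatedCreditsConsumed` in the response, then resends the same body with `estimate` set to false (plus an `idempotency_key` under `job_specification` for safe retries) to start the run.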
### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/run \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{}'
```

#### Response

```json
{
  "estimate": false,
  "estimatedCreditsConsumed": 0,
  "estimatedMinutes": 0,
  "run_id": "dataset-550e8400-e29b-41d4-a716-446655440000-1712234567890"
}
```

## Get evaluation results for a dataset

**get** `/api/v1/datasets/{dataset_id}/evaluation`

### Path Parameters

- `dataset_id: string`

### Returns

- `dataset_id: string` Dataset ID
- `quality: object { grade_after, grade_before, improvement_percent, 3 more }` Structured quality metrics. Null until evaluation completes.
  - `grade_after: string` Letter grade (A-E) after augmentation
  - `grade_before: string` Letter grade (A-E) before augmentation
  - `improvement_percent: number` Relative quality improvement as a percentage
  - `percentile_after: number` Percentile rank (0-100) after augmentation
  - `score_after: number` Quality score (0-10) after augmentation
  - `score_before: number` Quality score (0-10) before augmentation
- `raw_results: map[unknown]` Raw evaluation results payload for advanced use. Null until evaluation completes.
- `status: string` Evaluation pipeline status: pending | running | succeeded | failed | skipped

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/evaluation \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "quality": {
    "grade_after": "A",
    "grade_before": "C",
    "improvement_percent": 37.1,
    "percentile_after": 92.3,
    "score_after": 8.5,
    "score_before": 6.2
  },
  "raw_results": {
    "foo": "bar"
  },
  "status": "succeeded"
}
```

## Domain Types

### Dataset

- `Dataset = object { configured_column_mapping, created_at, dataset_id, 8 more }`
  - `configured_column_mapping: object { chat, completion, context, prompt }` User-configured column mapping. Null if not yet configured.
    - `chat: string`
    - `completion: string`
    - `context: array of string`
    - `prompt: string`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: object { message }` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: object { grade_after, grade_before, improvement_percent, 2 more }` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string` Letter grade (A-E) after augmentation
    - `grade_before: string` Letter grade (A-E) before augmentation
    - `improvement_percent: number` Relative improvement percentage
    - `score_after: number` Quality score after augmentation
    - `score_before: number` Quality score before augmentation
  - `name: string` Human-readable name for the dataset
  - `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
    - `percent: number` Progress percentage (0-100)
    - `processed_rows: number` Number of rows processed so far
    - `total_rows: number` Total rows to process (samples_to_process or row_count)
  - `row_count: number` Total number of rows in the dataset
  - `run_id: string` ID of the currently active run
  - `status: "pending" or "running" or "succeeded" or "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

# Upload

## Initiate a dataset upload

**post** `/api/v1/datasets/upload/initiate`

### Body Parameters

- `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the file being uploaded
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `name: string` Human-readable name for the dataset

### Returns

- `upload_url: string` Pre-signed S3 URL — upload the file directly to this URL via HTTP PUT

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/upload/initiate \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_format": "csv",
    "name": "my-training-data"
  }'
```

#### Response

```json
{
  "upload_url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
}
```

## Complete a dataset upload and trigger processing

**post** `/api/v1/datasets/upload/complete`

### Body Parameters

- `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the uploaded file
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `file_size_bytes: number` Size of the uploaded file in bytes
- `name: string` Human-readable name for the dataset
- `s3_key: string` S3 object key returned in the pre-signed URL response from `/upload/initiate`

### Returns

- `dataset_id: string` ID of the newly created dataset

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/upload/complete \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_format": "csv",
    "file_size_bytes": 1048576,
    "name": "my-training-data",
    "s3_key": "uploads/550e8400-e29b-41d4-a716-446655440000/my-training-data.csv"
  }'
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Complete a file upload and trigger processing

**post** `/api/v1/datasets/{dataset_id}/upload/complete`

File uploads only. Call after uploading bytes to the pre-signed URL from POST `/datasets`. Verifies the file exists in S3, then triggers the preprocessing pipeline.
### Path Parameters

- `dataset_id: string`

### Body Parameters

- `file_size_bytes: number` Size of the uploaded file in bytes (for verification)
- `sha256: optional string` SHA-256 hex digest of the uploaded file (for integrity verification)

### Returns

- `dataset_id: string` ID of the dataset
- `status: string` Current status of the dataset after completing upload

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/upload/complete \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_size_bytes": 1048576,
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
  }'
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}
```
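The file-upload endpoints above form one flow: create a `"file"`-source dataset, PUT the bytes to the pre-signed URL, then complete. The following is a minimal, stdlib-only Python sketch of that flow, with minimal error handling; the helper and function names are illustrative, not part of an official client:

```python
# Illustrative end-to-end sketch of the file-upload flow:
# POST /datasets -> PUT bytes to the pre-signed URL -> POST
# /datasets/{id}/upload/complete. Uses only the standard library.
import hashlib
import json
import urllib.request

BASE = "https://api.adaptionlabs.ai/api/v1"

def _call(method, url, body=None, api_key=None, raw=None):
    """Small urllib wrapper for JSON or raw-byte requests."""
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    data = raw
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode()
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
    return json.loads(payload) if payload else None

def build_complete_body(data: bytes) -> dict:
    """Body for POST /datasets/{id}/upload/complete: size plus digest."""
    return {
        "file_size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def upload_dataset(path, name, file_format, api_key):
    # 1. Create a "file"-source dataset; the response carries
    #    upload_instructions with a pre-signed S3 URL.
    created = _call("POST", f"{BASE}/datasets", body={
        "source": {"type": "file", "name": name, "file_format": file_format},
    }, api_key=api_key)
    instructions = created["upload_instructions"]

    # 2. PUT the raw bytes to S3. No Authorization header here: the
    #    signature is embedded in the pre-signed URL itself.
    with open(path, "rb") as f:
        data = f.read()
    _call(instructions["method"], instructions["url"], raw=data)

    # 3. Report completion; the API verifies the object in S3 and
    #    starts the preprocessing pipeline.
    return _call("POST",
                 f"{BASE}/datasets/{created['dataset_id']}/upload/complete",
                 body=build_complete_body(data), api_key=api_key)
```

Sending the optional `sha256` lets the backend detect a corrupted or truncated upload before processing begins, at the cost of one local pass over the file.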