# Datasets

## Create a dataset from file upload, Hugging Face, or Kaggle

**post** `/api/v1/datasets`

Unified ingest endpoint, discriminated by `source.type`: `"file"` returns upload instructions for a pre-signed S3 PUT; `"huggingface"` and `"kaggle"` start an async import.

### Body Parameters

- `source: object { file_format, name, type } or object { files, type, url } or object { files, type, url }` Dataset source configuration. Discriminated by `type`: file, huggingface, or kaggle.
  - `FileSourceDto = object { file_format, name, type }`
    - `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the file being uploaded
      - `"csv"`
      - `"json"`
      - `"jsonl"`
      - `"parquet"`
    - `name: string` Human-readable name for the dataset
    - `type: "file"` Source type
      - `"file"`
  - `HuggingfaceSourceDto = object { files, type, url }`
    - `files: array of string` File paths to download from the repository
    - `type: "huggingface"` Source type
      - `"huggingface"`
    - `url: string` Hugging Face dataset repository URL
  - `KaggleSourceDto = object { files, type, url }`
    - `files: array of string` File paths to download from the dataset
    - `type: "kaggle"` Source type
      - `"kaggle"`
    - `url: string` Kaggle dataset URL

### Returns

- `dataset_id: string` ID of the newly created dataset
- `status: string` Current dataset status
- `upload_instructions: optional object { method, s3_key, url }` Upload instructions for file sources. PUT your file to the provided URL.
  - `method: string` HTTP method to use
  - `s3_key: string` S3 object key — pass this back in the complete request if needed for verification
  - `url: string` Pre-signed URL for uploading the file

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "source": {
      "file_format": "csv",
      "name": "my-training-data",
      "type": "file"
    }
  }'
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "status": "status",
  "upload_instructions": {
    "method": "PUT",
    "s3_key": "s3_key",
    "url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
  }
}
```

## Get a dataset by ID

**get** `/api/v1/datasets/{dataset_id}`

### Path Parameters

- `dataset_id: string`

### Returns

- `Dataset = object { configured_column_mapping, created_at, dataset_id, 8 more }`
  - `configured_column_mapping: object { chat, completion, context, prompt }` User-configured column mapping. Null if not yet configured.
    - `chat: string`
    - `completion: string`
    - `context: array of string`
    - `prompt: string`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: object { message }` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: object { grade_after, grade_before, improvement_percent, 2 more }` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string` Letter grade (A-E) after augmentation
    - `grade_before: string` Letter grade (A-E) before augmentation
    - `improvement_percent: number` Relative improvement percentage
    - `score_after: number` Quality score after augmentation
    - `score_before: number` Quality score before augmentation
  - `name: string` Human-readable name for the dataset
  - `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
    - `percent: number` Progress percentage (0-100)
    - `processed_rows: number` Number of rows processed so far
    - `total_rows: number` Total rows to process (samples_to_process or row_count)
  - `row_count: number` Total number of rows in the dataset
  - `run_id: string` ID of the currently active run
  - `status: "pending" or "running" or "succeeded" or "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "configured_column_mapping": {
    "chat": "chat",
    "completion": "completion",
    "context": [
      "string"
    ],
    "prompt": "prompt"
  },
  "created_at": "2019-12-27T18:11:19.117Z",
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "evaluation_summary": {
    "grade_after": "grade_after",
    "grade_before": "grade_before",
    "improvement_percent": 0,
    "score_after": 0,
    "score_before": 0
  },
  "name": "name",
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "run_id": "run_id",
  "status": "pending",
  "updated_at": "2019-12-27T18:11:19.117Z"
}
```

## List datasets

**get** `/api/v1/datasets`

### Query Parameters

- `created_after: optional string` ISO 8601 datetime — datasets created after this time.
- `created_before: optional string` ISO 8601 datetime — datasets created before this time.
- `cursor: optional string` Cursor from the previous response's `next_cursor` field.
- `limit: optional number` Number of results (max 100, default 20). Used with cursor pagination.
- `q: optional string` Search by dataset name (case-insensitive contains).
- `sort: optional string` Sort field: created_at | updated_at | name (default: created_at).
- `sort_direction: optional string` Sort direction: asc | desc (default: desc).
- `status: optional string` Filter by status: pending | running | succeeded | failed

### Returns

- `datasets: array of object { created_at, dataset_id, status, 4 more }` Page of datasets
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Dataset ID
  - `status: "pending" or "running" or "succeeded" or "failed"` Dataset status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Last updated timestamp
  - `description: optional string` Auto-generated description of the dataset contents
  - `name: optional string` Dataset name
  - `row_count: optional number` Total number of rows
- `next_cursor: optional string` Cursor for the next page. Null when no more results.

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "datasets": [
    {
      "created_at": "2019-12-27T18:11:19.117Z",
      "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "pending",
      "updated_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "name": "My training data",
      "row_count": 1000
    }
  ],
  "next_cursor": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Get the processing status of a dataset

**get** `/api/v1/datasets/{dataset_id}/status`

### Path Parameters

- `dataset_id: string`

### Returns

- `dataset_id: string` Dataset ID
- `error: object { message }` Error details if the dataset failed. Null otherwise.
  - `message: string` Error message
- `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
  - `percent: number` Progress percentage (0-100)
  - `processed_rows: number` Number of rows processed so far
  - `total_rows: number` Total rows to process (samples_to_process or row_count)
- `row_count: number` Number of rows in the dataset
- `status: "pending" or "running" or "succeeded" or "failed"` Current processing status
  - `"pending"`
  - `"running"`
  - `"succeeded"`
  - `"failed"`

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/status \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "status": "pending"
}
```

## Download the processed dataset

**get** `/api/v1/datasets/{dataset_id}/download`

### Path Parameters

- `dataset_id: string`

### Query Parameters

- `fileFormat: optional "csv" or "json" or "jsonl" or "parquet"` Output file format. Defaults to the original upload format if omitted.
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/download \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
"Example data"
```

## Publish a dataset to an external platform

**post** `/api/v1/datasets/{dataset_id}/publish`

Publishes the processed dataset to Hugging Face or Kaggle. Currently returns 501 — not yet implemented.

### Path Parameters

- `dataset_id: string`

### Body Parameters

- `target: "huggingface" or "kaggle"` Destination platform for publishing the dataset
  - `"huggingface"`
  - `"kaggle"`
- `target_spec: optional map[unknown]` Target-specific configuration (e.g. repo name for Hugging Face, slug for Kaggle)

### Returns

- `publish_id: string` Unique identifier for the publish job
- `status: string` Status of the publish job
- `message: optional string` Additional information about the publish request

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/publish \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "target": "huggingface",
    "target_spec": {
      "repo_name": "bar",
      "private": "bar"
    }
  }'
```

#### Response

```json
{
  "publish_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "message"
}
```

## Start an augmentation run (or estimate cost)

**post** `/api/v1/datasets/{dataset_id}/run`

Validates column mapping and recipe configuration, reserves credits, and starts the augmentation pipeline. Set `estimate=true` to validate and get a cost quote without starting a run.

### Path Parameters

- `dataset_id: string`

### Body Parameters

- `brand_controls: optional object { hallucination_mitigation, length, safety_categories }` Brand and quality controls for generated completions (length, safety, hallucination grounding).
  - `hallucination_mitigation: optional boolean` Enable web-search grounding to reduce hallucinations in generated completions
  - `length: optional "minimal" or "concise" or "detailed" or "extensive"` Target response length. Controls verbosity of generated completions.
    - `"minimal"`
    - `"concise"`
    - `"detailed"`
    - `"extensive"`
  - `safety_categories: optional array of string` Content safety categories to enforce. Completions violating these are filtered.
- `column_mapping: optional object { prompt, chat, completion, context }` Column role assignments for augmentation. Required for real runs, optional for estimate-only requests.
  - `prompt: string` Column to use as the prompt/instruction field
  - `chat: optional string` Column containing chat/conversation data (alternative to prompt+completion)
  - `completion: optional string` Column to use as the completion/response field
  - `context: optional array of string` Columns to include as context
- `estimate: optional boolean` When true, validates the request and returns the estimated credit cost without starting a run.
- `job_specification: optional object { idempotency_key, max_rows }` Job execution parameters
  - `idempotency_key: optional string` Client-generated idempotency key for safe retries. If a launch with the same key already exists, the original response is returned.
  - `max_rows: optional number` Maximum number of rows to process in this run
- `recipe_specification: optional object { recipes, version }` Augmentation recipe configuration. Omitted recipes use backend defaults.
  - `recipes: optional object { deduplication, preference_pairs, prompt_metadata_injection, 2 more }` Augmentation recipe toggles. Omitted recipes use backend defaults.
    - `deduplication: optional boolean` Remove near-duplicate rows
    - `preference_pairs: optional boolean` Generate DPO-style preference pairs (chosen/rejected) instead of instruction completions
    - `prompt_metadata_injection: optional boolean` Inject context and constraints into prompts
    - `prompt_rephrase: optional boolean` Rephrase prompts for variety and clarity
    - `reasoning_traces: optional boolean` Add reasoning traces (chain-of-thought) to completions
  - `version: optional string` Recipe schema version. Allows recipe options to evolve across releases.

### Returns

- `estimate: boolean` Whether this was an estimate-only request (no run started)
- `estimatedCreditsConsumed: number` Estimated number of credits that will be consumed by this run
- `estimatedMinutes: number` Estimated processing time in minutes
- `run_id: optional string` Unique identifier for this pipeline run. Null for estimate-only requests.
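As a sketch of how the body parameters above fit together, here is a small Python helper (hypothetical, not part of any official SDK) that assembles an estimate-first request body. The column names and recipe choices are illustrative assumptions:

```python
# Illustrative sketch: assemble a request body for
# POST /api/v1/datasets/{dataset_id}/run. The column names and
# recipe toggles below are hypothetical examples, not required values.

def build_run_body(prompt_col, completion_col=None, estimate=False,
                   max_rows=None, recipes=None):
    """Build the JSON body for a run (or estimate-only) request."""
    body = {
        "estimate": estimate,
        # column_mapping is required for real runs, optional for estimates.
        "column_mapping": {"prompt": prompt_col},
    }
    if completion_col is not None:
        body["column_mapping"]["completion"] = completion_col
    if max_rows is not None:
        body["job_specification"] = {"max_rows": max_rows}
    if recipes is not None:
        # Any recipe left out falls back to backend defaults.
        body["recipe_specification"] = {"recipes": recipes}
    return body

# Estimate-only body: no run is started and run_id comes back null.
quote_body = build_run_body("question", "answer", estimate=True,
                            recipes={"deduplication": True})
```

A typical flow sends the estimate-only body first, checks `estimatedCreditsConsumed` in the response, then resends the same body with `estimate` set to false (plus an `idempotency_key` under `job_specification` for safe retries) to start the run.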
### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/run \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{}'
```

#### Response

```json
{
  "estimate": false,
  "estimatedCreditsConsumed": 0,
  "estimatedMinutes": 0,
  "run_id": "dataset-550e8400-e29b-41d4-a716-446655440000-1712234567890"
}
```

## Get evaluation results for a dataset

**get** `/api/v1/datasets/{dataset_id}/evaluation`

### Path Parameters

- `dataset_id: string`

### Returns

- `dataset_id: string` Dataset ID
- `quality: object { grade_after, grade_before, improvement_percent, 3 more }` Structured quality metrics. Null until evaluation completes.
  - `grade_after: string` Letter grade (A-E) after augmentation
  - `grade_before: string` Letter grade (A-E) before augmentation
  - `improvement_percent: number` Relative quality improvement as a percentage
  - `percentile_after: number` Percentile rank (0-100) after augmentation
  - `score_after: number` Quality score (0-10) after augmentation
  - `score_before: number` Quality score (0-10) before augmentation
- `raw_results: map[unknown]` Raw evaluation results payload for advanced use. Null until evaluation completes.
- `status: string` Evaluation pipeline status: pending | running | succeeded | failed | skipped

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/evaluation \
  -H "Authorization: Bearer $ADAPTION_API_KEY"
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "quality": {
    "grade_after": "A",
    "grade_before": "C",
    "improvement_percent": 37.1,
    "percentile_after": 92.3,
    "score_after": 8.5,
    "score_before": 6.2
  },
  "raw_results": {
    "foo": "bar"
  },
  "status": "succeeded"
}
```

## Domain Types

### Dataset

- `Dataset = object { configured_column_mapping, created_at, dataset_id, 8 more }`
  - `configured_column_mapping: object { chat, completion, context, prompt }` User-configured column mapping. Null if not yet configured.
    - `chat: string`
    - `completion: string`
    - `context: array of string`
    - `prompt: string`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: object { message }` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: object { grade_after, grade_before, improvement_percent, 2 more }` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string` Letter grade (A-E) after augmentation
    - `grade_before: string` Letter grade (A-E) before augmentation
    - `improvement_percent: number` Relative improvement percentage
    - `score_after: number` Quality score after augmentation
    - `score_before: number` Quality score before augmentation
  - `name: string` Human-readable name for the dataset
  - `progress: object { percent, processed_rows, total_rows }` Processing progress. Null when no run is active.
    - `percent: number` Progress percentage (0-100)
    - `processed_rows: number` Number of rows processed so far
    - `total_rows: number` Total rows to process (samples_to_process or row_count)
  - `row_count: number` Total number of rows in the dataset
  - `run_id: string` ID of the currently active run
  - `status: "pending" or "running" or "succeeded" or "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

# Upload

## Initiate a dataset upload

**post** `/api/v1/datasets/upload/initiate`

### Body Parameters

- `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the file being uploaded
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `name: string` Human-readable name for the dataset

### Returns

- `upload_url: string` Pre-signed S3 URL — upload the file directly to this URL via HTTP PUT

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/upload/initiate \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_format": "csv",
    "name": "my-training-data"
  }'
```

#### Response

```json
{
  "upload_url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
}
```

## Complete a dataset upload and trigger processing

**post** `/api/v1/datasets/upload/complete`

### Body Parameters

- `file_format: "csv" or "json" or "jsonl" or "parquet"` Format of the uploaded file
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `file_size_bytes: number` Size of the uploaded file in bytes
- `name: string` Human-readable name for the dataset
- `s3_key: string` S3 object key returned in the pre-signed URL response from `/upload/initiate`

### Returns

- `dataset_id: string` ID of the newly created dataset

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/upload/complete \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_format": "csv",
    "file_size_bytes": 1048576,
    "name": "my-training-data",
    "s3_key": "uploads/550e8400-e29b-41d4-a716-446655440000/my-training-data.csv"
  }'
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Complete a file upload and trigger processing

**post** `/api/v1/datasets/{dataset_id}/upload/complete`

File uploads only. Call after uploading bytes to the pre-signed URL from POST `/datasets`. Verifies the file exists in S3, then triggers the preprocessing pipeline.
### Path Parameters

- `dataset_id: string`

### Body Parameters

- `file_size_bytes: number` Size of the uploaded file in bytes (for verification)
- `sha256: optional string` SHA-256 hex digest of the uploaded file (for integrity verification)

### Returns

- `dataset_id: string` ID of the dataset
- `status: string` Current status of the dataset after completing upload

### Example

```http
curl https://api.adaptionlabs.ai/api/v1/datasets/$DATASET_ID/upload/complete \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ADAPTION_API_KEY" \
  -d '{
    "file_size_bytes": 1048576,
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
  }'
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}
```
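The file-upload endpoints above form one flow: create a `"file"`-source dataset, PUT the bytes to the pre-signed URL, then complete. The following is a minimal, stdlib-only Python sketch of that flow, with minimal error handling; the helper and function names are illustrative, not part of an official client:

```python
# Illustrative end-to-end sketch of the file-upload flow:
# POST /datasets -> PUT bytes to the pre-signed URL -> POST
# /datasets/{id}/upload/complete. Uses only the standard library.
import hashlib
import json
import urllib.request

BASE = "https://api.adaptionlabs.ai/api/v1"

def _call(method, url, body=None, api_key=None, raw=None):
    """Small urllib wrapper for JSON or raw-byte requests."""
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    data = raw
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode()
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
    return json.loads(payload) if payload else None

def build_complete_body(data: bytes) -> dict:
    """Body for POST /datasets/{id}/upload/complete: size plus digest."""
    return {
        "file_size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def upload_dataset(path, name, file_format, api_key):
    # 1. Create a "file"-source dataset; the response carries
    #    upload_instructions with a pre-signed S3 URL.
    created = _call("POST", f"{BASE}/datasets", body={
        "source": {"type": "file", "name": name, "file_format": file_format},
    }, api_key=api_key)
    instructions = created["upload_instructions"]

    # 2. PUT the raw bytes to S3. No Authorization header here: the
    #    signature is embedded in the pre-signed URL itself.
    with open(path, "rb") as f:
        data = f.read()
    _call(instructions["method"], instructions["url"], raw=data)

    # 3. Report completion; the API verifies the object in S3 and
    #    starts the preprocessing pipeline.
    return _call("POST",
                 f"{BASE}/datasets/{created['dataset_id']}/upload/complete",
                 body=build_complete_body(data), api_key=api_key)
```

Sending the optional `sha256` lets the backend detect a corrupted or truncated upload before processing begins, at the cost of one local pass over the file.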