# Datasets

## Create a dataset from file upload, HuggingFace, or Kaggle

`datasets.create(**kwargs: DatasetCreateParams) -> DatasetCreateResponse`

**post** `/api/v1/datasets`

Unified ingest endpoint, discriminated by `source.type`: `"file"` returns upload instructions for a presigned S3 PUT; `"huggingface"` and `"kaggle"` start an async import.

### Parameters

- `source: Source` Dataset source configuration. Discriminated by `type`: file, huggingface, or kaggle.
  - `class SourceFileSourceDto: …`
    - `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the file being uploaded
      - `"csv"`
      - `"json"`
      - `"jsonl"`
      - `"parquet"`
    - `name: str` Human-readable name for the dataset
    - `type: Literal["file"]` Source type
      - `"file"`
  - `class SourceHuggingfaceSourceDto: …`
    - `files: SequenceNotStr[str]` File paths to download from the repository
    - `type: Literal["huggingface"]` Source type
      - `"huggingface"`
    - `url: str` HuggingFace dataset repository URL
  - `class SourceKaggleSourceDto: …`
    - `files: SequenceNotStr[str]` File paths to download from the dataset
    - `type: Literal["kaggle"]` Source type
      - `"kaggle"`
    - `url: str` Kaggle dataset URL

### Returns

- `class DatasetCreateResponse: …`
  - `dataset_id: str` ID of the newly created dataset
  - `status: str` Current dataset status
  - `upload_instructions: Optional[UploadInstructions]` Upload instructions for file sources. PUT your file to the provided URL.
    - `method: str` HTTP method to use
    - `s3_key: str` S3 object key — pass this back in the complete request if needed for verification
    - `url: str` Pre-signed URL for uploading the file

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
dataset = client.datasets.create(
    source={
        "file_format": "csv",
        "name": "my-training-data",
        "type": "file",
    },
)
print(dataset.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "status": "status",
  "upload_instructions": {
    "method": "PUT",
    "s3_key": "s3_key",
    "url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
  }
}
```

## Get a dataset by ID

`datasets.get(dataset_id: str) -> Dataset`

**get** `/api/v1/datasets/{dataset_id}`

Get a dataset by ID

### Parameters

- `dataset_id: str`

### Returns

- `class Dataset: …`
  - `configured_column_mapping: Optional[ConfiguredColumnMapping]` User-configured column mapping. Null if not yet configured.
    - `chat: Optional[str]`
    - `completion: Optional[str]`
    - `context: List[str]`
    - `prompt: Optional[str]`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Unique dataset identifier
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `evaluation_summary: Optional[EvaluationSummary]` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative improvement percentage
    - `score_after: Optional[float]` Quality score after augmentation
    - `score_before: Optional[float]` Quality score before augmentation
  - `name: Optional[str]` Human-readable name for the dataset
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Total number of rows in the dataset
  - `run_id: Optional[str]` ID of the currently active run
  - `status: Literal["pending", "running", "succeeded", "failed"]` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Timestamp of the last update

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
dataset = client.datasets.get(
    "dataset_id",
)
print(dataset.dataset_id)
```

#### Response

```json
{
  "configured_column_mapping": {
    "chat": "chat",
    "completion": "completion",
    "context": ["string"],
    "prompt": "prompt"
  },
  "created_at": "2019-12-27T18:11:19.117Z",
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "evaluation_summary": {
    "grade_after": "grade_after",
    "grade_before": "grade_before",
    "improvement_percent": 0,
    "score_after": 0,
    "score_before": 0
  },
  "name": "name",
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "run_id": "run_id",
  "status": "pending",
  "updated_at": "2019-12-27T18:11:19.117Z"
}
```

## List datasets

`datasets.list(**kwargs: DatasetListParams) -> SyncCursor[DatasetListResponse]`

**get** `/api/v1/datasets`

List datasets

### Parameters

- `created_after: Optional[str]` ISO 8601 datetime — datasets created after this time.
- `created_before: Optional[str]` ISO 8601 datetime — datasets created before this time.
- `cursor: Optional[str]` Cursor from the previous response's `next_cursor` field.
- `limit: Optional[float]` Number of results (max 100, default 20). Used with cursor pagination.
- `q: Optional[str]` Search by dataset name (case-insensitive contains).
- `sort: Optional[str]` Sort field: created_at | updated_at | name (default: created_at).
- `sort_direction: Optional[str]` Sort direction: asc | desc (default: desc).
- `status: Optional[str]` Filter by status: pending | running | succeeded | failed

### Returns

- `class DatasetListResponse: …`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Dataset ID
  - `status: Literal["pending", "running", "succeeded", "failed"]` Dataset status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Last updated timestamp
  - `description: Optional[str]` Auto-generated description of the dataset contents
  - `name: Optional[str]` Dataset name
  - `row_count: Optional[int]` Total number of rows

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
page = client.datasets.list()
dataset = page.datasets[0]
print(dataset.dataset_id)
```

#### Response

```json
{
  "datasets": [
    {
      "created_at": "2019-12-27T18:11:19.117Z",
      "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "pending",
      "updated_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "name": "My training data",
      "row_count": 1000
    }
  ],
  "next_cursor": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Get the processing status of a dataset

`datasets.get_status(dataset_id: str) -> DatasetGetStatusResponse`

**get** `/api/v1/datasets/{dataset_id}/status`

Get the processing status of a dataset

### Parameters

- `dataset_id: str`

### Returns

- `class DatasetGetStatusResponse: …`
  - `dataset_id: str` Dataset ID
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Number of rows in the dataset
  - `status: Literal["pending", "running", "succeeded", "failed"]` Current processing status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.get_status(
    "dataset_id",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "status": "pending"
}
```

## Download the processed dataset

`datasets.download(dataset_id: str, **kwargs: DatasetDownloadParams) -> DatasetDownloadResponse`

**get** `/api/v1/datasets/{dataset_id}/download`

Download the processed dataset

### Parameters

- `dataset_id: str`
- `file_format: Optional[Literal["csv", "json", "jsonl", "parquet"]]` Output file format. Defaults to the original upload format if omitted.
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`

### Returns

- `object`

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.download(
    dataset_id="dataset_id",
)
print(response)
```

#### Response

```json
"Example data"
```

## Publish a dataset to an external platform

`datasets.publish(dataset_id: str, **kwargs: DatasetPublishParams) -> DatasetPublishResponse`

**post** `/api/v1/datasets/{dataset_id}/publish`

Publishes the processed dataset to Hugging Face or Kaggle. Currently returns 501 — not yet implemented.
### Parameters

- `dataset_id: str`
- `target: Literal["huggingface", "kaggle"]` Destination platform for publishing the dataset
  - `"huggingface"`
  - `"kaggle"`
- `target_spec: Optional[Dict[str, object]]` Target-specific configuration (e.g. repo name for HuggingFace, slug for Kaggle)

### Returns

- `class DatasetPublishResponse: …`
  - `publish_id: str` Unique identifier for the publish job
  - `status: str` Status of the publish job
  - `message: Optional[str]` Additional information about the publish request

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.publish(
    dataset_id="dataset_id",
    target="huggingface",
)
print(response.publish_id)
```

#### Response

```json
{
  "publish_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "message"
}
```

## Start an augmentation run (or estimate cost)

`datasets.run(dataset_id: str, **kwargs: DatasetRunParams) -> DatasetRunResponse`

**post** `/api/v1/datasets/{dataset_id}/run`

Validates column mapping and recipe configuration, reserves credits, and starts the augmentation pipeline. Set `estimate=true` to validate and get a cost quote without starting a run.

### Parameters

- `dataset_id: str`
- `brand_controls: Optional[BrandControls]` Brand and quality controls for generated completions (length, safety, hallucination grounding).
  - `hallucination_mitigation: Optional[bool]` Enable web-search grounding to reduce hallucinations in generated completions
  - `length: Optional[Literal["minimal", "concise", "detailed", "extensive"]]` Target response length. Controls verbosity of generated completions.
    - `"minimal"`
    - `"concise"`
    - `"detailed"`
    - `"extensive"`
  - `safety_categories: Optional[SequenceNotStr[str]]` Content safety categories to enforce. Completions violating these are filtered.
- `column_mapping: Optional[ColumnMapping]` Column role assignments for augmentation.
  Required for real runs, optional for estimate-only requests.

  - `prompt: str` Column to use as the prompt/instruction field
  - `chat: Optional[str]` Column containing chat/conversation data (alternative to prompt+completion)
  - `completion: Optional[str]` Column to use as the completion/response field
  - `context: Optional[SequenceNotStr[str]]` Columns to include as context
- `estimate: Optional[bool]` When true, validates the request and returns the estimated credit cost without starting a run.
- `job_specification: Optional[JobSpecification]` Job execution parameters
  - `idempotency_key: Optional[str]` Client-generated idempotency key for safe retries. If a launch with the same key already exists, the original response is returned.
  - `max_rows: Optional[float]` Maximum number of rows to process in this run
- `recipe_specification: Optional[RecipeSpecification]` Augmentation recipe configuration. Omitted recipes use backend defaults.
  - `recipes: Optional[RecipeSpecificationRecipes]` Augmentation recipe toggles. Omitted recipes use backend defaults.
    - `deduplication: Optional[bool]` Remove near-duplicate rows
    - `preference_pairs: Optional[bool]` Generate DPO-style preference pairs (chosen/rejected) instead of instruction completions
    - `prompt_metadata_injection: Optional[bool]` Inject context and constraints into prompts
    - `prompt_rephrase: Optional[bool]` Rephrase prompts for variety and clarity
    - `reasoning_traces: Optional[bool]` Add reasoning traces (chain-of-thought) to completions
  - `version: Optional[str]` Recipe schema version. Allows recipe options to evolve across releases.

### Returns

- `class DatasetRunResponse: …`
  - `estimate: bool` Whether this was an estimate-only request (no run started)
  - `estimated_credits_consumed: float` Estimated number of credits that will be consumed by this run
  - `estimated_minutes: float` Estimated processing time in minutes
  - `run_id: Optional[str]` Unique identifier for this pipeline run. Null for estimate-only requests.
### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.run(
    dataset_id="dataset_id",
)
print(response.run_id)
```

#### Response

```json
{
  "estimate": false,
  "estimated_credits_consumed": 0,
  "estimated_minutes": 0,
  "run_id": "dataset-550e8400-e29b-41d4-a716-446655440000-1712234567890"
}
```

## Get evaluation results for a dataset

`datasets.get_evaluation(dataset_id: str) -> DatasetGetEvaluationResponse`

**get** `/api/v1/datasets/{dataset_id}/evaluation`

Get evaluation results for a dataset

### Parameters

- `dataset_id: str`

### Returns

- `class DatasetGetEvaluationResponse: …`
  - `dataset_id: str` Dataset ID
  - `quality: Optional[Quality]` Structured quality metrics. Null until evaluation completes.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative quality improvement as a percentage
    - `percentile_after: Optional[float]` Percentile rank (0-100) after augmentation
    - `score_after: Optional[float]` Quality score (0-10) after augmentation
    - `score_before: Optional[float]` Quality score (0-10) before augmentation
  - `raw_results: Optional[Dict[str, object]]` Raw evaluation results payload for advanced use. Null until evaluation completes.
  - `status: Optional[str]` Evaluation pipeline status: pending | running | succeeded | failed | skipped

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.get_evaluation(
    "dataset_id",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "quality": {
    "grade_after": "A",
    "grade_before": "C",
    "improvement_percent": 37.1,
    "percentile_after": 92.3,
    "score_after": 8.5,
    "score_before": 6.2
  },
  "raw_results": {
    "foo": "bar"
  },
  "status": "succeeded"
}
```

## Domain Types

### Dataset

- `class Dataset: …`
  - `configured_column_mapping: Optional[ConfiguredColumnMapping]` User-configured column mapping. Null if not yet configured.
    - `chat: Optional[str]`
    - `completion: Optional[str]`
    - `context: List[str]`
    - `prompt: Optional[str]`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Unique dataset identifier
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `evaluation_summary: Optional[EvaluationSummary]` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative improvement percentage
    - `score_after: Optional[float]` Quality score after augmentation
    - `score_before: Optional[float]` Quality score before augmentation
  - `name: Optional[str]` Human-readable name for the dataset
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Total number of rows in the dataset
  - `run_id: Optional[str]` ID of the currently active run
  - `status: Literal["pending", "running", "succeeded", "failed"]` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Timestamp of the last update

# Upload

## Initiate a dataset upload

`datasets.upload.initiate(**kwargs: UploadInitiateParams) -> UploadInitiateResponse`

**post** `/api/v1/datasets/upload/initiate`

Initiate a dataset upload

### Parameters

- `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the file being uploaded
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `name: str` Human-readable name for the dataset

### Returns

- `class UploadInitiateResponse: …`
  - `upload_url: str` Pre-signed S3 URL — upload the file directly to this URL via HTTP PUT

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.initiate(
    file_format="csv",
    name="my-training-data",
)
print(response.upload_url)
```

#### Response

```json
{
  "upload_url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
}
```

## Complete a dataset upload and trigger processing

`datasets.upload.complete(**kwargs: UploadCompleteParams) -> UploadCompleteResponse`

**post** `/api/v1/datasets/upload/complete`

Complete a dataset upload and trigger processing

### Parameters

- `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the uploaded file
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `file_size_bytes: float` Size of the uploaded file in bytes
- `name: str` Human-readable name for the dataset
- `s3_key: str` S3 object key returned in the pre-signed URL response from /upload/initiate

### Returns

- `class UploadCompleteResponse: …`
  - `dataset_id: str` ID of the newly created dataset

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.complete(
    file_format="csv",
    file_size_bytes=1048576,
    name="my-training-data",
    s3_key="uploads/550e8400-e29b-41d4-a716-446655440000/my-training-data.csv",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Complete a file upload and trigger processing

`datasets.upload.complete_by_id(dataset_id: str, **kwargs: UploadCompleteByIDParams) -> UploadCompleteByIDResponse`

**post** `/api/v1/datasets/{dataset_id}/upload/complete`

File uploads only. Call after uploading bytes to the presigned URL from POST /datasets. Verifies the file exists in S3, then triggers the preprocessing pipeline.
### Parameters

- `dataset_id: str`
- `file_size_bytes: float` Size of the uploaded file in bytes (for verification)
- `sha256: Optional[str]` SHA-256 hex digest of the uploaded file (for integrity verification)

### Returns

- `class UploadCompleteByIDResponse: …`
  - `dataset_id: str` ID of the dataset
  - `status: str` Current status of the dataset after completing upload

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.complete_by_id(
    dataset_id="dataset_id",
    file_size_bytes=1048576,
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}
```