# Datasets

## Create a dataset from file upload, HuggingFace, or Kaggle

`client.datasets.create(body: DatasetCreateParams, options?: RequestOptions): DatasetCreateResponse`

**post** `/api/v1/datasets`

Unified ingest endpoint, discriminated by `source.type`: `"file"` returns upload instructions for a presigned S3 PUT; `"huggingface"` and `"kaggle"` start an async import.

### Parameters

- `body: DatasetCreateParams`
  - `source: FileSourceDto | HuggingfaceSourceDto | KaggleSourceDto` Dataset source configuration. Discriminated by `type`: file, huggingface, or kaggle.
    - `FileSourceDto`
      - `file_format: "csv" | "json" | "jsonl" | "parquet"` Format of the file being uploaded
        - `"csv"`
        - `"json"`
        - `"jsonl"`
        - `"parquet"`
      - `name: string` Human-readable name for the dataset
      - `type: "file"` Source type
        - `"file"`
    - `HuggingfaceSourceDto`
      - `files: Array<string>` File paths to download from the repository
      - `type: "huggingface"` Source type
        - `"huggingface"`
      - `url: string` HuggingFace dataset repository URL
    - `KaggleSourceDto`
      - `files: Array<string>` File paths to download from the dataset
      - `type: "kaggle"` Source type
        - `"kaggle"`
      - `url: string` Kaggle dataset URL

### Returns

- `DatasetCreateResponse`
  - `dataset_id: string` ID of the newly created dataset
  - `status: string` Current dataset status
  - `upload_instructions?: UploadInstructions` Upload instructions for file sources. PUT your file to the provided URL.
    - `method: string` HTTP method to use
    - `s3_key: string` S3 object key — pass this back in the complete request if needed for verification
    - `url: string` Pre-signed URL for uploading the file

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const dataset = await client.datasets.create({
  source: {
    file_format: 'csv',
    name: 'my-training-data',
    type: 'file',
  },
});

console.log(dataset.dataset_id);
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "status": "status",
  "upload_instructions": {
    "method": "PUT",
    "s3_key": "s3_key",
    "url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
  }
}
```

## Get a dataset by ID

`client.datasets.get(datasetID: string, options?: RequestOptions): Dataset`

**get** `/api/v1/datasets/{dataset_id}`

Get a dataset by ID

### Parameters

- `datasetID: string`

### Returns

- `Dataset`
  - `configured_column_mapping: ConfiguredColumnMapping | null` User-configured column mapping. Null if not yet configured.
    - `chat: string | null`
    - `completion: string | null`
    - `context: Array<string>`
    - `prompt: string | null`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: Error | null` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: EvaluationSummary | null` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string | null` Letter grade (A-E) after augmentation
    - `grade_before: string | null` Letter grade (A-E) before augmentation
    - `improvement_percent: number | null` Relative improvement percentage
    - `score_after: number | null` Quality score after augmentation
    - `score_before: number | null` Quality score before augmentation
  - `name: string | null` Human-readable name for the dataset
  - `progress: Progress | null` Processing progress. Null when no run is active.
    - `percent: number | null` Progress percentage (0-100)
    - `processed_rows: number | null` Number of rows processed so far
    - `total_rows: number | null` Total rows to process (samples_to_process or row_count)
  - `row_count: number | null` Total number of rows in the dataset
  - `run_id: string | null` ID of the currently active run
  - `status: "pending" | "running" | "succeeded" | "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const dataset = await client.datasets.get('dataset_id');

console.log(dataset.dataset_id);
```

#### Response

```json
{
  "configured_column_mapping": {
    "chat": "chat",
    "completion": "completion",
    "context": [
      "string"
    ],
    "prompt": "prompt"
  },
  "created_at": "2019-12-27T18:11:19.117Z",
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "evaluation_summary": {
    "grade_after": "grade_after",
    "grade_before": "grade_before",
    "improvement_percent": 0,
    "score_after": 0,
    "score_before": 0
  },
  "name": "name",
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "run_id": "run_id",
  "status": "pending",
  "updated_at": "2019-12-27T18:11:19.117Z"
}
```

## List datasets

`client.datasets.list(query?: DatasetListParams, options?: RequestOptions): Cursor<DatasetListResponse>`

**get** `/api/v1/datasets`

List datasets

### Parameters

- `query: DatasetListParams`
  - `created_after?: string` ISO 8601 datetime — datasets created after this time.
  - `created_before?: string` ISO 8601 datetime — datasets created before this time.
  - `cursor?: string` Cursor from the previous response's `next_cursor` field.
  - `limit?: number` Number of results (max 100, default 20). Used with cursor pagination.
  - `q?: string` Search by dataset name (case-insensitive contains).
  - `sort?: string` Sort field: created_at | updated_at | name (default: created_at).
  - `sort_direction?: string` Sort direction: asc | desc (default: desc).
  - `status?: string` Filter by status: pending | running | succeeded | failed

### Returns

- `DatasetListResponse`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Dataset ID
  - `status: "pending" | "running" | "succeeded" | "failed"` Dataset status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Last updated timestamp
  - `description?: string | null` Auto-generated description of the dataset contents
  - `name?: string | null` Dataset name
  - `row_count?: number | null` Total number of rows

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

// Automatically fetches more pages as needed.
for await (const datasetListResponse of client.datasets.list()) {
  console.log(datasetListResponse.dataset_id);
}
```

#### Response

```json
{
  "datasets": [
    {
      "created_at": "2019-12-27T18:11:19.117Z",
      "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "pending",
      "updated_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "name": "My training data",
      "row_count": 1000
    }
  ],
  "next_cursor": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Get the processing status of a dataset

`client.datasets.getStatus(datasetID: string, options?: RequestOptions): DatasetGetStatusResponse`

**get** `/api/v1/datasets/{dataset_id}/status`

Get the processing status of a dataset

### Parameters

- `datasetID: string`

### Returns

- `DatasetGetStatusResponse`
  - `dataset_id: string` Dataset ID
  - `error: Error | null` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `progress: Progress | null` Processing progress. Null when no run is active.
    - `percent: number | null` Progress percentage (0-100)
    - `processed_rows: number | null` Number of rows processed so far
    - `total_rows: number | null` Total rows to process (samples_to_process or row_count)
  - `row_count: number | null` Number of rows in the dataset
  - `status: "pending" | "running" | "succeeded" | "failed"` Current processing status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.getStatus('dataset_id');

console.log(response.dataset_id);
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "status": "pending"
}
```

## Download the processed dataset

`client.datasets.download(datasetID: string, query?: DatasetDownloadParams, options?: RequestOptions): DatasetDownloadResponse`

**get** `/api/v1/datasets/{dataset_id}/download`

Download the processed dataset

### Parameters

- `datasetID: string`
- `query: DatasetDownloadParams`
  - `fileFormat?: "csv" | "json" | "jsonl" | "parquet"` Output file format. Defaults to the original upload format if omitted.
    - `"csv"`
    - `"json"`
    - `"jsonl"`
    - `"parquet"`

### Returns

- `DatasetDownloadResponse = Uploadable`

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.download('dataset_id');

console.log(response);
```

#### Response

```json
"Example data"
```

## Publish a dataset to an external platform

`client.datasets.publish(datasetID: string, body: DatasetPublishParams, options?: RequestOptions): DatasetPublishResponse`

**post** `/api/v1/datasets/{dataset_id}/publish`

Publishes the processed dataset to Hugging Face or Kaggle.
Currently returns 501 — not yet implemented.

### Parameters

- `datasetID: string`
- `body: DatasetPublishParams`
  - `target: "huggingface" | "kaggle"` Destination platform for publishing the dataset
    - `"huggingface"`
    - `"kaggle"`
  - `target_spec?: Record<string, unknown>` Target-specific configuration (e.g. repo name for HuggingFace, slug for Kaggle)

### Returns

- `DatasetPublishResponse`
  - `publish_id: string` Unique identifier for the publish job
  - `status: string` Status of the publish job
  - `message?: string` Additional information about the publish request

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.publish('dataset_id', { target: 'huggingface' });

console.log(response.publish_id);
```

#### Response

```json
{
  "publish_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "message"
}
```

## Start an augmentation run (or estimate cost)

`client.datasets.run(datasetID: string, body: DatasetRunParams, options?: RequestOptions): DatasetRunResponse`

**post** `/api/v1/datasets/{dataset_id}/run`

Validates column mapping and recipe configuration, reserves credits, and starts the augmentation pipeline. Set `estimate=true` to validate and get a cost quote without starting a run.

### Parameters

- `datasetID: string`
- `body: DatasetRunParams`
  - `brand_controls?: BrandControls` Brand and quality controls for generated completions (length, safety, hallucination grounding).
    - `hallucination_mitigation?: boolean` Enable web-search grounding to reduce hallucinations in generated completions
    - `length?: "minimal" | "concise" | "detailed" | "extensive"` Target response length. Controls verbosity of generated completions.
      - `"minimal"`
      - `"concise"`
      - `"detailed"`
      - `"extensive"`
    - `safety_categories?: Array<string>` Content safety categories to enforce. Completions violating these are filtered.
  - `column_mapping?: ColumnMapping` Column role assignments for augmentation. Required for real runs, optional for estimate-only requests.
    - `prompt: string` Column to use as the prompt/instruction field
    - `chat?: string` Column containing chat/conversation data (alternative to prompt+completion)
    - `completion?: string` Column to use as the completion/response field
    - `context?: Array<string>` Columns to include as context
  - `estimate?: boolean` When true, validates the request and returns the estimated credit cost without starting a run.
  - `job_specification?: JobSpecification` Job execution parameters
    - `idempotency_key?: string` Client-generated idempotency key for safe retries. If a launch with the same key already exists, the original response is returned.
    - `max_rows?: number` Maximum number of rows to process in this run
  - `recipe_specification?: RecipeSpecification` Augmentation recipe configuration. Omitted recipes use backend defaults.
    - `recipes?: Recipes` Augmentation recipe toggles. Omitted recipes use backend defaults.
      - `deduplication?: boolean` Remove near-duplicate rows
      - `preference_pairs?: boolean` Generate DPO-style preference pairs (chosen/rejected) instead of instruction completions
      - `prompt_metadata_injection?: boolean` Inject context and constraints into prompts
      - `prompt_rephrase?: boolean` Rephrase prompts for variety and clarity
      - `reasoning_traces?: boolean` Add reasoning traces (chain-of-thought) to completions
    - `version?: string` Recipe schema version. Allows recipe options to evolve across releases.

### Returns

- `DatasetRunResponse`
  - `estimate: boolean` Whether this was an estimate-only request (no run started)
  - `estimatedCreditsConsumed: number` Estimated number of credits that will be consumed by this run
  - `estimatedMinutes: number` Estimated processing time in minutes
  - `run_id?: string | null` Unique identifier for this pipeline run. Null for estimate-only requests.
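Because `estimate=true` turns this endpoint into a cost quote, a common pattern is to quote first and launch only when the cost fits a budget. A minimal sketch against the raw REST endpoint (the budget helper, the Bearer-token auth header, and the column names are illustrative assumptions, not part of the SDK):

```typescript
import { randomUUID } from 'node:crypto';

// Pure budget policy, kept separate so it can be tested without the API.
function withinBudget(estimatedCredits: number, creditBudget: number): boolean {
  return estimatedCredits <= creditBudget;
}

// Quote the run, then launch it only if the estimate is affordable.
// Returns the run_id, or null when the quote exceeds the budget.
async function estimateThenRun(
  baseUrl: string,
  apiKey: string,
  datasetId: string,
  creditBudget: number,
): Promise<string | null> {
  const headers = { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' };
  // Hypothetical column names; use your dataset's actual columns.
  const params = { column_mapping: { prompt: 'prompt', completion: 'completion' } };

  // 1. estimate=true validates the request and returns the credit cost
  //    without starting a run.
  const quoteRes = await fetch(`${baseUrl}/api/v1/datasets/${datasetId}/run`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ ...params, estimate: true }),
  });
  const quote = await quoteRes.json();
  if (!withinBudget(quote.estimatedCreditsConsumed, creditBudget)) return null;

  // 2. Launch for real; an idempotency key makes retries safe.
  const runRes = await fetch(`${baseUrl}/api/v1/datasets/${datasetId}/run`, {
    method: 'POST',
    headers,
    body: JSON.stringify({
      ...params,
      job_specification: { idempotency_key: randomUUID() },
    }),
  });
  const run = await runRes.json();
  return run.run_id;
}
```

Separating the budget check from the HTTP calls keeps the launch policy deterministic even though the quote itself comes from the server.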
### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.run('dataset_id');

console.log(response.run_id);
```

#### Response

```json
{
  "estimate": false,
  "estimatedCreditsConsumed": 0,
  "estimatedMinutes": 0,
  "run_id": "dataset-550e8400-e29b-41d4-a716-446655440000-1712234567890"
}
```

## Get evaluation results for a dataset

`client.datasets.getEvaluation(datasetID: string, options?: RequestOptions): DatasetGetEvaluationResponse`

**get** `/api/v1/datasets/{dataset_id}/evaluation`

Get evaluation results for a dataset

### Parameters

- `datasetID: string`

### Returns

- `DatasetGetEvaluationResponse`
  - `dataset_id: string` Dataset ID
  - `quality: Quality | null` Structured quality metrics. Null until evaluation completes.
    - `grade_after: string | null` Letter grade (A-E) after augmentation
    - `grade_before: string | null` Letter grade (A-E) before augmentation
    - `improvement_percent: number | null` Relative quality improvement as a percentage
    - `percentile_after: number | null` Percentile rank (0-100) after augmentation
    - `score_after: number | null` Quality score (0-10) after augmentation
    - `score_before: number | null` Quality score (0-10) before augmentation
  - `raw_results: Record<string, unknown> | null` Raw evaluation results payload for advanced use. Null until evaluation completes.
  - `status: string | null` Evaluation pipeline status: pending | running | succeeded | failed | skipped

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.getEvaluation('dataset_id');

console.log(response.dataset_id);
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "quality": {
    "grade_after": "A",
    "grade_before": "C",
    "improvement_percent": 37.1,
    "percentile_after": 92.3,
    "score_after": 8.5,
    "score_before": 6.2
  },
  "raw_results": {
    "foo": "bar"
  },
  "status": "succeeded"
}
```

## Domain Types

### Dataset

- `Dataset`
  - `configured_column_mapping: ConfiguredColumnMapping | null` User-configured column mapping. Null if not yet configured.
    - `chat: string | null`
    - `completion: string | null`
    - `context: Array<string>`
    - `prompt: string | null`
  - `created_at: string` Timestamp when the dataset was created
  - `dataset_id: string` Unique dataset identifier
  - `error: Error | null` Error details if the dataset failed. Null otherwise.
    - `message: string` Error message
  - `evaluation_summary: EvaluationSummary | null` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: string | null` Letter grade (A-E) after augmentation
    - `grade_before: string | null` Letter grade (A-E) before augmentation
    - `improvement_percent: number | null` Relative improvement percentage
    - `score_after: number | null` Quality score after augmentation
    - `score_before: number | null` Quality score before augmentation
  - `name: string | null` Human-readable name for the dataset
  - `progress: Progress | null` Processing progress. Null when no run is active.
    - `percent: number | null` Progress percentage (0-100)
    - `processed_rows: number | null` Number of rows processed so far
    - `total_rows: number | null` Total rows to process (samples_to_process or row_count)
  - `row_count: number | null` Total number of rows in the dataset
  - `run_id: string | null` ID of the currently active run
  - `status: "pending" | "running" | "succeeded" | "failed"` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: string` Timestamp of the last update

# Upload

## Initiate a dataset upload

`client.datasets.upload.initiate(body: UploadInitiateParams, options?: RequestOptions): UploadInitiateResponse`

**post** `/api/v1/datasets/upload/initiate`

Initiate a dataset upload

### Parameters

- `body: UploadInitiateParams`
  - `file_format: "csv" | "json" | "jsonl" | "parquet"` Format of the file being uploaded
    - `"csv"`
    - `"json"`
    - `"jsonl"`
    - `"parquet"`
  - `name: string` Human-readable name for the dataset

### Returns

- `UploadInitiateResponse`
  - `upload_url: string` Pre-signed S3 URL — upload the file directly to this URL via HTTP PUT

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.upload.initiate({
  file_format: 'csv',
  name: 'my-training-data',
});

console.log(response.upload_url);
```

#### Response

```json
{
  "upload_url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
}
```

## Complete a dataset upload and trigger processing

`client.datasets.upload.complete(body: UploadCompleteParams, options?: RequestOptions): UploadCompleteResponse`

**post** `/api/v1/datasets/upload/complete`

Complete a dataset upload and trigger processing

### Parameters

- `body: UploadCompleteParams`
  - `file_format: "csv" | "json" | "jsonl" | "parquet"` Format of the uploaded file
    - `"csv"`
    - `"json"`
    - `"jsonl"`
    - `"parquet"`
  - `file_size_bytes: number` Size of the uploaded file in bytes
  - `name: string` Human-readable name for the dataset
  - `s3_key: string` S3 object key returned in the pre-signed URL response from `/upload/initiate`

### Returns

- `UploadCompleteResponse`
  - `dataset_id: string` ID of the newly created dataset

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.upload.complete({
  file_format: 'csv',
  file_size_bytes: 1048576,
  name: 'my-training-data',
  s3_key: 'uploads/550e8400-e29b-41d4-a716-446655440000/my-training-data.csv',
});

console.log(response.dataset_id);
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Complete a file upload and trigger processing

`client.datasets.upload.completeByID(datasetID: string, body: UploadCompleteByIDParams, options?: RequestOptions): UploadCompleteByIDResponse`

**post** `/api/v1/datasets/{dataset_id}/upload/complete`

File uploads only. Call after uploading bytes to the presigned URL from POST `/datasets`. Verifies the file exists in S3, then triggers the preprocessing pipeline.
### Parameters

- `datasetID: string`
- `body: UploadCompleteByIDParams`
  - `file_size_bytes: number` Size of the uploaded file in bytes (for verification)
  - `sha256?: string` SHA-256 hex digest of the uploaded file (for integrity verification)

### Returns

- `UploadCompleteByIDResponse`
  - `dataset_id: string` ID of the dataset
  - `status: string` Current status of the dataset after completing upload

### Example

```typescript
import Adaption from 'adaption';

const client = new Adaption({
  apiKey: process.env['ADAPTION_API_KEY'], // This is the default and can be omitted
});

const response = await client.datasets.upload.completeByID('dataset_id', {
  file_size_bytes: 1048576,
});

console.log(response.dataset_id);
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}
```
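Tying the endpoints together: after completing an upload, a client typically polls GET `/api/v1/datasets/{dataset_id}/status` until the lifecycle status settles, then downloads or evaluates the dataset. A minimal polling sketch (the poll interval and the Bearer-token auth header are assumptions; adapt to your client setup):

```typescript
// Statuses that mean the pipeline has settled and polling can stop.
const TERMINAL = new Set(['succeeded', 'failed']);

function isTerminal(status: string): boolean {
  return TERMINAL.has(status);
}

// Poll the status endpoint until the dataset reaches a terminal state,
// logging progress along the way. Returns the final status.
async function waitForDataset(
  baseUrl: string,
  apiKey: string,
  datasetId: string,
  intervalMs = 5000,
): Promise<string> {
  for (;;) {
    const res = await fetch(`${baseUrl}/api/v1/datasets/${datasetId}/status`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const { status, progress } = await res.json();
    if (isTerminal(status)) return status;
    console.log(`processing: ${progress?.percent ?? 0}%`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A fixed interval keeps the sketch simple; a production client might add exponential backoff and a timeout so a stuck run does not poll forever.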