# Datasets

## Create a dataset from file upload, HuggingFace, or Kaggle

`datasets.create(**kwargs: DatasetCreateParams) -> DatasetCreateResponse`

**post** `/api/v1/datasets`

Unified ingest endpoint, discriminated by `source.type`: `"file"` returns upload instructions for a presigned S3 PUT; `"huggingface"` and `"kaggle"` start an async import.

### Parameters

- `source: Source` Dataset source configuration. Discriminated by `type`: file, huggingface, or kaggle.
  - `class SourceFileSourceDto: …`
    - `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the file being uploaded
      - `"csv"`
      - `"json"`
      - `"jsonl"`
      - `"parquet"`
    - `name: str` Human-readable name for the dataset
    - `type: Literal["file"]` Source type
      - `"file"`
  - `class SourceHuggingfaceSourceDto: …`
    - `files: SequenceNotStr[str]` File paths to download from the repository
    - `type: Literal["huggingface"]` Source type
      - `"huggingface"`
    - `url: str` HuggingFace dataset repository URL
  - `class SourceKaggleSourceDto: …`
    - `files: SequenceNotStr[str]` File paths to download from the dataset
    - `type: Literal["kaggle"]` Source type
      - `"kaggle"`
    - `url: str` Kaggle dataset URL

### Returns

- `class DatasetCreateResponse: …`
  - `dataset_id: str` ID of the newly created dataset
  - `status: str` Current dataset status
  - `upload_instructions: Optional[UploadInstructions]` Upload instructions for file sources. PUT your file to the provided URL.
    - `method: str` HTTP method to use
    - `s3_key: str` S3 object key — pass this back in the complete request if needed for verification
    - `url: str` Pre-signed URL for uploading the file

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
dataset = client.datasets.create(
    source={
        "file_format": "csv",
        "name": "my-training-data",
        "type": "file",
    },
)
print(dataset.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "status": "status",
  "upload_instructions": {
    "method": "PUT",
    "s3_key": "s3_key",
    "url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
  }
}
```

## Get a dataset by ID

`datasets.get(dataset_id: str) -> Dataset`

**get** `/api/v1/datasets/{dataset_id}`

Get a dataset by ID

### Parameters

- `dataset_id: str`

### Returns

- `class Dataset: …`
  - `configured_column_mapping: Optional[ConfiguredColumnMapping]` User-configured column mapping. Null if not yet configured.
    - `chat: Optional[str]`
    - `completion: Optional[str]`
    - `context: List[str]`
    - `prompt: Optional[str]`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Unique dataset identifier
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `evaluation_summary: Optional[EvaluationSummary]` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative improvement percentage
    - `score_after: Optional[float]` Quality score after augmentation
    - `score_before: Optional[float]` Quality score before augmentation
  - `name: Optional[str]` Human-readable name for the dataset
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Total number of rows in the dataset
  - `run_id: Optional[str]` ID of the currently active run
  - `status: Literal["pending", "running", "succeeded", "failed"]` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Timestamp of the last update

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
dataset = client.datasets.get(
    "dataset_id",
)
print(dataset.dataset_id)
```

#### Response

```json
{
  "configured_column_mapping": {
    "chat": "chat",
    "completion": "completion",
    "context": ["string"],
    "prompt": "prompt"
  },
  "created_at": "2019-12-27T18:11:19.117Z",
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "evaluation_summary": {
    "grade_after": "grade_after",
    "grade_before": "grade_before",
    "improvement_percent": 0,
    "score_after": 0,
    "score_before": 0
  },
  "name": "name",
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "run_id": "run_id",
  "status": "pending",
  "updated_at": "2019-12-27T18:11:19.117Z"
}
```

## List datasets

`datasets.list(**kwargs: DatasetListParams) -> SyncCursor[DatasetListResponse]`

**get** `/api/v1/datasets`

List datasets

### Parameters

- `created_after: Optional[str]` ISO 8601 datetime — datasets created after this time.
- `created_before: Optional[str]` ISO 8601 datetime — datasets created before this time.
- `cursor: Optional[str]` Cursor from the previous response's `next_cursor` field.
- `limit: Optional[float]` Number of results (max 100, default 20). Used with cursor pagination.
- `q: Optional[str]` Search by dataset name (case-insensitive contains).
- `sort: Optional[str]` Sort field: created_at | updated_at | name (default: created_at).
- `sort_direction: Optional[str]` Sort direction: asc | desc (default: desc).
- `status: Optional[str]` Filter by status: pending | running | succeeded | failed

### Returns

- `class DatasetListResponse: …`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Dataset ID
  - `status: Literal["pending", "running", "succeeded", "failed"]` Dataset status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Last updated timestamp
  - `description: Optional[str]` Auto-generated description of the dataset contents
  - `name: Optional[str]` Dataset name
  - `row_count: Optional[int]` Total number of rows

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
page = client.datasets.list()
dataset = page.datasets[0]
print(dataset.dataset_id)
```

#### Response

```json
{
  "datasets": [
    {
      "created_at": "2019-12-27T18:11:19.117Z",
      "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "pending",
      "updated_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "name": "My training data",
      "row_count": 1000
    }
  ],
  "next_cursor": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Get the processing status of a dataset

`datasets.get_status(dataset_id: str) -> DatasetGetStatusResponse`

**get** `/api/v1/datasets/{dataset_id}/status`

Get the processing status of a dataset

### Parameters

- `dataset_id: str`

### Returns

- `class DatasetGetStatusResponse: …`
  - `dataset_id: str` Dataset ID
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Number of rows in the dataset
  - `status: Literal["pending", "running", "succeeded", "failed"]` Current processing status
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.get_status(
    "dataset_id",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "error": {
    "message": "message"
  },
  "progress": {
    "percent": 0,
    "processed_rows": 0,
    "total_rows": 0
  },
  "row_count": 0,
  "status": "pending"
}
```

## Download the processed dataset

`datasets.download(dataset_id: str, **kwargs: DatasetDownloadParams) -> DatasetDownloadResponse`

**get** `/api/v1/datasets/{dataset_id}/download`

Download the processed dataset

### Parameters

- `dataset_id: str`
- `file_format: Optional[Literal["csv", "json", "jsonl", "parquet"]]` Output file format. Defaults to the original upload format if omitted.
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`

### Returns

- `object`

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.download(
    dataset_id="dataset_id",
)
print(response)
```

#### Response

```json
"Example data"
```

## Publish a dataset to an external platform

`datasets.publish(dataset_id: str, **kwargs: DatasetPublishParams) -> DatasetPublishResponse`

**post** `/api/v1/datasets/{dataset_id}/publish`

Publishes the processed dataset to Hugging Face or Kaggle. Currently returns 501 — not yet implemented.
### Parameters

- `dataset_id: str`
- `target: Literal["huggingface", "kaggle"]` Destination platform for publishing the dataset
  - `"huggingface"`
  - `"kaggle"`
- `target_spec: Optional[Dict[str, object]]` Target-specific configuration (e.g. repo name for HuggingFace, slug for Kaggle)

### Returns

- `class DatasetPublishResponse: …`
  - `publish_id: str` Unique identifier for the publish job
  - `status: str` Status of the publish job
  - `message: Optional[str]` Additional information about the publish request

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.publish(
    dataset_id="dataset_id",
    target="huggingface",
)
print(response.publish_id)
```

#### Response

```json
{
  "publish_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "message"
}
```

## Start an augmentation run (or estimate cost)

`datasets.run(dataset_id: str, **kwargs: DatasetRunParams) -> DatasetRunResponse`

**post** `/api/v1/datasets/{dataset_id}/run`

Validates column mapping and recipe configuration, reserves credits, and starts the augmentation pipeline. Set `estimate=true` to validate and get a cost quote without starting a run.

### Parameters

- `dataset_id: str`
- `brand_controls: Optional[BrandControls]` Brand and quality controls for generated completions (length, safety, hallucination grounding).
  - `hallucination_mitigation: Optional[bool]` Enable web-search grounding to reduce hallucinations in generated completions
  - `length: Optional[Literal["minimal", "concise", "detailed", "extensive"]]` Target response length. Controls verbosity of generated completions.
    - `"minimal"`
    - `"concise"`
    - `"detailed"`
    - `"extensive"`
  - `safety_categories: Optional[SequenceNotStr[str]]` Content safety categories to enforce. Completions violating these are filtered.
- `column_mapping: Optional[ColumnMapping]` Column role assignments for augmentation.
  Required for real runs, optional for estimate-only requests.

  - `prompt: str` Column to use as the prompt/instruction field
  - `chat: Optional[str]` Column containing chat/conversation data (alternative to prompt+completion)
  - `completion: Optional[str]` Column to use as the completion/response field
  - `context: Optional[SequenceNotStr[str]]` Columns to include as context
- `estimate: Optional[bool]` When true, validates the request and returns the estimated credit cost without starting a run.
- `job_specification: Optional[JobSpecification]` Job execution parameters
  - `idempotency_key: Optional[str]` Client-generated idempotency key for safe retries. If a launch with the same key already exists, the original response is returned.
  - `max_rows: Optional[float]` Maximum number of rows to process in this run
- `recipe_specification: Optional[RecipeSpecification]` Augmentation recipe configuration. Omitted recipes use backend defaults.
  - `recipes: Optional[RecipeSpecificationRecipes]` Augmentation recipe toggles. Omitted recipes use backend defaults.
    - `deduplication: Optional[bool]` Remove near-duplicate rows
    - `preference_pairs: Optional[bool]` Generate DPO-style preference pairs (chosen/rejected) instead of instruction completions
    - `prompt_metadata_injection: Optional[bool]` Inject context and constraints into prompts
    - `prompt_rephrase: Optional[bool]` Rephrase prompts for variety and clarity
    - `reasoning_traces: Optional[bool]` Add reasoning traces (chain-of-thought) to completions
  - `version: Optional[str]` Recipe schema version. Allows recipe options to evolve across releases.

### Returns

- `class DatasetRunResponse: …`
  - `estimate: bool` Whether this was an estimate-only request (no run started)
  - `estimated_credits_consumed: float` Estimated number of credits that will be consumed by this run
  - `estimated_minutes: float` Estimated processing time in minutes
  - `run_id: Optional[str]` Unique identifier for this pipeline run. Null for estimate-only requests.
### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.run(
    dataset_id="dataset_id",
)
print(response.run_id)
```

#### Response

```json
{
  "estimate": false,
  "estimated_credits_consumed": 0,
  "estimated_minutes": 0,
  "run_id": "dataset-550e8400-e29b-41d4-a716-446655440000-1712234567890"
}
```

## Get evaluation results for a dataset

`datasets.get_evaluation(dataset_id: str) -> DatasetGetEvaluationResponse`

**get** `/api/v1/datasets/{dataset_id}/evaluation`

Get evaluation results for a dataset

### Parameters

- `dataset_id: str`

### Returns

- `class DatasetGetEvaluationResponse: …`
  - `dataset_id: str` Dataset ID
  - `quality: Optional[Quality]` Structured quality metrics. Null until evaluation completes.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative quality improvement as a percentage
    - `percentile_after: Optional[float]` Percentile rank (0-100) after augmentation
    - `score_after: Optional[float]` Quality score (0-10) after augmentation
    - `score_before: Optional[float]` Quality score (0-10) before augmentation
  - `raw_results: Optional[Dict[str, object]]` Raw evaluation results payload for advanced use. Null until evaluation completes.
  - `status: Optional[str]` Evaluation pipeline status: pending | running | succeeded | failed | skipped

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.get_evaluation(
    "dataset_id",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "dataset_id",
  "quality": {
    "grade_after": "A",
    "grade_before": "C",
    "improvement_percent": 37.1,
    "percentile_after": 92.3,
    "score_after": 8.5,
    "score_before": 6.2
  },
  "raw_results": {
    "foo": "bar"
  },
  "status": "succeeded"
}
```

## Domain Types

### Dataset

- `class Dataset: …`
  - `configured_column_mapping: Optional[ConfiguredColumnMapping]` User-configured column mapping. Null if not yet configured.
    - `chat: Optional[str]`
    - `completion: Optional[str]`
    - `context: List[str]`
    - `prompt: Optional[str]`
  - `created_at: datetime` Timestamp when the dataset was created
  - `dataset_id: str` Unique dataset identifier
  - `error: Optional[Error]` Error details if the dataset failed. Null otherwise.
    - `message: str` Error message
  - `evaluation_summary: Optional[EvaluationSummary]` Compact evaluation summary. Null if evaluation has not completed.
    - `grade_after: Optional[str]` Letter grade (A-E) after augmentation
    - `grade_before: Optional[str]` Letter grade (A-E) before augmentation
    - `improvement_percent: Optional[float]` Relative improvement percentage
    - `score_after: Optional[float]` Quality score after augmentation
    - `score_before: Optional[float]` Quality score before augmentation
  - `name: Optional[str]` Human-readable name for the dataset
  - `progress: Optional[Progress]` Processing progress. Null when no run is active.
    - `percent: Optional[int]` Progress percentage (0-100)
    - `processed_rows: Optional[int]` Number of rows processed so far
    - `total_rows: Optional[int]` Total rows to process (samples_to_process or row_count)
  - `row_count: Optional[int]` Total number of rows in the dataset
  - `run_id: Optional[str]` ID of the currently active run
  - `status: Literal["pending", "running", "succeeded", "failed"]` Lifecycle status: pending, running, succeeded, or failed
    - `"pending"`
    - `"running"`
    - `"succeeded"`
    - `"failed"`
  - `updated_at: datetime` Timestamp of the last update

# Upload

## Initiate a dataset upload

`datasets.upload.initiate(**kwargs: UploadInitiateParams) -> UploadInitiateResponse`

**post** `/api/v1/datasets/upload/initiate`

Initiate a dataset upload

### Parameters

- `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the file being uploaded
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `name: str` Human-readable name for the dataset

### Returns

- `class UploadInitiateResponse: …`
  - `upload_url: str` Pre-signed S3 URL — upload the file directly to this URL via HTTP PUT

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.initiate(
    file_format="csv",
    name="my-training-data",
)
print(response.upload_url)
```

#### Response

```json
{
  "upload_url": "https://s3.amazonaws.com/bucket/key?X-Amz-Signature=..."
}
```

## Complete a dataset upload and trigger processing

`datasets.upload.complete(**kwargs: UploadCompleteParams) -> UploadCompleteResponse`

**post** `/api/v1/datasets/upload/complete`

Complete a dataset upload and trigger processing

### Parameters

- `file_format: Literal["csv", "json", "jsonl", "parquet"]` Format of the uploaded file
  - `"csv"`
  - `"json"`
  - `"jsonl"`
  - `"parquet"`
- `file_size_bytes: float` Size of the uploaded file in bytes
- `name: str` Human-readable name for the dataset
- `s3_key: str` S3 object key returned in the pre-signed URL response from /upload/initiate

### Returns

- `class UploadCompleteResponse: …`
  - `dataset_id: str` ID of the newly created dataset

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.complete(
    file_format="csv",
    file_size_bytes=1048576,
    name="my-training-data",
    s3_key="uploads/550e8400-e29b-41d4-a716-446655440000/my-training-data.csv",
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

## Complete a file upload and trigger processing

`datasets.upload.complete_by_id(dataset_id: str, **kwargs: UploadCompleteByIDParams) -> UploadCompleteByIDResponse`

**post** `/api/v1/datasets/{dataset_id}/upload/complete`

File uploads only. Call after uploading bytes to the presigned URL from POST /datasets. Verifies the file exists in S3, then triggers the preprocessing pipeline.
### Parameters

- `dataset_id: str`
- `file_size_bytes: float` Size of the uploaded file in bytes (for verification)
- `sha256: Optional[str]` SHA-256 hex digest of the uploaded file (for integrity verification)

### Returns

- `class UploadCompleteByIDResponse: …`
  - `dataset_id: str` ID of the dataset
  - `status: str` Current status of the dataset after completing upload

### Example

```python
import os

from adaption import Adaption

client = Adaption(
    api_key=os.environ.get("ADAPTION_API_KEY"),  # This is the default and can be omitted
)
response = client.datasets.upload.complete_by_id(
    dataset_id="dataset_id",
    file_size_bytes=1048576,
)
print(response.dataset_id)
```

#### Response

```json
{
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing"
}
```