Processing large datasets
Run Adaptive Data on a subset of rows for pilots and cost checks before processing the full dataset.
Full datasets—especially imports from Hugging Face or Kaggle—can mean many rows and a larger credit bill. You often want to validate column mapping, recipes, and brand_controls on a small subset of rows before committing to a full run.
job_specification on datasets.run carries job-level execution options. Setting max_rows there tells the platform to process at most that many rows from the dataset for that run, so you effectively subsample the same uploaded data instead of maintaining a separate tiny file.
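As a mental model only (the actual capping happens server-side, and which rows the platform selects is not specified here), max_rows behaves like taking at most N rows from the same uploaded data. A minimal sketch in plain Python, with a hypothetical cap_rows helper:

```python
def cap_rows(rows, max_rows=None):
    # Hypothetical illustration only: the real cap is applied by the platform.
    # This sketch takes the first max_rows rows purely for clarity.
    rows = list(rows)
    if max_rows is None:
        return rows            # no cap set: the whole dataset is processed
    return rows[:max_rows]     # at most max_rows rows are processed

dataset = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(1200)]
print(len(cap_rows(dataset, max_rows=500)))  # 500
print(len(cap_rows(dataset)))                # 1200
```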
Limit rows with max_rows
Use a JobSpecification dict with max_rows set to the cap you want (the Python types use a numeric type; integer literals such as 500 are fine):
```python
run = client.datasets.run(
    dataset_id,
    column_mapping={
        "prompt": "instruction",
        "completion": "response",
    },
    job_specification={
        "max_rows": 500,
    },
)
```

Omit job_specification (or max_rows) when you intend to process all rows in the dataset.
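When the cap comes from configuration, it can help to build the run() keyword arguments conditionally so that an unset cap really does omit job_specification. A minimal sketch of that pattern; build_run_kwargs is a hypothetical helper, not part of the SDK:

```python
def build_run_kwargs(column_mapping, max_rows=None):
    """Hypothetical helper: include job_specification only when a cap is set."""
    kwargs = {"column_mapping": column_mapping}
    if max_rows is not None:
        kwargs["job_specification"] = {"max_rows": max_rows}
    return kwargs

mapping = {"prompt": "instruction", "completion": "response"}
print(build_run_kwargs(mapping, max_rows=500))
print(build_run_kwargs(mapping))  # no cap: job_specification is omitted
```

The resulting dict can then be splatted into the call, e.g. client.datasets.run(dataset_id, **build_run_kwargs(mapping, max_rows=500)).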
Combine with estimate=True
For a cost and duration quote on that same subset, call datasets.run with estimate=True before starting the real run:
```python
quote = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 500},
    estimate=True,
)
print(f"Subset estimate: {quote.estimated_credits_consumed} credits")
```
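A subset quote can also serve as a rough projection of full-run cost by scaling linearly with row count. This is a back-of-the-envelope heuristic, not an API feature; per-row cost may not be perfectly linear, and the total row count would come from your dataset's status:

```python
def extrapolate_credits(subset_credits, subset_rows, total_rows):
    """Linearly scale a subset quote up to the full dataset (rough heuristic)."""
    per_row = subset_credits / subset_rows
    return per_row * total_rows

# e.g. a 500-row pilot quoted at 25 credits, with 120,000 rows in total
print(extrapolate_credits(25.0, 500, 120_000))  # 6000.0
```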
```python
run = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 500},
)
```

Full example
Same flow as Getting started: ingest (or reuse a dataset_id), wait until the dataset is ready, run on a bounded number of rows, wait for completion, then export.
```python
import os
import time

from adaption import Adaption, DatasetTimeout

client = Adaption(api_key=os.environ["ADAPTION_API_KEY"])
dataset_id = os.environ.get("ADAPTION_DATASET_ID")

if not dataset_id:
    result = client.datasets.upload_file("large_training_data.csv")
    dataset_id = result.dataset_id
    while True:
        st = client.datasets.get_status(dataset_id)
        if st.row_count is not None:
            break
        time.sleep(2)

run = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 1_000},
)
print(f"Run started (up to 1,000 rows): {run.run_id}")

try:
    final = client.datasets.wait_for_completion(dataset_id, timeout=3600)
    print(f"Finished: {final.status}")
    if final.error:
        raise RuntimeError(final.error.message)
except DatasetTimeout:
    print("Timed out — poll datasets.get or get_status in your environment")

url = client.datasets.download(dataset_id)
print(f"Download: {url}")
```

When to use it
Use max_rows when you need a representative trial on production-scale data: confirm mappings and behavior, compare estimate=True quotes, or share a quick pilot without duplicating files. When you are ready for the full corpus, run again without max_rows (or with a higher cap) so the job processes the entire dataset.