---
title: Processing large datasets | Adaption
description: Run Adaptive Data on a subset of rows for pilots and cost checks before processing the full dataset.
---

Full datasets—especially imports from Hugging Face or Kaggle—can mean **many rows** and a **larger credit bill**. You often want to **validate column mapping**, recipes, and `brand_controls` on a **small subset** of rows before committing to a full run.

**`job_specification`** on `datasets.run` carries **job-level execution options**. Setting **`max_rows`** there tells the platform to process **at most that many rows** from the dataset for that run, so you effectively **subsample** the same uploaded data instead of maintaining a separate tiny file.

Pass **`job_specification`** as a keyword argument to `datasets.run`. See **JobSpecification** in the [API Reference](/api/index.md) for **`max_rows`**, **`idempotency_key`**, and any additional fields the schema exposes.

## Limit rows with `max_rows`

Use a **`JobSpecification`** dict with **`max_rows`** set to the cap you want (the Python types use a numeric type; integer literals such as `500` are fine):

```python
run = client.datasets.run(
    dataset_id,
    column_mapping={
        "prompt": "instruction",
        "completion": "response",
    },
    job_specification={
        "max_rows": 500,
    },
)
```

Omit **`job_specification`** (or **`max_rows`**) when you intend to process **all** rows in the dataset.
## Combine with `estimate=True`

For a **cost and duration quote** on that same subset, call `datasets.run` with **`estimate=True`** before starting the real run:

```python
quote = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 500},
    estimate=True,
)
print(f"Subset estimate: {quote.estimated_credits_consumed} credits")

run = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 500},
)
```

## Full example

Same flow as [Getting started](/introduction/getting-started/index.md): ingest (or reuse a `dataset_id`), wait until the dataset is ready, run on a **bounded number of rows**, wait for completion, then export.

```python
import os
import time

from adaption import Adaption, DatasetTimeout

client = Adaption(api_key=os.environ["ADAPTION_API_KEY"])

dataset_id = os.environ.get("ADAPTION_DATASET_ID")
if not dataset_id:
    result = client.datasets.upload_file("large_training_data.csv")
    dataset_id = result.dataset_id

# Wait until ingestion has finished and the row count is known.
while True:
    st = client.datasets.get_status(dataset_id)
    if st.row_count is not None:
        break
    time.sleep(2)

run = client.datasets.run(
    dataset_id,
    column_mapping={"prompt": "instruction", "completion": "response"},
    job_specification={"max_rows": 1_000},
)
print(f"Run started (up to 1,000 rows): {run.run_id}")

try:
    final = client.datasets.wait_for_completion(dataset_id, timeout=3600)
    print(f"Finished: {final.status}")
    if final.error:
        raise RuntimeError(final.error.message)
except DatasetTimeout:
    print("Timed out — poll datasets.get or get_status in your environment")

url = client.datasets.download(dataset_id)
print(f"Download: {url}")
```

## When to use it

Use **`max_rows`** when you need a **representative trial** on production-scale data: confirm mappings and behavior, compare **`estimate=True`** quotes, or share a quick pilot without duplicating files.
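As a rough sanity check when comparing quotes, you can extrapolate a full-run credit figure from the subset quote before committing. The helper below is an illustrative sketch, not part of the SDK, and it assumes per-row cost is roughly linear, which actual billing may not be:

```python
def extrapolate_credits(subset_credits: float, subset_rows: int, total_rows: int) -> float:
    """Linearly scale a subset quote up to the full dataset.

    Treat the result as a ballpark figure for planning,
    not as the exact amount the platform will charge.
    """
    if subset_rows <= 0:
        raise ValueError("subset_rows must be positive")
    return subset_credits * (total_rows / subset_rows)

# A 500-row pilot quoted at 12.5 credits, scaled to 100,000 rows:
print(extrapolate_credits(12.5, 500, 100_000))  # → 2500.0
```

Compare this ballpark against a real `estimate=True` quote on the full dataset before starting the run.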
When you are ready for the full corpus, run again **without** **`max_rows`** (or with a higher cap) so the job processes the entire dataset.