
GPT Batch Predictions

The BatchRunGPT class in cred-LLM-pipelines/batch_gpt.py provides a streamlined interface for running large-scale batch predictions through the OpenAI Batch API. It handles file chunking, job submission, status monitoring, result retrieval, and — importantly — cost estimation before committing to a run.

Repository: cred-LLM-pipelines
File: batch_gpt.py


Supported Models

| Model | Endpoint | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|
| gpt-5-mini | /v1/responses | $0.125 | $1.00 |
| gpt-4o-mini | /v1/chat/completions | $0.075 | $0.30 |

Note

These are the Batch API prices, which are lower than the real-time API prices. The costs are defined as class-level constants in BatchRunGPT.
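The docs note that these costs live as class-level constants. A sketch of what such constants might look like (the names and structure here are assumptions for illustration, not the actual attribute names in BatchRunGPT; the prices and endpoints match the table above):

```python
# Hypothetical layout of the pricing constants (illustrative names,
# not the actual attributes of BatchRunGPT); prices are per 1M tokens.
BATCH_PRICES = {
    "gpt-5-mini":  {"endpoint": "/v1/responses",        "input": 0.125, "output": 1.00},
    "gpt-4o-mini": {"endpoint": "/v1/chat/completions", "input": 0.075, "output": 0.30},
}
```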


Quick Start

from batch_gpt import BatchRunGPT

batch_run = BatchRunGPT(
    data=df,                # DataFrame with an id column and a 'prompt_col' column
    system_prompt=prompt,   # system instructions for the model
    id_col="company_id",    # column used as unique identifier per row
    job_name="my_job",      # descriptive name — used for saving/loading batch IDs
)

batch_run.create_files()        # writes .jsonl batch files locally
batch_run.create_job()          # uploads files and submits batch jobs to OpenAI

# monitor progress (pass the batch index, e.g. 0)
batch_run.get_object_information(0)

# once status is 'completed'
results_df = batch_run.retrieve_results()

Input DataFrame Requirements

The DataFrame passed as data must contain:

  • A column whose name matches id_col — unique identifier per row (integer or string).
  • A column named prompt_col — the text sent to the model as the user message for that row.
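A minimal sketch of a valid input DataFrame. The column name company_id and the row contents are illustrative; only the prompt_col name is fixed by the class:

```python
import pandas as pd

# 'company_id' is whatever you will pass as id_col; 'prompt_col' is required verbatim
df = pd.DataFrame({
    "company_id": [101, 102, 103],
    "prompt_col": [
        "Classify the industry of Acme Corp.",
        "Classify the industry of Globex Inc.",
        "Classify the industry of Initech LLC.",
    ],
})

# sanity checks worth running before constructing BatchRunGPT
assert df["company_id"].is_unique
assert df["prompt_col"].notna().all()
```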

Cost Estimation

Running batch jobs on hundreds of thousands of rows can get expensive. BatchRunGPT provides a way to estimate costs before submitting a job.

Sample-Based Estimation

Pass estimate_cost=True when instantiating the class. This runs 30 real (non-batch) API calls on a random sample of your data, measures actual token usage, and extrapolates to the full dataset.

batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    gpt_model="gpt-5-mini",
    max_tokens=100,
    estimate_cost=True,       # triggers cost estimation
)

What happens under the hood:

  1. A random sample of 30 rows is drawn from the DataFrame.
  2. Each row is sent as a real (non-batch) API call to the selected model.
  3. The actual input and output token counts are recorded.
  4. Average tokens per row are multiplied by the total row count and the per-token cost for the selected model.
  5. The estimated input and output costs are printed:

Estimating job cost by running on sample to count tokens...
This job is estimated to cost $1.23 for input and $0.45 for output
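The extrapolation in step 4 is simple arithmetic. A sketch with made-up sample measurements (the real method gets its per-row averages from the live API calls described above; the numbers below are purely illustrative):

```python
# Hypothetical averages measured over the 30-row sample
avg_input_tokens = 405.0   # mean input tokens per sampled row
avg_output_tokens = 82.0   # mean output tokens per sampled row
total_rows = 250_000       # size of the full DataFrame

# gpt-5-mini Batch API prices per 1M tokens (from the pricing table)
input_price, output_price = 0.125, 1.00

est_input_cost = avg_input_tokens * total_rows / 1_000_000 * input_price
est_output_cost = avg_output_tokens * total_rows / 1_000_000 * output_price

print(f"This job is estimated to cost ${est_input_cost:.2f} for input "
      f"and ${est_output_cost:.2f} for output")
```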

How It Works Internally

Batch Lifecycle

For context, here is the full lifecycle of a batch job — cost estimation is the optional first step:

graph LR
    A["estimate_job_cost()"] -->|optional| B["create_files()"]
    B --> C["create_job()"]
    C --> D["get_object_information()"]
    D --> E{Status?}
    E -->|completed| F["retrieve_results()"]
    E -->|in_progress| D
    E -->|error| G["cancel_job()"]
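The monitor-then-retrieve loop in the diagram can be scripted. A minimal polling sketch, assuming get_object_information(i) returns a status string such as "completed" (an assumption about its return value; the docs only say it returns the status of a batch):

```python
import time

def wait_for_batch(batch_run, batch_num, poll_seconds=60):
    """Poll one batch until it reaches a terminal state.

    Returns True on completion, False on a failure state.
    Assumes get_object_information() returns a status string.
    """
    while True:
        status = batch_run.get_object_information(batch_num)
        if status == "completed":
            return True
        if status in ("failed", "expired", "cancelled"):
            return False
        time.sleep(poll_seconds)
```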

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame | required | Input data with id_col and prompt_col columns |
| system_prompt | str | required | System instructions for the model |
| id_col | str | required | Name of the unique identifier column |
| job_name | str | required | Descriptive job name (used for saving batch IDs) |
| gpt_model | str | "gpt-5-mini" | Model to use (gpt-5-mini or gpt-4o-mini) |
| max_tokens | int | 100 | Max output tokens per response |
| batch_files_dir | str | "batch_files" | Directory for local .jsonl batch files |
| file_name | str | None | Pickle file to reload batch IDs from a previous run |
| big_input | bool | False | Reduces batch size from 50k to 15k rows per file for large prompts |
| estimate_cost | bool | False | Run sample-based cost estimation before job creation |

Key Methods

| Method | Description |
|---|---|
| create_files() | Builds .jsonl files split into chunks of up to 50k rows |
| create_job() | Uploads files to OpenAI and creates batch objects |
| get_object_information(batch_num) | Returns the status of a specific batch |
| get_completion_time(batch_num) | Prints how long a completed batch took |
| retrieve_results(ignore_incomplete) | Downloads and merges results into the original DataFrame |
| cancel_job() | Cancels all submitted batches |
| get_running_batches() | Lists batches still in progress |
| estimate_job_cost() | Runs sample-based cost estimation (called automatically when estimate_cost=True) |
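For context on what create_files() produces: each line of a .jsonl batch file is one OpenAI Batch API request object. A sketch of a single line for gpt-4o-mini, following the Batch API's documented request shape (the field values are illustrative, and the exact body BatchRunGPT emits may differ):

```python
import json

# Illustrative values — in practice these come from one DataFrame row
row_id = "101"
system_prompt = "You are an industry classifier."
user_prompt = "Classify the industry of Acme Corp."

request_line = {
    "custom_id": row_id,            # lets results be mapped back to id_col
    "method": "POST",
    "url": "/v1/chat/completions",  # endpoint for gpt-4o-mini
    "body": {
        "model": "gpt-4o-mini",
        "max_tokens": 100,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    },
}

# One such JSON object per line, up to 50k lines per file (15k with big_input=True)
print(json.dumps(request_line))
```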

Resuming a Previous Job

If you need to check on or retrieve results from a previously submitted job, pass the saved pickle file name:

batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    file_name="batch_ids_my_job.pkl",   # saved automatically during create_job()
)

# check status
batch_run.get_object_information(0)

# retrieve when done
results_df = batch_run.retrieve_results()
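Conceptually, the merge that retrieve_results() performs joins the model outputs back onto the original DataFrame by the id column. A sketch of the equivalent pandas operation (this is an illustration of the concept, not the actual implementation; the response column name is made up):

```python
import pandas as pd

# Original input and hypothetical downloaded results, keyed by the id column
df = pd.DataFrame({"company_id": [101, 102], "prompt_col": ["a", "b"]})
results = pd.DataFrame({"company_id": [101, 102], "response": ["Tech", "Finance"]})

# Left join keeps every input row, even ones with no result yet
results_df = df.merge(results, on="company_id", how="left")
```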