
GPT Batch Predictions

The BatchRunGPT class in cred-LLM-pipelines/batch_gpt.py provides a streamlined interface for running large-scale batch predictions through the OpenAI Batch API. It handles file chunking, job submission, status monitoring, result retrieval, and — importantly — cost estimation before committing to a run.

Repository: cred-LLM-pipelines
File: batch_gpt.py


Supported Models

| Model | Endpoint | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|
| gpt-5-mini | /v1/responses | $0.125 | $1.00 |
| gpt-4o-mini | /v1/chat/completions | $0.075 | $0.30 |

Note

These are the Batch API prices, which are lower than the real-time API prices. The costs are defined as class-level constants in BatchRunGPT.
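The docs note that these costs live as class-level constants. A sketch of what such constants might look like (the names and structure here are assumptions for illustration, not the actual attribute names in BatchRunGPT; the prices and endpoints match the table above):

```python
# Hypothetical layout of the pricing constants (illustrative names,
# not the actual attributes of BatchRunGPT); prices are per 1M tokens.
BATCH_PRICES = {
    "gpt-5-mini":  {"endpoint": "/v1/responses",        "input": 0.125, "output": 1.00},
    "gpt-4o-mini": {"endpoint": "/v1/chat/completions", "input": 0.075, "output": 0.30},
}
```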


Quick Start

from batch_gpt import BatchRunGPT

batch_run = BatchRunGPT(
    data=df,                # DataFrame with an id column and a 'prompt_col' column
    system_prompt=prompt,   # system instructions for the model
    id_col="company_id",    # column used as unique identifier per row
    job_name="my_job",      # descriptive name — used for saving/loading batch IDs
)

batch_run.create_files()        # writes .jsonl batch files locally
batch_run.create_job()          # uploads files and submits batch jobs to OpenAI

# monitor progress (pass the batch index, e.g. 0)
batch_run.get_object_information(0)

# once status is 'completed'
results_df = batch_run.retrieve_results()

Input DataFrame Requirements

The DataFrame passed as data must contain:

  • A column whose name matches id_col — unique identifier per row (integer or string).
  • A column named prompt_col — the text sent to the model as the user message for that row.
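A minimal sketch of a valid input DataFrame. The column name company_id and the row contents are illustrative; only the prompt_col name is fixed by the class:

```python
import pandas as pd

# 'company_id' is whatever you will pass as id_col; 'prompt_col' is required verbatim
df = pd.DataFrame({
    "company_id": [101, 102, 103],
    "prompt_col": [
        "Classify the industry of Acme Corp.",
        "Classify the industry of Globex Inc.",
        "Classify the industry of Initech LLC.",
    ],
})

# sanity checks worth running before constructing BatchRunGPT
assert df["company_id"].is_unique
assert df["prompt_col"].notna().all()
```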

Cost Estimation

Running batch jobs on hundreds of thousands of rows can get expensive. BatchRunGPT provides a way to estimate costs before submitting a job.

Sample-Based Estimation

Pass estimate_cost=True when instantiating the class. This runs 30 real (non-batch) API calls on a random sample of your data, measures actual token usage, and extrapolates to the full dataset.

batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    gpt_model="gpt-5-mini",
    max_tokens=100,
    estimate_cost=True,       # triggers cost estimation
)

What happens under the hood:

  1. A random sample of 30 rows is drawn from the DataFrame.
  2. Each row is sent as a real (non-batch) API call to the selected model.
  3. The actual input and output token counts are recorded.
  4. Average tokens per row are multiplied by the total row count and the per-token cost for the selected model.
  5. The estimated input and output costs are printed:

Estimating job cost by running on sample to count tokens...
This job is estimated to cost $1.23 for input and $0.45 for output
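The extrapolation in step 4 is simple arithmetic. A sketch with made-up sample measurements (the real method gets its per-row averages from the live API calls described above; the numbers below are purely illustrative):

```python
# Hypothetical averages measured over the 30-row sample
avg_input_tokens = 405.0   # mean input tokens per sampled row
avg_output_tokens = 82.0   # mean output tokens per sampled row
total_rows = 250_000       # size of the full DataFrame

# gpt-5-mini Batch API prices per 1M tokens (from the pricing table)
input_price, output_price = 0.125, 1.00

est_input_cost = avg_input_tokens * total_rows / 1_000_000 * input_price
est_output_cost = avg_output_tokens * total_rows / 1_000_000 * output_price

print(f"This job is estimated to cost ${est_input_cost:.2f} for input "
      f"and ${est_output_cost:.2f} for output")
```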

How It Works Internally

Batch Lifecycle

For context, here is the full lifecycle of a batch job — cost estimation is the optional first step:

graph LR
    A["estimate_job_cost()"] -->|optional| B["create_files()"]
    B --> C["create_job()"]
    C --> D["get_object_information()"]
    D --> E{Status?}
    E -->|completed| F["retrieve_results()"]
    E -->|in_progress| D
    E -->|error| G["cancel_job()"]
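The monitor-then-retrieve loop in the diagram can be scripted. A minimal polling sketch, assuming get_object_information(i) returns a status string such as "completed" (an assumption about its return value; the docs only say it returns the status of a batch):

```python
import time

def wait_for_batch(batch_run, batch_num, poll_seconds=60):
    """Poll one batch until it reaches a terminal state.

    Returns True on completion, False on a failure state.
    Assumes get_object_information() returns a status string.
    """
    while True:
        status = batch_run.get_object_information(batch_num)
        if status == "completed":
            return True
        if status in ("failed", "expired", "cancelled"):
            return False
        time.sleep(poll_seconds)
```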

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame | required | Input data with id_col and prompt_col columns |
| system_prompt | str | required | System instructions for the model |
| id_col | str | required | Name of the unique identifier column |
| job_name | str | required | Descriptive job name (used for saving batch IDs) |
| gpt_model | str | "gpt-5-mini" | Model to use (gpt-5-mini or gpt-4o-mini) |
| max_tokens | int | 100 | Max output tokens per response |
| batch_files_dir | str | "batch_files" | Directory for local .jsonl batch files |
| file_name | str | None | Pickle file to reload batch IDs from a previous run |
| big_input | bool | False | Reduces batch size from 50k to 15k rows per file for large prompts |
| estimate_cost | bool | False | Run sample-based cost estimation before job creation |

Key Methods

| Method | Description |
|---|---|
| create_files() | Builds .jsonl files split into chunks of up to 50k rows |
| create_job() | Uploads files to OpenAI and creates batch objects |
| get_object_information(batch_num) | Returns the status of a specific batch |
| get_completion_time(batch_num) | Prints how long a completed batch took |
| retrieve_results(ignore_incomplete) | Downloads and merges results into the original DataFrame |
| cancel_job() | Cancels all submitted batches |
| get_running_batches() | Lists batches still in progress |
| estimate_job_cost() | Runs sample-based cost estimation (called automatically when estimate_cost=True) |
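For context on what create_files() produces: each line of a .jsonl batch file is one OpenAI Batch API request object. A sketch of a single line for gpt-4o-mini, following the Batch API's documented request shape (the field values are illustrative, and the exact body BatchRunGPT emits may differ):

```python
import json

# Illustrative values — in practice these come from one DataFrame row
row_id = "101"
system_prompt = "You are an industry classifier."
user_prompt = "Classify the industry of Acme Corp."

request_line = {
    "custom_id": row_id,            # lets results be mapped back to id_col
    "method": "POST",
    "url": "/v1/chat/completions",  # endpoint for gpt-4o-mini
    "body": {
        "model": "gpt-4o-mini",
        "max_tokens": 100,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    },
}

# One such JSON object per line, up to 50k lines per file (15k with big_input=True)
print(json.dumps(request_line))
```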

Resuming a Previous Job

If you need to check on or retrieve results from a previously submitted job, pass the saved pickle file name:

batch_run = BatchRunGPT(
    data=df,
    system_prompt=prompt,
    id_col="company_id",
    job_name="my_job",
    file_name="batch_ids_my_job.pkl",   # saved automatically during create_job()
)

# check status
batch_run.get_object_information(0)

# retrieve when done
results_df = batch_run.retrieve_results()
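Conceptually, the merge that retrieve_results() performs joins the model outputs back onto the original DataFrame by the id column. A sketch of the equivalent pandas operation (this is an illustration of the concept, not the actual implementation; the response column name is made up):

```python
import pandas as pd

# Original input and hypothetical downloaded results, keyed by the id column
df = pd.DataFrame({"company_id": [101, 102], "prompt_col": ["a", "b"]})
results = pd.DataFrame({"company_id": [101, 102], "response": ["Tech", "Finance"]})

# Left join keeps every input row, even ones with no result yet
results_df = df.merge(results, on="company_id", how="left")
```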