together-batch-inference | togetherai-skills

Stats

Actions

Tags

together-batch-inference | togetherai-skills

Together Batch Inference

Overview

Use Together AI's Batch API for large offline workloads where latency is not the primary concern.

Typical fits:

bulk classification
synthetic data generation
dataset transformations
large summarization or enrichment jobs
low-cost asynchronous inference

When This Skill Wins

The user has many independent requests to run
A JSONL request file is acceptable
Turnaround time can be minutes or hours instead of seconds
Lower cost matters more than immediate interactivity

Hand Off To Another Skill

Use together-chat-completions for real-time requests or tool-calling apps
Use together-evaluations for managed LLM-as-a-judge workflows
Use together-embeddings for retrieval-specific vector generation

Quick Routing

End-to-end batch workflow
- Start with scripts/batch_workflow.py or scripts/batch_workflow.ts
Request format, status model, and result downloads
- Read references/api-reference.md
Operational guidance and batch sizing
- Read references/api-reference.md

Workflow

Build a JSONL file where each line contains custom_id and body.
Upload the file with purpose="batch-api".
Create the batch with input_file_id=... and the target endpoint.
Poll until the job is terminal.
Download output and error files, then reconcile by custom_id.

High-Signal Rules

Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
Use input_file_id, not legacy file parameters.
Keep custom_id stable and meaningful so result reconciliation is easy.
Batch is for independent requests. If the workload depends on shared conversation state, it is probably the wrong tool.
Always inspect the error file in addition to the success output.
client.batches.create() returns a wrapper; access the batch object via response.job (e.g., response.job.id). client.batches.retrieve() returns the batch object directly.
For classification or labeling workloads, set max_tokens low (e.g., 4), use temperature: 0, and constrain the system prompt to return only the label. This minimizes output tokens and cost.
Small batches (under 1K requests) typically complete in minutes. The 24-hour completion window is a maximum, not typical.

Resource Map

API reference and operational guidance: references/api-reference.md
Python workflow: scripts/batch_workflow.py
TypeScript workflow: scripts/batch_workflow.ts

Official Docs