together-chat-completions | togetherai-skills

Stats

Actions

Tags

together-chat-completions | togetherai-skills

Together Chat Completions

Overview

Use Together AI's serverless chat/completions API for interactive inference workloads:

basic text generation
streaming responses
multi-turn chat state
tool and function calling
structured outputs
reasoning-capable models

Treat this skill as the default entry point for Together AI text generation unless the task is clearly offline batch processing, vector retrieval, model training, or infrastructure management.

When This Skill Wins

Build a chatbot, assistant, or text-generation endpoint on Together AI
Add streaming output to a real-time user experience
Implement tool calling or function-calling loops
Constrain model output to JSON or a regex-defined shape
Choose between standard chat models and reasoning models
Debug request parameters, model behavior, or response shapes

Hand Off To Another Skill

Use together-batch-inference for large offline runs, backfills, or lower-cost asynchronous jobs
Use together-embeddings for vector search, semantic retrieval, or reranking
Use together-fine-tuning when the user wants to train or adapt a model
Use together-dedicated-endpoints when the user needs always-on single-tenant hosting
Use together-dedicated-containers or together-gpu-clusters for custom infrastructure

Quick Routing

Workflow

Confirm that the workload is interactive serverless inference rather than batch, retrieval, or training.
Pick the smallest model that satisfies latency, quality, and context requirements.
Decide whether the job needs plain text, tools, structured output, or reasoning.
Start from the matching script instead of re-deriving request shapes from scratch.
Pull deeper details from the relevant reference file only when needed.

High-Signal Rules

Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
Use client.chat.completions.create() for Python and client.chat.completions.create() for TypeScript.
Preserve full messages history for multi-turn conversations; do not rebuild context from final text only.
For tools, implement the full loop: model tool call -> execute tool -> append tool result -> second model call.
Prefer json_schema over looser JSON modes when the user needs stable machine-readable output.
Use reasoning models only when the task benefits from deeper deliberation; otherwise prefer cheaper standard models.
To combine tool calling with structured output, use a two-phase approach: Phase 1 sends tools (no response_format), Phase 2 sends response_format (no tools) after tool results are appended.
Streaming works with response_format; accumulate chunks and parse the final concatenated string as JSON.
If the user needs many independent requests, combine this skill with async_parallel.py or hand off to batch inference.

Resource Map

Parameters and response fields: references/api-parameters.md
OpenAI compatibility, rate-limit headers, and debug headers: references/api-parameters.md
Function-calling patterns: references/function-calling-patterns.md
Structured outputs: references/structured-outputs.md
Reasoning models: references/reasoning-models.md
Model catalog: references/models.md

Scripts

scripts/chat_basic.py and scripts/chat_basic.ts: basic chat, streaming, and multi-turn state
scripts/debug_headers.py and scripts/debug_headers.ts: raw-response inspection for routing, latency, and rate-limit headers
scripts/async_parallel.py: async Python fan-out for independent requests
scripts/tool_call_loop.py and scripts/tool_call_loop.ts: full tool-call loop
scripts/structured_outputs.py and scripts/structured_outputs.ts: schema-guided and regex outputs
scripts/reasoning_models.py and scripts/reasoning_models.ts: reasoning fields, effort, and hybrid toggles

Official Docs