Quickstart for AI agents

This page is written for AI agents and automated systems that call inference.club programmatically — not for humans clicking through a dashboard. Skip anything that isn't API-relevant.

If you're a human developer: see Quickstart.


Credentials

Base URL:  https://api.inference.club/v1
Auth:      Authorization: Bearer <api-key>

One API key covers all modalities and all directions (inference + provider heartbeats). Get one at https://inference.club/dashboard/settings/token.
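
As a minimal sketch of the credentials above, the helper below builds the auth header once so later calls can reuse it. The environment variable name `INFERENCE_CLUB_API_KEY` is a hypothetical choice for illustration; the API does not define it.

```python
import os

BASE_URL = "https://api.inference.club/v1"


def auth_headers(api_key: str) -> dict:
    """Return the Authorization header described in the Credentials section."""
    if not api_key:
        raise ValueError("API key is empty")
    return {"Authorization": f"Bearer {api_key}"}


# Hypothetical usage: read the key from an environment variable of your choosing.
# headers = auth_headers(os.environ["INFERENCE_CLUB_API_KEY"])
```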


Step 1: discover available models

Before making any inference call, list what's reachable. Models are per-user — you only see models served by providers on your account (or shared providers your account can access).

curl https://api.inference.club/v1/models \
  -H "Authorization: Bearer $KEY"

Each entry includes capability fields beyond the OpenAI baseline:

{
  "id": "qwen3-8b",
  "object": "model",
  "owned_by": "home-rig",
  "service_type": "llm",
  "input_modalities": ["text"],
  "output_modalities": ["text"],
  "supported_features": ["streaming"],
  "context_length": 32768
}

Select a model by task

Use service_type to match a model to the task you want to run:

| service_type | Use for | Endpoint |
|---|---|---|
| llm | Text generation, reasoning, JSON extraction | /v1/chat/completions |
| stt | Transcription, audio understanding | /v1/audio/transcriptions |
| tts | Speech synthesis | /v1/audio/speech |
| image | Image generation, editing | /v1/images/generations |
| music | Music generation | /v1/music/generations |
| video | Video generation | /v1/videos/generations |
| mesh | 3D generation | /v1/3d/generations |

Check supported_features for capability-gated behaviors:

| Feature | Meaning |
|---|---|
| streaming | Model supports stream: true on chat/completions |
| timestamps | STT model returns word/segment timings with verbose_json |
| voice-cloning | TTS model supports /v1/voice/generations with audio prompt cloning |
| tool-use | LLM supports tool/function calling |
| vision | LLM accepts image inputs in messages |
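
The selection rule described above (match on service_type, then check supported_features) could be sketched as a small filter. `pick_model` is a hypothetical helper, and it assumes you pass in the list of model entries already extracted from the /v1/models response; the exact response wrapper is not shown on this page.

```python
from typing import Optional


def pick_model(models: list, service_type: str, features: tuple = ()) -> Optional[str]:
    """Return the id of the first model with the given service_type and all features."""
    for model in models:
        if model.get("service_type") != service_type:
            continue
        supported = model.get("supported_features", [])
        if all(feature in supported for feature in features):
            return model["id"]
    return None  # nothing reachable matches; the caller decides what to do next


models = [
    {"id": "qwen3-8b", "service_type": "llm", "supported_features": ["streaming"]},
    {"id": "kokoro", "service_type": "tts", "supported_features": []},
]
chat_model = pick_model(models, "llm", features=("streaming",))
```

Returning None instead of raising leaves the caller free to wait and list models again, since providers can come online later.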

Step 2: make a request

Request shapes match the OpenAI API. Swap the base URL and key; nothing else changes.

Chat (LLM)

from openai import OpenAI

client = OpenAI(base_url="https://api.inference.club/v1", api_key=KEY)

resp = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
)
text = resp.choices[0].message.content

Image generation

img = client.images.generate(model="flux-dev", prompt="a glowing crystal cave")
url = img.data[0].url

Speech-to-text

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(model="qwen3-asr", file=f)
text = result.text

Text-to-speech

with client.audio.speech.with_streaming_response.create(
    model="kokoro", input="Hello world", voice="af_heart"
) as r:
    r.stream_to_file("out.wav")

Step 3: handle errors

All errors use OpenAI's envelope shape:

{ "error": { "message": "...", "type": "no_provider" } }

| HTTP | type | What to do |
|---|---|---|
| 404 | no_provider | No online provider serves that model. Check /v1/models and pick a different one, or wait and retry. |
| 502 | upstream_error | The provider's local server failed. Retry with backoff, or pick a different model. |
| 401 | | Invalid or missing API key. |
| 429 | | Rate limited. Back off. |
| 503 | async_disabled | Async is not enabled on this server. Fall back to synchronous. |

Retry pattern for no_provider:

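import time

import requests

# url, body, and auth are placeholders: your endpoint URL, JSON payload, and auth headers.
for attempt in range(3):
    resp = requests.post(url, json=body, headers=auth)
    if resp.status_code == 404 and resp.json().get("error", {}).get("type") == "no_provider":
        # Linear backoff: wait 5 s, 10 s, then 15 s before the next attempt.
        time.sleep(5 * (attempt + 1))
        continue
    break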
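
The full error table can also be encoded as a lookup so an agent decides its next move in one place. This is only a sketch: the action labels (`"retry_or_switch_model"` and so on) are hypothetical names invented for illustration, not values returned by the API.

```python
from typing import Optional


def next_action(status: int, error_type: Optional[str] = None) -> str:
    """Map an HTTP status and error type to a coarse recovery action."""
    if status == 404 and error_type == "no_provider":
        return "retry_or_switch_model"
    if status == 502 and error_type == "upstream_error":
        return "retry_with_backoff"
    if status == 401:
        return "fix_credentials"
    if status == 429:
        return "back_off"
    if status == 503 and error_type == "async_disabled":
        return "fallback_sync"
    return "raise"  # anything not in the table is surfaced to the caller


action = next_action(503, "async_disabled")
```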

Step 4: async for long-running work

Add "async": true to any JSON-bodied request to get a 202 with a job id instead of waiting:

import requests, time

resp = requests.post(
    "https://api.inference.club/v1/videos/generations",
    headers={"Authorization": f"Bearer {KEY}"},
    json={"model": "ltx-2", "prompt": "a timelapse sunrise", "async": True},
)
job_id = resp.json()["id"]

# Poll until done
while True:
    job = requests.get(
        f"https://api.inference.club/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {KEY}"},
    ).json()
    if job["status"] in ("PROCESSED", "FAILED", "CANCELED"):
        break
    time.sleep(3)

print(job.get("result_url"))
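
The loop above polls forever if a job never reaches a terminal status. A possible variant with a deadline is sketched below; `wait_for_job` and its parameters are hypothetical, and the fetch step is passed in as a callable so the helper stays independent of any HTTP library.

```python
import time
from typing import Callable

TERMINAL = {"PROCESSED", "FAILED", "CANCELED"}  # terminal statuses used in the loop above


def wait_for_job(
    fetch: Callable[[], dict],
    timeout: float = 600.0,
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() until the job is terminal, or raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        sleep(interval)


# Hypothetical usage with the request from the example above:
# job = wait_for_job(lambda: requests.get(job_url, headers=headers).json())
```

Injecting `sleep` and `clock` is a design choice that makes the helper testable without real waiting.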

Use idempotency keys to deduplicate retries:

headers = {
    "Authorization": f"Bearer {KEY}",
    "Idempotency-Key": "my-unique-request-id-abc123",
}
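
One way to keep a key stable across retries of the same request is to derive it from the request itself. The sketch below hashes the path and a canonical form of the JSON body; `idempotency_key` and the `req-` prefix are hypothetical, and it assumes the server accepts any opaque string as a key, which this page does not specify.

```python
import hashlib
import json


def idempotency_key(path: str, body: dict) -> str:
    """Derive a deterministic key from the endpoint path and JSON body."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{path}\n{canonical}".encode("utf-8")).hexdigest()
    return f"req-{digest[:32]}"


key = idempotency_key(
    "/v1/videos/generations",
    {"model": "ltx-2", "prompt": "a timelapse sunrise", "async": True},
)
```

Note the trade-off: two intentionally separate requests with identical bodies would share a key under this scheme, so add a caller-supplied nonce to the body or path argument when you want distinct jobs.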

Endpoints that support async: chat/completions, completions, images/generations, videos/generations, music/generations, audio/speech.


Step 5: workflows (multi-step pipelines)

Workflows let an agent define a DAG of inference steps — fan out, transform data, chain modalities. Start from a curated template or write an inline spec.

Use a template

resp = requests.post(
    "https://api.inference.club/v1/workflows/runs",
    headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"},
    json={
        "template": "illustrated-story",
        "inputs": {
            "topic": "a robot learning to paint",
            "scenes": 3,
            "style": "oil painting, warm light",
        },
    },
)
run_id = resp.json()["id"]

Write an inline spec

spec = {
    "steps": [
        {
            "id": "plan",
            "kind": "inference",
            "type": "chat",          # modality: llm
            "extract": "json",        # parse LLM output as JSON
            "title": "Plan sections",
            "body": {
                "messages": [{
                    "role": "user",
                    "content": (
                        "List 4 blog section titles about {{inputs.topic}}. "
                        "Return JSON: {\"titles\":[\"...\"]}"
                    ),
                }],
            },
        },
        {
            "id": "images",
            "kind": "map",            # fan-out: one job per item
            "type": "image",
            "title": "Illustrate each section",
            "over": "{{steps.plan.output.titles}}",
            "body": {"prompt": "Blog illustration for: {{item}}"},
        },
    ]
}

resp = requests.post(
    "https://api.inference.club/v1/workflows/runs",
    headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"},
    json={"spec": spec, "inputs": {"topic": "distributed AI"}},
)
run_id = resp.json()["id"]
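
Because steps reference each other through `{{steps.<id>...}}` placeholders, a typo in a step id is easy to make. The sketch below is a hypothetical client-side check, not part of the API: it flags duplicate ids and placeholders that point at a step the spec never defines. It assumes step ids consist of letters, digits, underscores, and hyphens.

```python
import re

# Matches the step id inside placeholders such as {{steps.plan.output.titles}}.
STEP_REF = re.compile(r"\{\{\s*steps\.([A-Za-z0-9_-]+)\.")


def _strings(value):
    """Yield every string nested anywhere inside dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def lint_spec(spec: dict) -> list:
    """Return human-readable problems found in an inline workflow spec."""
    steps = spec.get("steps", [])
    ids = [step.get("id") for step in steps]
    duplicates = sorted({i for i in ids if ids.count(i) > 1}, key=str)
    problems = [f"duplicate step id: {i!r}" for i in duplicates]
    known = set(ids)
    for step in steps:
        for text in _strings(step):
            for ref in STEP_REF.findall(text):
                if ref not in known:
                    problems.append(f"step {step.get('id')!r} references undefined step {ref!r}")
    return problems
```

Calling `lint_spec(spec)` before posting to /v1/workflows/runs returns an empty list when every reference resolves.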

Poll a workflow run

while True:
    run = requests.get(
        f"https://api.inference.club/v1/workflows/runs/{run_id}",
        headers={"Authorization": f"Bearer {KEY}"},
    ).json()
    if run["status"] in ("DONE", "FAILED", "CANCELED"):
        break
    if run["status"] == "AWAITING":
        # A gate step is waiting for human approval — handle in your UI
        break
    time.sleep(5)

# Collect outputs
for step in run["steps"]:
    if step["status"] == "DONE":
        print(step["step_id"], step.get("output"))
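
When the loop exits on AWAITING, the gate can be resolved through the approve/reject endpoints listed in the reference table. The helpers below are a sketch: they assume each step entry carries its own "AWAITING" status (the page only shows that status at the run level), and the request body for approve/reject, if any, is not documented here.

```python
from urllib.parse import quote

BASE_URL = "https://api.inference.club/v1"


def awaiting_steps(run: dict) -> list:
    """Return step ids whose status is AWAITING (assumed step-level status)."""
    return [s["step_id"] for s in run.get("steps", []) if s.get("status") == "AWAITING"]


def gate_url(run_id: str, step_id: str, decision: str) -> str:
    """Build the approve or reject URL for a gate step."""
    if decision not in ("approve", "reject"):
        raise ValueError("decision must be 'approve' or 'reject'")
    return (
        f"{BASE_URL}/workflows/runs/{quote(run_id, safe='')}"
        f"/steps/{quote(step_id, safe='')}/{decision}"
    )


# Hypothetical usage after the polling loop:
# for step_id in awaiting_steps(run):
#     requests.post(gate_url(run_id, step_id, "approve"), headers=headers)
```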

Available templates

List with GET /v1/workflows/templates. Current templates:

| key | What it does | Required inputs |
|---|---|---|
| illustrated-story | Write story → split into scenes → illustrate each (gate) | topic, scenes, style |
| image-variations | Brainstorm N prompts → render them all | subject, count, vibe |
| storyboard-to-video | Plan shots → first frames (gate) → animate to video | concept, shots |
| song-and-cover | Write lyrics+brief → track + cover art in parallel | theme, genre |
| narrated-explainer | Write script → TTS narrate each line | topic, lines |
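
A run can be checked against the required inputs above before it is submitted. The sketch below hardcodes the table as it appears on this page; `missing_inputs` is a hypothetical helper, and a live client would more likely read the list from GET /v1/workflows/templates, whose response shape is not shown here.

```python
# Required inputs per template, copied from the table above.
REQUIRED_INPUTS = {
    "illustrated-story": ("topic", "scenes", "style"),
    "image-variations": ("subject", "count", "vibe"),
    "storyboard-to-video": ("concept", "shots"),
    "song-and-cover": ("theme", "genre"),
    "narrated-explainer": ("topic", "lines"),
}


def missing_inputs(template: str, inputs: dict) -> list:
    """Return the required input names that `inputs` does not provide."""
    if template not in REQUIRED_INPUTS:
        raise ValueError(f"unknown template: {template}")
    return [name for name in REQUIRED_INPUTS[template] if name not in inputs]


gaps = missing_inputs("illustrated-story", {"topic": "a robot learning to paint"})
```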

Reference

Full endpoint table

| Method | Path | Notes |
|---|---|---|
| GET | /v1/models | List available models + capabilities |
| POST | /v1/chat/completions | LLM chat. Supports stream: true, async: true. |
| POST | /v1/completions | Legacy completion. Supports async: true. |
| POST | /v1/audio/transcriptions | STT. multipart/form-data. Synchronous only. |
| POST | /v1/audio/speech | TTS. Returns raw audio bytes. Supports async: true. |
| GET | /v1/audio/voices | ?model=<id>: list voices for a TTS model |
| POST | /v1/images/generations | Text-to-image. Supports async: true. |
| POST | /v1/images/edits | Image edit. multipart/form-data. Synchronous only. |
| POST | /v1/music/generations | Music gen. Returns raw audio bytes. Supports async: true. |
| POST | /v1/videos/generations | Video gen. Returns raw MP4. Supports async: true. |
| POST | /v1/voice/generations | Dia voice cloning. Synchronous only. |
| POST | /v1/3d/generations | 3D mesh gen. |
| GET | /v1/jobs | List async jobs (?status=, ?active=1, ?limit=) |
| GET | /v1/jobs/<id> | Job status + result |
| POST | /v1/jobs/<id>/cancel | Cancel a queued/processing job |
| POST | /v1/jobs/<id>/retry | Re-queue a failed/canceled job |
| POST | /v1/batches | Submit up to 256 requests as one batch |
| GET | /v1/batches | List batches |
| GET | /v1/batches/<id> | Batch status |
| POST | /v1/batches/<id>/cancel | Cancel all jobs in a batch |
| GET | /v1/workflows/templates | List curated templates |
| POST | /v1/workflows/runs | Start a workflow run |
| GET | /v1/workflows/runs | List runs |
| GET | /v1/workflows/runs/<id> | Run state + step outputs |
| POST | /v1/workflows/runs/<id>/steps/<step_id>/approve | Approve a gate step |
| POST | /v1/workflows/runs/<id>/steps/<step_id>/reject | Reject a gate step |
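
Since /v1/batches accepts at most 256 requests per batch, a larger workload has to be split first. The sketch below covers only the splitting; `chunk_requests` is a hypothetical helper, and the batch request body itself is not documented on this page, so it is left out.

```python
MAX_BATCH_SIZE = 256  # per-batch limit from the endpoint table above


def chunk_requests(items: list, limit: int = MAX_BATCH_SIZE) -> list:
    """Split items into consecutive chunks of at most `limit` entries."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [items[i:i + limit] for i in range(0, len(items), limit)]


chunks = chunk_requests(list(range(600)))
sizes = [len(chunk) for chunk in chunks]
```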

Sharing fields

Any JSON-bodied inference call accepts these extra fields (stripped before reaching a provider):

{ "visibility": "PUBLIC", "collection": "my-album" }

Visibility values: PUBLIC, UNLISTED (default), PRIVATE, SECRET.
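
A small wrapper can attach the sharing fields and reject a misspelled visibility value before the request is sent. `with_sharing` is a hypothetical helper; it copies the body rather than mutating it.

```python
from typing import Optional

VISIBILITIES = ("PUBLIC", "UNLISTED", "PRIVATE", "SECRET")


def with_sharing(body: dict, visibility: str = "UNLISTED", collection: Optional[str] = None) -> dict:
    """Return a copy of `body` with the sharing fields added."""
    if visibility not in VISIBILITIES:
        raise ValueError(f"visibility must be one of {VISIBILITIES}")
    shared = dict(body)
    shared["visibility"] = visibility
    if collection is not None:
        shared["collection"] = collection
    return shared


payload = with_sharing(
    {"model": "flux-dev", "prompt": "a glowing crystal cave"},
    visibility="PUBLIC",
    collection="my-album",
)
```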

Voice cloning request shape

{
  "input": "[S1] Hello!\n[S2] Hey there.",
  "speakers": { "S1": 12, "S2": 17 },
  "cfg_scale": 3.0,
  "temperature": 1.8,
  "seed": 42
}

Speaker IDs come from GET /api/inference/voice-samples/ (your voice library).
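
Every speaker tag in the input needs an entry in speakers, which is easy to verify locally. The sketch below assumes tags always take the form `[S<number>]`, as in the example; `missing_speakers` is a hypothetical helper.

```python
import re

SPEAKER_TAG = re.compile(r"\[(S\d+)\]")  # assumed tag format, e.g. [S1]


def missing_speakers(payload: dict) -> list:
    """Return speaker tags used in `input` that have no entry in `speakers`."""
    # dict.fromkeys keeps first-seen order while dropping repeated tags.
    tags = dict.fromkeys(SPEAKER_TAG.findall(payload.get("input", "")))
    speakers = payload.get("speakers", {})
    return [tag for tag in tags if tag not in speakers]


request_body = {
    "input": "[S1] Hello!\n[S2] Hey there.",
    "speakers": {"S1": 12, "S2": 17},
}
unmapped = missing_speakers(request_body)
```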