Audio transcriptions

Transcribe speech to text with any STT (speech-to-text / ASR) model on the network. The endpoint mirrors OpenAI's transcription API, so the official SDKs and curl work unchanged.

POST /v1/audio/transcriptions

This is a separate modality from the chat/completions LLM endpoints — it takes an audio file (multipart/form-data, not JSON) and returns text. Requests are routed only to services a provider has declared as type: stt, so a transcription never lands on a text model.

Request

multipart/form-data with these fields:

| Field | Required | Description |
| --- | --- | --- |
| file | yes | The audio file: wav, mp3, m4a, flac, ogg, or webm. Up to 25 MB. |
| model | yes | An STT model id from GET /v1/models (e.g. qwen/qwen3-asr-1.7b). |
| language | no | ISO-639-1 hint (e.g. en). Improves accuracy/latency when known. |
| prompt | no | Free-text hint: names, jargon, or context to bias decoding. |
| response_format | no | json (default) or verbose_json. See Timestamps. |
| timestamp_granularities[] | no | word and/or segment. Requires verbose_json and a model that supports it. |
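
The field rules above can be checked client-side before anything is uploaded. The sketch below is only an illustration: `check_audio_file` and `build_transcription_fields` are hypothetical helper names, not part of any SDK. It assumes the file suffix reflects the real audio format, reads "25 MB" as 25 * 1024 * 1024 bytes, and assumes the array field is sent by repeating `timestamp_granularities[]` once per value, as OpenAI-style multipart requests usually do.

```python
from pathlib import Path

# Limits taken from the request fields; reading "25 MB" as 25 MiB is an assumption.
MAX_BYTES = 25 * 1024 * 1024
ALLOWED_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
ALLOWED_GRANULARITIES = {"word", "segment"}


def check_audio_file(name, size_bytes):
    """Raise ValueError if the file falls outside the documented limits."""
    if Path(name).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {name}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes, over the 25 MB cap")


def build_transcription_fields(model, language=None, prompt=None,
                               response_format="json", granularities=()):
    """Return the non-file form fields as a list of (name, value) pairs."""
    if response_format not in ("json", "verbose_json"):
        raise ValueError("response_format must be json or verbose_json")
    unknown = set(granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(f"unknown granularities: {sorted(unknown)}")
    if granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities[] requires verbose_json")

    fields = [("model", model), ("response_format", response_format)]
    if language:
        fields.append(("language", language))
    if prompt:
        fields.append(("prompt", prompt))
    # Assumed encoding: the field name is repeated once per granularity.
    fields += [("timestamp_granularities[]", g) for g in granularities]
    return fields
```

With the openai SDK the same values map onto the `language`, `prompt`, `response_format`, and `timestamp_granularities` keyword arguments of `client.audio.transcriptions.create`, so a check like this would run just before that call.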

curl

curl https://api.inference.club/v1/audio/transcriptions \
  -H "Authorization: Bearer $INFERENCE_CLUB_API_KEY" \
  -F file=@audio.wav \
  -F model=qwen/qwen3-asr-1.7b

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.club/v1",
    api_key="<your-api-key>",
)

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="qwen/qwen3-asr-1.7b",
        file=f,
    )
print(result.text)

Response

Default (response_format=json):

{
  "text": "Hey, this is a demo of the new model…",
  "usage": { "type": "duration", "seconds": 10 }
}

usage.seconds is the audio duration — the metering unit for speech, the way token counts are for text. It's recorded on every transcription request.
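
Since each response carries `usage.seconds`, client-side accounting can be a plain sum. A minimal sketch, assuming the responses have already been parsed into dicts; `total_audio_seconds` is a hypothetical helper name:

```python
import json


def total_audio_seconds(responses):
    """Sum usage.seconds over parsed transcription responses."""
    total = 0.0
    for body in responses:
        usage = body.get("usage") or {}
        # Count only usage blocks metered by duration, as in the default response.
        if usage.get("type") == "duration":
            total += float(usage.get("seconds", 0))
    return total


sample = json.loads('{"text": "Hey", "usage": {"type": "duration", "seconds": 10}}')
print(total_audio_seconds([sample, sample]))  # 20.0
```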

Timestamps

When you ask for response_format=verbose_json with timestamp_granularities[], models that support it return word- and segment-level timings:

{
  "text": "Hello world",
  "language": "en",
  "duration": 1.2,
  "segments": [{ "id": 0, "start": 0.0, "end": 1.2, "text": "Hello world" }],
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.5 },
    { "word": "world", "start": 0.6, "end": 1.2 }
  ]
}
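
Word timings like these are enough to build a simple timed transcript or a seek lookup. The sketch below works on the parsed response as a dict; `format_words` and `word_at` are hypothetical helper names, and a missing `words` key is treated as an empty list.

```python
def format_words(payload):
    """Render each word as '[start-end] word', one per line."""
    lines = []
    for w in payload.get("words", []):
        lines.append(f"[{w['start']:.2f}-{w['end']:.2f}] {w['word']}")
    return "\n".join(lines)


def word_at(payload, t):
    """Return the word whose interval contains time t (seconds), or None."""
    for w in payload.get("words", []):
        if w["start"] <= t <= w["end"]:
            return w["word"]
    return None
```

For the response above, `word_at(payload, 0.7)` returns "world", while a time in the gap between the two words (such as 0.55) returns None.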

Not every deployment supports timestamps. Whether word/segment timings are available depends on how the provider serves the model, not just the model id — e.g. Qwen3-ASR returns timestamps only when launched with its ForcedAligner (plain vllm serve rejects verbose_json). So the capability is declared by the operator in their agent manifest (services[].features: [timestamps]) and surfaced as the timestamps entry in a model's supported_features on /v1/models.
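
A client can check that flag before asking for timings. The sketch below assumes GET /v1/models returns an OpenAI-style list envelope (`{"data": [...]}`) in which each entry has an `id` and a `supported_features` list; only the `timestamps` entry in `supported_features` is described on this page, and the rest of the shape is a guess. `supports_timestamps` is a hypothetical helper name.

```python
def supports_timestamps(models_payload, model_id):
    """Return True if the model entry lists 'timestamps' in supported_features."""
    # Assumed envelope: {"data": [{"id": ..., "supported_features": [...]}]}
    for model in models_payload.get("data", []):
        if model.get("id") == model_id:
            return "timestamps" in (model.get("supported_features") or [])
    # Unknown model ids are treated as not timestamp-capable.
    return False
```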

When a model isn't declared timestamp-capable, inference.club automatically downgrades a verbose_json request to plain json, so you get a clean transcript rather than an upstream error, and never fabricated timings. The in-dashboard Transcription playground offers the timestamp toggle, and renders the interactive click-to-seek transcript, only when the selected model actually supports it.
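
Because of that downgrade, code that requests verbose_json should treat timings as optional. A minimal sketch on the parsed response dict; `has_timings` and `transcript_lines` are hypothetical helper names:

```python
def has_timings(body):
    """True if the response carries word- or segment-level timings."""
    return bool(body.get("words") or body.get("segments"))


def transcript_lines(body):
    """Return one line per segment when timings exist, else the plain text."""
    if body.get("segments"):
        return [f"{s['start']:.1f}s  {s['text']}" for s in body["segments"]]
    # Downgraded (plain json) responses only carry the transcript text.
    return [body.get("text", "")]
```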

Errors

| type | When | HTTP |
| --- | --- | --- |
| missing_file | No file field in the request | 400 |
| file_too_large | Audio exceeds the 25 MB cap | 413 |
| unsupported_media_type | The file's content-type isn't an accepted audio type | 415 |
| no_provider | No online STT provider serves the requested model for you | 404 |
| upstream_error | The provider's local ASR server failed or didn't respond | 502 |
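
One way a client might act on these error types is sketched below. The shape of the error body is not specified here, so the code assumes either an OpenAI-style `{"error": {"type": ...}}` envelope or a flat `{"type": ...}` object. Treating only upstream_error as retryable is a judgment call rather than documented behavior, and `error_type` and `should_retry` are hypothetical helper names.

```python
# Error types and HTTP statuses as documented for this endpoint.
DOCUMENTED_ERRORS = {
    "missing_file": 400,
    "file_too_large": 413,
    "unsupported_media_type": 415,
    "no_provider": 404,
    "upstream_error": 502,
}


def error_type(body):
    """Extract the error type from a parsed error body, or None.

    Accepts {"error": {"type": ...}} or a flat {"type": ...};
    both shapes are assumptions.
    """
    if not isinstance(body, dict):
        return None
    err = body.get("error", body)
    return err.get("type") if isinstance(err, dict) else None


def should_retry(status, body):
    """Retry only when the provider's ASR server failed (502 upstream_error)."""
    return (status == DOCUMENTED_ERRORS["upstream_error"]
            and error_type(body) == "upstream_error")
```

The other four errors describe problems with the request itself or with provider availability, so resending the same request immediately is unlikely to help.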

Not supported (yet)

Translations (/v1/audio/translations) and streaming transcription are not available. The response is always a single buffered JSON body.