How it fits together
Home GPUs, one private network.
Drop-in for the OpenAI SDK.
Sign up, mint a token, point your client at api.inference.club/v1. Every model on the network is one request away.
export OPENAI_API_KEY=ic_xxxxxxxxxxxxxxxxxxxx
export OPENAI_BASE_URL=https://api.inference.club/v1
curl "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-27b",
    "messages": [
      {"role": "user", "content": "explain MoE in one sentence"}
    ]
  }'
Wrap any OpenAI-compatible server.
Already running vLLM, llama.cpp, or Ollama? Point the agent at it and you're a node on the network.
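Before starting the agent, it can help to confirm that your local server actually answers OpenAI-style requests. Since the agent advertises whatever models the server is hosting, a server that isn't responding leaves it with nothing to offer. This is a minimal sketch that assumes the server exposes the standard GET /v1/models route, which vLLM, llama.cpp's llama-server, and Ollama's OpenAI-compatible API all provide. check_local_server is a hypothetical helper and is not part of inference-club-agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of inference-club-agent): confirm that a local
# OpenAI-compatible server answers GET /models before you start the agent.
# Default ports: vLLM 8000, llama.cpp's llama-server 8080, Ollama 11434.
check_local_server() {
  local base_url="${1:-http://localhost:8000/v1}"
  local key="${2:-local-key}"
  # -f fails on HTTP errors, -sS stays quiet but still reports errors,
  # --max-time keeps it from hanging on an unresponsive port.
  if curl -fsS --max-time 5 \
      -H "Authorization: Bearer $key" \
      "$base_url/models" >/dev/null; then
    echo "ok: $base_url is serving"
  else
    echo "error: no OpenAI-compatible server at $base_url" >&2
    return 1
  fi
}

# Usage:
#   check_local_server http://localhost:8000/v1 "$OPENAI_API_KEY"
```

The check returns non-zero on failure, so you can chain it in front of the docker run command with &&.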
# Already running vLLM, llama.cpp, or Ollama on your GPU?
# Point the agent at it and join the network.
export INFERENCE_CLUB_API_KEY=ic_xxxxxxxxxxxxxxxxxxxx
export OPENAI_BASE_URL=http://localhost:8000/v1  # vLLM's default; llama-server uses :8080, Ollama :11434
export OPENAI_API_KEY=local-key # whatever your local server expects
# --network host lets the container reach a server listening on localhost (Linux hosts).
docker run --rm -d --name club-agent --network host \
  -e INFERENCE_CLUB_API_KEY \
  -e OPENAI_BASE_URL \
  -e OPENAI_API_KEY \
  ghcr.io/inference-club/inference-club-agent:latest
Architecture
Three pieces. Nothing magic.
Operators run agents
Members run inference-club-agent next to their local LLM server. The agent advertises whatever models the server is hosting.
Agents join the tailnet
Each agent receives a short-lived Tailscale key and joins our private mesh. No public endpoints. No port forwarding. Just WireGuard.
Consumers send requests
Calls to api.inference.club are routed to an online agent serving the requested model. Streaming works, and requests take a direct path to the agent.
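Because requests end up at each agent's OpenAI-compatible server, streaming should follow OpenAI's usual format. You set "stream": true, and the response arrives as Server-Sent Events, one data: line per chunk, ending with data: [DONE]. This is a sketch, assuming OPENAI_BASE_URL and OPENAI_API_KEY are exported as in the quickstart and that the gateway passes events through unchanged. chat_payload and stream_chat are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: stream a chat completion from the network.
# Assumes OPENAI_BASE_URL and OPENAI_API_KEY are exported as in the quickstart
# and that the gateway passes through OpenAI-style Server-Sent Events.

# Hypothetical helper: build the request body with "stream": true.
# Prompts containing double quotes or backslashes need real JSON escaping (e.g. jq).
chat_payload() {
  local model="$1" prompt="$2"
  printf '{"model": "%s", "stream": true, "messages": [{"role": "user", "content": "%s"}]}' \
    "$model" "$prompt"
}

# Hypothetical helper: -N turns off curl's output buffering, so each
# "data: {...}" chunk is printed as soon as it arrives.
stream_chat() {
  curl -N -sS "$OPENAI_BASE_URL/chat/completions" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Usage:
#   stream_chat "qwen/qwen3.6-27b" "explain MoE in one sentence"
```

The payload builder uses printf so that the prompt is passed as an argument rather than spliced into the format string. This means a literal % in a prompt is safe.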
Why inference.club
Built for the way open models actually run.
OpenAI-compatible
Drop-in replacement. Swap the base URL and key — your existing SDKs and prompts just work.
Real GPUs, real models
Members serve open-weight models on their own hardware: Qwen, Llama, DeepSeek, Mistral, Gemma.
Private by default
Requests reach providers over Tailscale, end-to-end encrypted. No public endpoints to scrape.
A club, not a vendor
Pool compute with people you trust. Bring a node when you have spare cycles. Use the network when you need them.
From the blog
Featured
Inference wants to be distributed — and now NVIDIA agrees
Local models keep getting better, while the grid can't power new centralized data centers fast enough. Span and NVIDIA's new XFRA puts Blackwell GPUs inside homes to tap idle power. That is strong validation for distributing AI compute to the edge, which is exactly the bet inference.club is making.