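Live at api.inference.club

Powered by a distributed inference network built on consumer GPUs and Tailscale

One OpenAI-compatible endpoint, backed by GPUs that members plug into the network. Run the agent on your own hardware and use the whole compute pool with a single key.

How they work together

Home GPUs, one private network.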

inference.club is a Tailscale tailnet that joins consumer hardware — RTX PCs, the DGX Spark, Apple silicon — so members can safely expose their inference through one unified API, across the whole range of AI modalities: chat, images, video, speech, music, 3D.

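Featured generations

Born on the network

Real requests from members, generated on community-contributed hardware. Click any card to see the full request.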

disco funk vibes party

MUSIC

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1 1

“Type text, pick a voice, and synthesize natural speech.”

TTS

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1

summarize this article: AI Outperforms Law Professors in Stanford Law Study In a rigorous blind study, law professors overwhelmingly preferred AI-generated answers to student legal questions over answers written by fellow law professors—and flagged the AI answers as potentially m…

**Summary** A blind study led by Stanford Law professor Julian Nyarko found that law professors overwhelmingly preferred AI‑generated answers to contract‑law questions over answers written by their fellow professors. In a head‑to‑head comparison of nearly 3,000 anonymized respon…

LLM

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1 1

IMAGE

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1

MESH

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1

VIDEO

briancaffeyclub-host NVIDIA-GeForce-RTX-4090 +1

For users

Drop-in with the OpenAI SDK.

consumer

export OPENAI_API_KEY=ic_xxxxxxxxxxxxxxxxxxxx
export OPENAI_BASE_URL=https://api.inference.club/v1

curl $OPENAI_BASE_URL/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-27b",
    "messages": [
      {"role": "user", "content": "explain MoE in one sentence"}
    ]
  }'

from openai import OpenAI

client = OpenAI(
    api_key="ic_xxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.inference.club/v1",
)

resp = client.chat.completions.create(
    model="qwen/qwen3.6-27b",
    messages=[
        {"role": "user", "content": "explain MoE in one sentence"},
    ],
)
print(resp.choices[0].message.content)

import OpenAI from "openai"

const client = new OpenAI({
  apiKey: process.env.INFERENCE_CLUB_KEY,
  baseURL: "https://api.inference.club/v1",
})

const resp = await client.chat.completions.create({
  model: "qwen/qwen3.6-27b",
  messages: [
    { role: "user", content: "explain MoE in one sentence" },
  ],
})

console.log(resp.choices[0].message.content)

For providers

Wrap any OpenAI-compatible server.

Already running vLLM, llama.cpp, or Ollama? Point the agent at it and you become a node on the network.

provider

# Already running vLLM, llama.cpp, or Ollama on your GPU?
# Point the agent at it and join the network.

export INFERENCE_CLUB_API_KEY=ic_xxxxxxxxxxxxxxxxxxxx
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=local-key  # whatever your local server expects

docker run --rm -d --name club-agent --network host \
  -e INFERENCE_CLUB_API_KEY \
  -e OPENAI_BASE_URL \
  -e OPENAI_API_KEY \
  ghcr.io/inference-club/inference-club-agent:latest

# Or just run the static binary — no Docker required.

export INFERENCE_CLUB_API_KEY=ic_xxxxxxxxxxxxxxxxxxxx
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=local-key

./inference-club-agent

# The agent registers, joins the inference.club tailnet,
# advertises the models your local server is serving,
# and starts taking requests. That's it.
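The comments above say the agent advertises the models your local server is serving. One way that discovery could work, offered as a minimal sketch and not as the agent's actual implementation, is to query the local server's OpenAI-style `/models` route, which OpenAI-compatible servers such as vLLM, llama.cpp, and Ollama generally expose. The helper names `parse_model_ids` and `list_local_models` are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style `GET /models` response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_local_models(base_url: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Ask the local server which models it serves (hypothetical helper).

    `base_url` is expected to already include the `/v1` suffix, matching the
    OPENAI_BASE_URL value exported above.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

With the environment from the provider snippet, calling `list_local_models("http://localhost:8000/v1", "local-key")` would return whatever model ids the local server reports. Checking that list before starting the agent is a quick way to confirm what would be advertised.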

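Architecture

Three parts. No magic.

Follow a request from your code to the GPU and back. The cloud control plane authenticates, applies your privacy rules, and routes, but the model itself runs on hardware you own.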

Step 01

Operators run the agent

Members run inference-club-agent alongside their local LLM server. The agent advertises every model that server hosts.

Step 02

The agent joins the tailnet

Each agent receives a short-lived Tailscale key and joins our private mesh network. No public endpoints, no port forwarding, just WireGuard.

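Step 03

Users send requests

Calls to api.inference.club are routed to an online agent that serves the requested model. Streaming is supported, with direct-connection latency.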
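The page only states that streaming is supported. Assuming the endpoint follows the OpenAI streaming format (server-sent events, one `data: {json}` line per chunk, terminated by `data: [DONE]`), the text of a streamed chat completion could be reassembled roughly as follows. The function name `iter_content_deltas` is hypothetical, and this is a sketch of the wire format, not a description of how inference.club implements it.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

In practice the OpenAI SDK does this parsing for you when you pass `stream=True` to `client.chat.completions.create`; the sketch is mainly useful if you are reading the HTTP response yourself.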

Your application

curl · OpenAI SDK · the Playground · your agents

client

base_url = "https://api.inference.club/v1"
api_key = "ic_xxxxxxxxxxxxxxxxxxxx"

HTTPS · Authorization: Bearer ic_…

api.inference.club — the control plane

one small cloud VPS (Hetzner). It routes; it never runs the model.

cloud

Caddy

TLS · reverse proxy

Django + DRF

OpenAI-compatible /v1 router · auth · routing

Access control

visibility · per-service ACLs · kill switch

Celery workers

async jobs · batches · workflow DAG

Postgres + Redis

state · queue · throttling

GCS

images · video · voice · music

The inference.club tailnet

a private Tailscale mesh — pure WireGuard

SOCKS5 sidecar

MagicDNS · club-host-17:443

short-lived auth keys

no ports · no firewall holes

Your rig — where inference actually happens

a GPU you own, at home, on hardware you trust

operator

inference-club-agent

container · --network host

Joins the tailnet with its minted key, advertises models from agent.yaml, and forwards each request to whatever you already run locally:

vLLM · llama.cpp · Ollama · LM Studio · Dia · LTX-2

→ http://localhost:1234/v1

running on: brian's 4090, M3 Ultra · 192GB, DGX Spark, 2× 3090 rig, k3s home cluster

Follow one request

1. Your code calls api.inference.club/v1 with your ic_ key, the same request you'd send OpenAI.
2. Caddy terminates TLS; Django authenticates the key and applies your privacy + access rules.
3. The router picks a healthy, online node that actually serves the requested model.
4. Django (via a Tailscale SOCKS5 sidecar) dials the node by MagicDNS over WireGuard: no ports, no tunnels.
5. The agent container hands the request to your local LLM server on localhost.
6. Tokens (or images / video / audio) stream back along the exact same path.
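Step 3 of the walkthrough above can be illustrated with a small sketch. The page says only that the router picks a healthy, online node that serves the requested model; the `Node` fields, the `pick_node` name, and the least-loaded tie-break below are assumptions made for illustration, not the project's actual routing logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str                                        # MagicDNS hostname, e.g. "club-host-17"
    models: set[str] = field(default_factory=set)    # model ids the agent advertised
    online: bool = True
    healthy: bool = True
    inflight: int = 0                                # assumed load signal, not from the source


def pick_node(nodes: list[Node], model: str) -> Optional[Node]:
    """Return a healthy, online node serving `model`, or None if there is none."""
    candidates = [n for n in nodes if n.online and n.healthy and model in n.models]
    if not candidates:
        return None
    # Assumed policy: prefer the least-loaded node, breaking ties by name.
    return min(candidates, key=lambda n: (n.inflight, n.name))
```

Returning `None` when no node qualifies leaves it to the caller to decide how to report that the requested model is currently unavailable.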

In one breath

“inference.club is what happens when you point an OpenAI-compatible API at a pile of consumer GPUs you actually own and trust — a private Tailscale tailnet quietly stitching a 4090 here, an M3 Ultra there, a DGX Spark and a couple of 3090s into one WireGuard mesh with no ports forwarded and no firewall holes, where a little inference-club-agent container sits next to whatever you’re already running — vLLM, llama.cpp, Ollama, LM Studio — and advertises it through a manifest, while back in the cloud a Django + Celery server behind Caddy authenticates your ic_ key, enforces your privacy and per-service access controls, and routes the call over the tailnet by MagicDNS to a healthy online node, with Redis and Postgres driving async jobs, batches and a whole workflow DAG engine, GCS holding the images, video, voice and music that come back, a Nuxt playground and dashboard to poke at all of it, the home fleet itself migrating from Docker to k3s, and the entire thing — chat, images, LTX-2 video, Dia voice cloning, speech, the works — sitting behind one base URL you can curl, so go ahead and build something, and, as the prompt says: Make no mistakes.”

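Why inference.club

Built for how open-source models actually run.

OpenAI-compatible

A drop-in replacement. Swap in the base URL and key, and your existing SDKs and prompts just work.

Real GPUs, real models

Members serve open-weight models on their own hardware: Qwen, Llama, DeepSeek, Mistral, Gemma.

Private by default

Requests reach providers over Tailscale, encrypted end to end. No public endpoints to scrape.

A club, not a vendor

Share compute with people you trust. Plug in a node when you have spare capacity; use the whole network when you need it.

From the blog

All posts

Featured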

From docker sprawl to k3s: rebuilding my home inference fleet

A 'healthy' mesh-generation service sat wedged for three days while my agent.yaml described services that didn't exist. So I moved four GPU boxes — three RTX 4090s and a DGX Spark — onto k3s and taught the inference-club-agent to discover services from the Kubernetes API instead of a config file. Health checks lie; queues don't. Config is fiction; clusters are testimony.

#k3s #kubernetes #homelab #architecture #deep-dive

Read the post

Ready to plug in?

Sign in with GitHub, generate a key, and be live within a minute. Plug in a node whenever you have spare compute.

