Blog

Guides, news, and ideas on distributed, self-hosted LLM inference.

Latest EN

One app, three environments: self-hosting inference.club on my home cluster

July 9, 2026 · @briancaffey

inference.club is a hosted app, but it's also something you can run yourself — to catalog your own inference across LLM, image, video, music, voice, and 3D. Here's how the same Django + Nuxt app runs unchanged across a laptop, a cloud VPS, and a home k3s cluster, deployed by a fully self-hosted GitOps loop (Forgejo → Harbor → Argo CD) with no managed cloud in the path.

#homelab #k3s #gitops #self-hosting #architecture #deep-dive

June 26, 2026 · @briancaffey

Reading the web with my own AI: the extension, a month in EN

A couple of weeks ago I shipped a deliberately tiny reading copilot: a side panel that summarizes and answers questions about whatever you're reading, on your own cluster. Then I kept building. This is the tour: vision attachments, a speed reader, per-URL history, ten themes, an advanced mode that shows you exactly what was sent, and where it goes next.

#browser-extension #privacy #llm-inference

June 25, 2026 · @briancaffey

The debt we built: a state-of-the-codebase post EN

An honest accounting of the technical debt in inference.club after a year of shipping. Every new inference modality currently costs a copy-paste in three places — a backend view, a backend rerun runner, and a frontend page. Here's the architecture gap behind that, mirrored on both sides of the wire, and the sequenced plan to pay it down.

#engineering #refactoring #deep-dive #architecture

June 24, 2026 · @briancaffey

A reading copilot that runs on your own GPU EN

The inference.club web app is great until you're reading an article somewhere else and want it summarized — then you're back to copy-paste. So we built a browser extension: a side panel that summarizes and answers questions about whatever you're reading, extracted locally with the Firefox Reader Mode engine and sent only to your own cluster. No third-party AI reading what you read.

#browser-extension #privacy #llm-inference #announcements

June 12, 2026 · @briancaffey

From docker sprawl to k3s: rebuilding my home inference fleet EN

A 'healthy' mesh-generation service sat wedged for three days while my agent.yaml described services that didn't exist. So I moved four GPU boxes — three RTX 4090s and a DGX Spark — onto k3s and taught the inference-club-agent to discover services from the Kubernetes API instead of a config file. Health checks lie; queues don't. Config is fiction; clusters are testimony.

#k3s #kubernetes #homelab #architecture #deep-dive

May 30, 2026 · @briancaffey

Getting started with inference.club in five minutes EN

From GitHub sign-in to your first chat completion: a step-by-step quickstart for inference.club's OpenAI-compatible API, plus how to wire it into Open WebUI and the OpenAI SDK.

#quickstart #guide #llm-inference

May 30, 2026 · @briancaffey

Inference wants to be distributed — and now NVIDIA agrees EN

Local models keep getting better while the grid can't build centralized data centers fast enough. Span and NVIDIA's new XFRA puts Blackwell GPUs inside homes to tap idle power — strong validation for distributing AI compute to the edge, which is exactly the bet inference.club is making.

#distributed-inference #vision #industry

April 26, 2026 · @briancaffey

Idle GPUs, growing bills, and the case for a community LLM network EN

Why we're building inference.club — a self-hosted, OpenAI-compatible LLM inference network where you bring your own GPU or use someone else's, all behind one API key.

#announcements #vision #llm-inference

April 26, 2026 · @briancaffey

Putting your home GPU on the internet with Tailscale and tsnet EN

How inference.club uses an embedded Tailscale stack on the agent and a userspace sidecar on the server to route LLM traffic from a public OpenAI-compatible API into a private GPU on someone's home network — with no port forwarding, no public callback URLs, and no shared secrets per device.

#architecture #tailscale #networking #deep-dive