Run any Hugging Face model on your own GPU

Two commands. Zero config. No YAML. No CLI flags to memorise.

uv tool install inferhost
inferhost

That’s it. inferhost opens a friendly terminal UI. The first launch downloads llama.cpp and llama-swap for you with a progress bar. Then you press a, paste a Hugging Face repo id, and you have an OpenAI-compatible endpoint running on http://localhost:9090/v1.

[Screenshot: inferhost TUI dashboard]


The 60-second tour

1. Install

uv tool install inferhost

(Python 3.11+ on Linux or macOS. pipx install inferhost works too. See the Installation page for upgrade and uninstall steps.)

2. Launch

inferhost

On the very first run you’ll see a small progress screen while the runtime binaries download. After that, you land on the dashboard.

3. Add a model

Press a. Type a Hugging Face repo id, e.g.:

Qwen/Qwen2.5-7B-Instruct-GGUF

Press Enter. inferhost lists all available GGUF files in the repo, highlights the one that best fits your GPU with a ⭐, and shows a live progress bar while it downloads.
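
To give a feel for what "best fits your GPU" could mean, here is a minimal sketch of one plausible heuristic: pick the largest GGUF file that still fits in free VRAM after reserving some headroom for the KV cache and runtime buffers. This is an illustration only, not inferhost's actual selection logic. The names GgufFile and pick_best_fit, the 15% headroom, and the file names and sizes are all hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class GgufFile:
    """A GGUF file in a repo (hypothetical helper type)."""
    name: str
    size_bytes: int


def pick_best_fit(files, free_vram_bytes, headroom=0.15):
    """Return the largest file that fits in VRAM after reserving headroom.

    Returns None when no file fits. The headroom fraction is an assumed
    allowance for the KV cache and runtime buffers, not a measured value.
    """
    budget = free_vram_bytes * (1 - headroom)
    fitting = [f for f in files if f.size_bytes <= budget]
    return max(fitting, key=lambda f: f.size_bytes, default=None)


# Illustrative file names and sizes, not taken from a real repo listing.
candidates = [
    GgufFile("model-q2_k.gguf", 3 * GIB),
    GgufFile("model-q4_k_m.gguf", 5 * GIB),
    GgufFile("model-q8_0.gguf", 8 * GIB),
]

best = pick_best_fit(candidates, free_vram_bytes=8 * GIB)
print(best.name if best else "nothing fits")
```

Larger quantizations generally preserve more quality, which is why this sketch prefers the biggest file that fits rather than the smallest.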

4. Use it

The dashboard shows the OpenAI-compatible endpoint at the top. Point anything at it:

curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct-q4-k-m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="none")
resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct-q4-k-m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
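
If you would rather not install the SDK, the same request can be made with the Python standard library alone. The following is a rough sketch: the helper names build_chat_request, extract_reply, and chat are hypothetical, and it assumes the response follows the usual OpenAI chat-completions shape (choices[0].message.content), as the SDK example above implies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090/v1"  # endpoint shown at the top of the dashboard


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style JSON response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(model, prompt, timeout=120):
    """Send the request and return the reply text (needs inferhost running)."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(resp.read())
```

With a model loaded, chat("qwen2.5-7b-instruct-q4-k-m", "Hello!") should return the reply as a string. The generous timeout is there because the first request to a model may have to wait for it to load.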

Keys in the TUI

Key            What it does
a              Add a Hugging Face model (downloads the GGUF + any mmproj-*.gguf)
n              Rename the highlighted model’s alias; also rewrites the llama-swap and LiteLLM configs
c              Configure the highlighted model: per-model context window and KV cache quant
P              Pin the highlighted model, keeping it co-resident in VRAM with other pinned models
d or Delete    Remove the highlighted model
s              Start llama-swap
x              Stop llama-swap
r              Restart llama-swap
g              Toggle the LiteLLM gateway on/off
p              Open the Preferences panel (change ports, context, GPU layers, …)
R              Refresh the view
q              Quit

Architecture in one diagram

   Your app  ──HTTP──▶  llama-swap  ──spawns──▶  llama-server (llama.cpp)
                       (port 9090)              (GGUF inference)
                            ▲
                            │
                  (optional) LiteLLM gateway
                            │
                       (port 9001)
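
In client code, the diagram boils down to choosing one of two ports. The sketch below assumes the ports shown above are still in effect (they can be changed in Preferences) and that the LiteLLM gateway serves OpenAI-style routes under /v1 just as llama-swap does. That second point is an assumption, not something this page states. The function names are hypothetical.

```python
import socket

LLAMA_SWAP_PORT = 9090  # from the diagram
LITELLM_PORT = 9001     # from the diagram (optional gateway)


def base_url(use_gateway=False, host="localhost"):
    """Return the API base URL for llama-swap or the LiteLLM gateway.

    Assumes both services expose their API under /v1.
    """
    port = LITELLM_PORT if use_gateway else LLAMA_SWAP_PORT
    return f"http://{host}:{port}/v1"


def is_listening(port, host="localhost", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(base_url())                  # http://localhost:9090/v1
print(base_url(use_gateway=True))  # http://localhost:9001/v1
```

A client could call is_listening(LITELLM_PORT) first and fall back to the llama-swap URL when the gateway is toggled off.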

Where to next?