Troubleshooting

The TUI says “llama-swap stopped”

The daemon isn’t running. Press s in the TUI to start it. If it fails to start, check the log file at ~/.local/share/inferhost/logs/llama-swap.log.

Most common causes:

No models registered yet. Press a to add one first.
Port in use. Another process is on 9090. Set INFERHOST_SWAP_PORT=... in .env to a free port and restart.
Binary missing. Re-launch inferhost — it will redownload missing binaries on next start.

“Port 9090 is already in use”

Either:

Find the process: lsof -i :9090 (Linux/macOS) — and kill it if it’s another inferhost from earlier.
Or change the port in .env:

INFERHOST_SWAP_PORT=9099

Then re-launch inferhost.

`curl http://<lan-ip>:9090/...` doesn’t work

This is expected. In v0.5+, llama-swap binds 127.0.0.1 (loopback) only and is not reachable from the network by design. It is an internal component.

Use the LiteLLM gateway on port 9001 instead — that is the single user-facing endpoint:

curl http://<lan-ip>:9001/v1/chat/completions ...

If you need to change the gateway port, set INFERHOST_GATEWAY_PORT in .env.

The model fails to start when I make a request

Open the TUI and look at the Logs panel — that’s the live tail of llama-swap.log. The most common errors:

Log message	Fix
`failed to load model`	The GGUF file may be incomplete. Remove and re-add the model.
`out of memory` / `CUDA error: out of memory`	Pick a smaller quant for this model, or set `INFERHOST_GPU_LAYERS` to a smaller number to offload less to the GPU. You can also try a lighter `INFERHOST_KV_QUANT` value.
`flash attention not supported`	Set `INFERHOST_FLASH_ATTENTION=off` in `.env`.

Prebuilt llama-server doesn’t match my platform

inferhost ships prebuilt llama-server binaries for three targets: Linux x86_64 CUDA 12.x, Linux x86_64 CPU, and macOS arm64 Metal. If you run Vulkan, ROCm, an older CUDA, or a platform not in that list, the prebuilt binary may not work.

Use the INFERHOST_LLAMA_SERVER_PATH escape hatch to point inferhost at a compatible binary you build or obtain yourself:

# Example: ROCm build from source
cd ~/llama.cpp
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --target llama-server -j$(nproc)

# Tell inferhost to use it
export INFERHOST_LLAMA_SERVER_PATH=~/llama.cpp/build/bin/llama-server
inferhost

Or add it to .env so it persists:

INFERHOST_LLAMA_SERVER_PATH=/home/user/llama.cpp/build/bin/llama-server

When INFERHOST_LLAMA_SERVER_PATH is set, inferhost skips the binary download step entirely.

“Hugging Face repo not found”

Double-check the spelling. The repo id is the org/name shown at the top of the Hugging Face page, e.g. Qwen/Qwen2.5-7B-Instruct-GGUF. It must point to a repo containing GGUF files.

If the repo is gated or private, log in first:

huggingface-cli login

Then re-launch inferhost.

Download is slow

Hugging Face throttles unauthenticated downloads. Two fixes:

huggingface-cli login — authenticated downloads are faster.

Install hf_transfer:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
inferhost

The dashboard shows the wrong model name format

Names are derived from the repo id and the quant tag — they’re lowercase, dashes only. If you don’t like the auto-generated name, you can edit ~/.config/inferhost/models.toml directly and then press r to restart llama-swap.

I want to reset everything and start over

From the repo (development) directory:

./run.sh reset       # stops daemons and clears the registry (keeps GGUFs in HF cache)
./run.sh uninstall   # also removes the venv and the data dir

If you installed via pip install inferhost:

# Stop any daemons
pkill -f llama-swap || true
pkill -f litellm || true
# Wipe inferhost state (keeps the Hugging Face model cache)
rm -rf ~/.local/share/inferhost ~/.config/inferhost

My model isn’t on Hugging Face as GGUF

inferhost only supports GGUF (the format llama.cpp uses). If you have a model in safetensors / .bin, convert it first with llama.cpp’s conversion scripts, upload the GGUF to Hugging Face (or a local path), and then point inferhost at the repo.

I think I found a bug

Please open an issue on GitHub with:

The output of running python -c "import inferhost; print(inferhost.__version__)"
Your OS, Python version, and GPU
The relevant part of ~/.local/share/inferhost/logs/llama-swap.log

← Back to overview