Configuration
inferhost reads every setting from environment variables, or from a .env file in the directory you run it from. No YAML, no JSON, no config CLI.
.env example
Drop a .env file next to wherever you launch inferhost (or in your project root):
# Ports
INFERHOST_SWAP_PORT=9090
INFERHOST_GATEWAY_PORT=9001
# Where binaries, logs, and configs live
INFERHOST_DATA_DIR=~/.local/share/inferhost
INFERHOST_CONFIG_DIR=~/.config/inferhost
INFERHOST_HF_CACHE=~/.cache/huggingface
# Inference defaults
INFERHOST_GPU_LAYERS=99 # offload everything to GPU
INFERHOST_DEFAULT_CTX=8192
INFERHOST_FLASH_ATTENTION=on
INFERHOST_PARALLEL_SLOTS=1 # --parallel; 1 = serial requests per model
# Reasoning / "thinking" mode for capable models
INFERHOST_REASONING=auto # auto | on | off
INFERHOST_REASONING_BUDGET=-1 # token cap on thinking; -1 = unlimited, 0 = none
# Pin specific upstream releases (default: latest)
INFERHOST_LLAMACPP_VERSION=latest
INFERHOST_LLAMASWAP_VERSION=latest
# Force a GPU backend (default: auto-detect)
# INFERHOST_LLAMACPP_BACKEND=cuda
# Stacked speculative decoding (only applied to MTP-capable models).
# Set any value to 0 to disable that lane.
INFERHOST_SPEC_DRAFT_N_MAX=2 # MTP draft tokens per step
INFERHOST_SPEC_NGRAM_MOD_N_MATCH=24 # min matching length before ngram drafts
INFERHOST_SPEC_NGRAM_MOD_N_MIN=48 # min context window to search back through
INFERHOST_SPEC_NGRAM_MOD_N_MAX=64 # max ngram draft tokens on a strong match
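To make the lookup concrete, here is a minimal Python sketch of how settings like these could be resolved. It is not inferhost's actual loader: the `parse_dotenv` and `load_settings` helpers are hypothetical, only a few defaults are shown, and the precedence (real environment variables override .env entries, which override defaults) is an assumption rather than documented behavior.

```python
import os
from pathlib import Path

# A handful of defaults copied from the example above (not the full set).
DEFAULTS = {
    "INFERHOST_SWAP_PORT": "9090",
    "INFERHOST_GATEWAY_PORT": "9001",
    "INFERHOST_DATA_DIR": "~/.local/share/inferhost",
    "INFERHOST_GPU_LAYERS": "99",
    "INFERHOST_REASONING": "auto",
}


def parse_dotenv(text):
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat " #" as the start of an inline comment, as in the example above.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def load_settings(dotenv_text="", environ=None):
    """Merge defaults, .env entries, and environment variables."""
    environ = os.environ if environ is None else environ
    merged = dict(DEFAULTS)
    merged.update(parse_dotenv(dotenv_text))
    # Assumption: real environment variables win over .env entries.
    merged.update({k: v for k, v in environ.items() if k.startswith("INFERHOST_")})
    return merged


def data_dir(settings):
    """Expand the leading ~, which a .env file does not expand on its own."""
    return Path(settings["INFERHOST_DATA_DIR"]).expanduser()


settings = load_settings("INFERHOST_SWAP_PORT=9191 # custom port\n", environ={})
print(settings["INFERHOST_SWAP_PORT"], data_dir(settings))
```

Passing `environ={}` keeps the example independent of whatever is set in your shell; leave it out to read the real environment.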
Full reference
| Variable | Default | What it does |
|---|---|---|
| INFERHOST_SWAP_PORT | 9090 | The user-facing OpenAI-compatible endpoint port (llama-swap). |
| INFERHOST_GATEWAY_PORT | 9001 | LiteLLM gateway port (when the gateway extra is installed). |
| INFERHOST_DATA_DIR | ~/.local/share/inferhost | Where downloaded binaries, logs, and PID files live. |
| INFERHOST_CONFIG_DIR | ~/.config/inferhost | Where the generated llama-swap.yaml and the model registry live. |
| INFERHOST_HF_CACHE | ~/.cache/huggingface | Hugging Face model cache root. |
| INFERHOST_GPU_LAYERS | 99 | The -ngl flag passed to llama-server (number of layers offloaded to GPU). 99 ≈ “everything that fits”. |
| INFERHOST_DEFAULT_CTX | 8192 | Default context length for newly added models. |
| INFERHOST_FLASH_ATTENTION | on | Pass -fa to llama-server. Set to off if your GPU doesn’t support it. |
| INFERHOST_PARALLEL_SLOTS | 1 | Pass --parallel <n> to llama-server. Each slot can handle one in-flight request on the same model. Keep at 1 unless you actually need concurrency. |
| INFERHOST_REASONING | auto | --reasoning flag for thinking-capable models (DeepSeek, Qwen3-Thinking, GPT-OSS, …). auto lets the model decide, on forces thinking, off suppresses it. |
| INFERHOST_REASONING_BUDGET | -1 | --reasoning-budget: token cap on thinking. -1 = unlimited, 0 = none, positive = hard cut-off. |
| INFERHOST_LLAMACPP_BACKEND | auto | Force the backend: vulkan, cuda, rocm, sycl, openvino, or cpu. |
| INFERHOST_LLAMACPP_VERSION | latest | Pin a specific llama.cpp release tag. |
| INFERHOST_LLAMASWAP_VERSION | latest | Pin a specific llama-swap release tag. |
| INFERHOST_SPEC_DRAFT_N_MAX | 2 | MTP draft tokens per step (--spec-draft-n-max). Only applied to models with mtp in the filename. Set to 0 to disable the MTP lane. |
| INFERHOST_SPEC_NGRAM_MOD_N_MATCH | 24 | Min matching sequence length before ngram-mod drafts (--spec-ngram-mod-n-match). |
| INFERHOST_SPEC_NGRAM_MOD_N_MIN | 48 | Min context window ngram-mod searches back through (--spec-ngram-mod-n-min). |
| INFERHOST_SPEC_NGRAM_MOD_N_MAX | 64 | Max draft tokens ngram-mod proposes on a strong match (--spec-ngram-mod-n-max). Set to 0 to disable the ngram-mod lane. |
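
As a rough illustration of how these settings map onto llama-server flags, the Python sketch below assembles an argument list from the defaults in the table. The `build_server_args` function, its `thinking_capable` parameter, and the case-insensitive filename check are hypothetical; only the flag names come from the table, and inferhost's real command assembly may differ.

```python
# Defaults copied from the reference table; values stay strings, as in .env.
DEFAULTS = {
    "INFERHOST_GPU_LAYERS": "99",
    "INFERHOST_FLASH_ATTENTION": "on",
    "INFERHOST_PARALLEL_SLOTS": "1",
    "INFERHOST_REASONING": "auto",
    "INFERHOST_REASONING_BUDGET": "-1",
    "INFERHOST_SPEC_DRAFT_N_MAX": "2",
    "INFERHOST_SPEC_NGRAM_MOD_N_MATCH": "24",
    "INFERHOST_SPEC_NGRAM_MOD_N_MIN": "48",
    "INFERHOST_SPEC_NGRAM_MOD_N_MAX": "64",
}


def build_server_args(model_filename, thinking_capable=False, overrides=None):
    """Translate settings into llama-server flags (illustrative only)."""
    s = {**DEFAULTS, **(overrides or {})}
    args = [
        "-ngl", s["INFERHOST_GPU_LAYERS"],
        "--parallel", s["INFERHOST_PARALLEL_SLOTS"],
    ]
    if s["INFERHOST_FLASH_ATTENTION"] == "on":
        # Bare flag, as the table describes it.
        args.append("-fa")
    if thinking_capable:
        # Assumption: the caller already knows whether the model can "think".
        args += [
            "--reasoning", s["INFERHOST_REASONING"],
            "--reasoning-budget", s["INFERHOST_REASONING_BUDGET"],
        ]
    # Speculative decoding only applies to models with "mtp" in the filename.
    # Matching case-insensitively is an assumption of this sketch.
    if "mtp" in model_filename.lower():
        if int(s["INFERHOST_SPEC_DRAFT_N_MAX"]) > 0:  # 0 disables the MTP lane
            args += ["--spec-draft-n-max", s["INFERHOST_SPEC_DRAFT_N_MAX"]]
        if int(s["INFERHOST_SPEC_NGRAM_MOD_N_MAX"]) > 0:  # 0 disables ngram-mod
            args += [
                "--spec-ngram-mod-n-match", s["INFERHOST_SPEC_NGRAM_MOD_N_MATCH"],
                "--spec-ngram-mod-n-min", s["INFERHOST_SPEC_NGRAM_MOD_N_MIN"],
                "--spec-ngram-mod-n-max", s["INFERHOST_SPEC_NGRAM_MOD_N_MAX"],
            ]
    return args


print(build_server_args("example-mtp-q4.gguf", thinking_capable=True))
```

The model filename here is made up. Passing `overrides={"INFERHOST_SPEC_DRAFT_N_MAX": "0"}` drops the MTP flag while keeping the ngram-mod flags, which mirrors the per-lane "set to 0 to disable" behavior described in the table.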
How auto-detection works
If you don’t set INFERHOST_LLAMACPP_BACKEND, inferhost runs a small probe at install time:
- NVIDIA? Prefer CUDA, fall back to Vulkan.
- AMD? Prefer ROCm, fall back to Vulkan.
- Intel discrete? Prefer SYCL / OpenVINO.
- Apple Silicon? Use the universal macOS Metal build.
- No GPU? CPU.
If a preferred backend has no prebuilt asset for your platform, inferhost falls back to the next-best option. To skip the probe and pin a backend yourself, set INFERHOST_LLAMACPP_BACKEND.
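
The preference-and-fallback logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not inferhost's probe: the `pick_backend` function, the vendor keys, the `metal` label, the SYCL-before-OpenVINO ordering, and the final fallback to CPU are all hypothetical choices made for the example.

```python
# Preference order per detected GPU vendor, following the list above.
PREFERENCES = {
    "nvidia": ["cuda", "vulkan"],
    "amd": ["rocm", "vulkan"],
    "intel": ["sycl", "openvino"],  # ordering between the two is assumed
    "apple": ["metal"],
    "none": ["cpu"],
}


def pick_backend(gpu_vendor, available_assets, override=None):
    """Return the backend to install for a vendor and a set of prebuilt assets."""
    # An explicit INFERHOST_LLAMACPP_BACKEND value bypasses detection entirely.
    if override and override != "auto":
        return override
    for backend in PREFERENCES.get(gpu_vendor, ["cpu"]):
        if backend in available_assets:
            return backend
    # Assumption: CPU is the last resort when no preferred asset exists.
    return "cpu"


print(pick_backend("nvidia", {"vulkan", "cpu"}))
print(pick_backend("amd", {"rocm", "vulkan"}, override="vulkan"))
```

In the first call no CUDA asset is available, so the function walks down the NVIDIA preference list and settles on Vulkan; in the second, the override wins even though ROCm would otherwise be preferred.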
Changing settings
Changes to .env values or environment variables take effect the next time inferhost (or ./run.sh start) launches the TUI or daemon. If you change INFERHOST_SWAP_PORT, also press r in the TUI (restart) so the daemon rebinds to the new port.