Inference servers¶
Ollama¶
Local inference server - runs models on your machine behind a REST API.
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Start the server (runs on localhost:11434)
ollama serve

# Pull a model
ollama pull qwen2.5:3b

# List downloaded models
ollama list

# Run a model interactively
ollama run qwen2.5:3b

# Remove a model
ollama rm qwen2.5:3b
```
Note: If you are running the demos in this repo via a devcontainer as intended, you do not need to install Ollama — the container environment already includes it.
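Beyond the CLI, the server behind `localhost:11434` can be called directly over HTTP. A minimal standard-library sketch of building a request to Ollama's `/api/generate` endpoint (the endpoint and the `stream` field are part of Ollama's REST API; the model name matches the pull example above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    stream=False asks the server for a single JSON object
    instead of a stream of newline-delimited chunks.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a running `ollama serve`:
# with urllib.request.urlopen(build_request("qwen2.5:3b", "Hello!")) as resp:
#     print(json.loads(resp.read())["response"])
```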
Environment variables¶
| Variable | Purpose |
|---|---|
| `OLLAMA_MODELS` | Directory where models are stored |
| `OLLAMA_HOST` | Server address (default `127.0.0.1:11434`) |
llama.cpp¶
High-performance C/C++ inference engine. Runs GGUF-quantized models and can split MoE layers across CPU and GPU, making it possible to serve large models (100B+) on consumer hardware.
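To see why quantization is what makes this feasible, a quick back-of-envelope size estimate (the bits-per-weight figures are illustrative — actual GGUF quantization types vary):

```python
def model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size of a model at a given precision."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 120B-parameter model at different precisions (illustrative):
print(model_size_gb(120, 16))   # FP16: 240.0 GB -- far beyond consumer hardware
print(model_size_gb(120, 4.5))  # ~4-bit GGUF quant: 67.5 GB -- fits in 64 GB RAM plus some VRAM
```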
```shell
# Build from source with CUDA support (compiles for your GPU automatically)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```
Note: The build compiles CUDA kernels for the GPU(s) detected on your machine. This takes several minutes but only needs to be done once. If you change GPUs, rebuild.
```shell
# Start the server with CPU/GPU MoE split
# Replace <model.gguf> with the path to your GGUF file
llama.cpp/build/bin/llama-server \
  -m <model.gguf> \
  --n-gpu-layers 999 \
  --n-cpu-moe <N> \
  -c 0 --flash-attn on \
  --jinja \
  --host 0.0.0.0 --port 8502 --api-key "dummy"
```
See the Models section for complete, copy-paste run commands.
The server exposes an OpenAI-compatible API, so any OpenAI client library can connect to it.
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8502/v1',
    api_key='your-api-key',
)

response = client.chat.completions.create(
    model='model-name',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello!'},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
Key server flags¶
| Flag | Purpose |
|---|---|
| `-m` | Path to the GGUF model file |
| `--n-gpu-layers` | Number of layers to offload to GPU (`999` = all layers) |
| `--n-cpu-moe` | Number of MoE blocks to keep on CPU (higher values use less VRAM but generate more slowly) |
| `-c` | Context length (`0` = use the model's native context size) |
| `--flash-attn` | Enable flash attention |
| `--host` / `--port` | Server bind address and port |
| `--jinja` | Enable Jinja chat templates (required for harmony and similar formats) |
| `--api-key` | API key for authenticating requests |
CPU/GPU MoE split explained¶
Mixture-of-Experts models have two types of layers: attention layers (small, benefit from GPU) and MoE/expert layers (large, run well on CPU). The `--n-cpu-moe` flag controls how many MoE blocks stay on CPU:
| Config | VRAM usage | Generation speed |
|---|---|---|
| Most MoE blocks on CPU (high `--n-cpu-moe`) | ~5-8 GB | ~18-25 tok/s |
| Few MoE blocks on CPU (low `--n-cpu-moe`) | ~22 GB | ~25-31 tok/s |
This makes it possible to run a 120B parameter model with as little as 8 GB of VRAM and 64 GB of system RAM.
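The arithmetic behind the trade-off can be sketched as follows. The layer count and per-layer sizes below are hypothetical, chosen only to illustrate how the split moves memory between VRAM and system RAM — they do not describe any specific model:

```python
def vram_estimate_gb(n_layers: int, n_cpu_moe: int,
                     attn_gb_per_layer: float, moe_gb_per_layer: float) -> float:
    """Rough VRAM estimate for a CPU/GPU MoE split.

    Attention weights for all layers go to the GPU (--n-gpu-layers 999);
    the expert blocks of the first n_cpu_moe layers stay in system RAM.
    """
    gpu_moe_layers = max(n_layers - n_cpu_moe, 0)
    return n_layers * attn_gb_per_layer + gpu_moe_layers * moe_gb_per_layer

# Hypothetical 36-layer MoE model: 0.1 GB attention, 1.5 GB experts per layer.
print(vram_estimate_gb(36, 36, 0.1, 1.5))  # all experts on CPU: ~3.6 GB VRAM
print(vram_estimate_gb(36, 24, 0.1, 1.5))  # 12 expert blocks on GPU: ~21.6 GB VRAM
```

Attention stays on the GPU in both cases; only the expert blocks move, which is why lowering `--n-cpu-moe` raises VRAM usage (and speed) so sharply.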