Inference servers¶
Ollama¶
Local inference server - runs models on your machine behind a REST API.
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Start the server (runs on localhost:11434)
ollama serve

# Pull a model
ollama pull qwen2.5:3b

# List downloaded models
ollama list

# Run a model interactively
ollama run qwen2.5:3b

# Remove a model
ollama rm qwen2.5:3b
```
Note: If you are running the demos in this repo via a devcontainer as intended, you do not need to install Ollama — the container environment already includes it.
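Beyond the CLI, the server behind `localhost:11434` can be called directly over HTTP. A minimal standard-library sketch of building a request to Ollama's `/api/generate` endpoint (the endpoint and the `stream` field are part of Ollama's REST API; the model name matches the pull example above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    stream=False asks the server for a single JSON object
    instead of a stream of newline-delimited chunks.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a running `ollama serve`:
# with urllib.request.urlopen(build_request("qwen2.5:3b", "Hello!")) as resp:
#     print(json.loads(resp.read())["response"])
```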
Environment variables¶
| Variable | Purpose |
|---|---|
| `OLLAMA_MODELS` | Directory where models are stored |
| `OLLAMA_HOST` | Server address (default `127.0.0.1:11434`) |
llama.cpp¶
High-performance C/C++ inference engine. Runs GGUF-quantized models and can split MoE layers across CPU and GPU, making it possible to serve large models (100B+) on consumer hardware.
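To see why quantization is what makes this feasible, a quick back-of-envelope size estimate (the bits-per-weight figures are illustrative — actual GGUF quantization types vary):

```python
def model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size of a model at a given precision."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 120B-parameter model at different precisions (illustrative):
print(model_size_gb(120, 16))   # FP16: 240.0 GB -- far beyond consumer hardware
print(model_size_gb(120, 4.5))  # ~4-bit GGUF quant: 67.5 GB -- fits in 64 GB RAM plus some VRAM
```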
```shell
# Build from source with CUDA support (compiles for your GPU automatically)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```
Note: The build compiles CUDA kernels for the GPU(s) detected on your machine. This takes several minutes but only needs to be done once. If you change GPUs, rebuild.
```shell
# Start the server with CPU/GPU MoE split
# Replace <model.gguf> with the path to your GGUF file
llama.cpp/build/bin/llama-server \
  -m <model.gguf> \
  --n-gpu-layers 999 \
  --n-cpu-moe <N> \
  -c 0 --flash-attn on \
  --jinja \
  --host 0.0.0.0 --port 8502 --api-key "dummy"
```
See the Models section for complete, copy-paste run commands.
The server exposes an OpenAI-compatible API, so any OpenAI client library can connect to it.
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8502/v1',
    api_key='your-api-key',
)

response = client.chat.completions.create(
    model='model-name',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello!'},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
Key server flags¶
| Flag | Purpose |
|---|---|
| `-m` | Path to the GGUF model file |
| `--n-gpu-layers` | Number of layers to offload to GPU (`999` = all layers) |
| `--n-cpu-moe` | Number of MoE blocks to keep on CPU (higher values use less VRAM but generate more slowly) |
| `-c` | Context length (`0` = use the model's native context size) |
| `--flash-attn` | Enable flash attention |
| `--host` / `--port` | Server bind address and port |
| `--jinja` | Enable Jinja chat templates (required for harmony and similar formats) |
| `--api-key` | API key for authenticating requests |
CPU/GPU MoE split explained¶
Mixture-of-Experts models have two types of layers: attention layers (small, benefit from GPU) and MoE/expert layers (large, run well on CPU). The `--n-cpu-moe` flag controls how many MoE blocks stay on CPU:
| Config | VRAM usage | Generation speed |
|---|---|---|
| Most MoE blocks on CPU (high `--n-cpu-moe`) | ~5-8 GB | ~18-25 tok/s |
| Few MoE blocks on CPU (low `--n-cpu-moe`) | ~22 GB | ~25-31 tok/s |
This makes it possible to run a 120B parameter model with as little as 8 GB of VRAM and 64 GB of system RAM.
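The arithmetic behind the trade-off can be sketched as follows. The layer count and per-layer sizes below are hypothetical, chosen only to illustrate how the split moves memory between VRAM and system RAM — they do not describe any specific model:

```python
def vram_estimate_gb(n_layers: int, n_cpu_moe: int,
                     attn_gb_per_layer: float, moe_gb_per_layer: float) -> float:
    """Rough VRAM estimate for a CPU/GPU MoE split.

    Attention weights for all layers go to the GPU (--n-gpu-layers 999);
    the expert blocks of the first n_cpu_moe layers stay in system RAM.
    """
    gpu_moe_layers = max(n_layers - n_cpu_moe, 0)
    return n_layers * attn_gb_per_layer + gpu_moe_layers * moe_gb_per_layer

# Hypothetical 36-layer MoE model: 0.1 GB attention, 1.5 GB experts per layer.
print(vram_estimate_gb(36, 36, 0.1, 1.5))  # all experts on CPU: ~3.6 GB VRAM
print(vram_estimate_gb(36, 24, 0.1, 1.5))  # 12 expert blocks on GPU: ~21.6 GB VRAM
```

Attention stays on the GPU in both cases; only the expert blocks move, which is why lowering `--n-cpu-moe` raises VRAM usage (and speed) so sharply.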