Models¶

openai/gpt-oss-120b¶

A 120B parameter Mixture-of-Experts (MoE) model released by OpenAI. We use the mxfp4-quantized GGUF version published by ggml-org - the organization behind llama.cpp, the GGML tensor library, and the GGUF model format.

Detail	Value
Parameters	120B (Mixture-of-Experts)
Quantization	mxfp4 (expert layers), BF16 (attention layers)
Format	GGUF (3 shards)
Download size	~60 GB
Min system RAM	64 GB (96 GB recommended)
Min VRAM	~5 GB (attention layers only, with `--n-cpu-moe 36`)

# Download the GGUF model
python utils/download_gpt_oss_120b.py

Response format: Harmony¶

GPT-OSS was trained on OpenAI’s harmony response format. The model uses internal “channels” (e.g. analysis for chain-of-thought, final for the actual response). llama.cpp auto-detects and parses harmony, separating thinking into reasoning_content and the clean response into content.

You can control reasoning effort by adding one of these lines at the top of the system prompt:

Reasoning: low - fast responses, minimal thinking
Reasoning: medium - balanced speed and detail
Reasoning: high - deep, detailed analysis

Run¶

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/*/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 36 \
    -c 0 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

The model has 36 MoE blocks. --n-cpu-moe 36 keeps all expert layers on CPU (lowest VRAM, ~5 GB). Reduce the value to move MoE blocks to GPU if you have VRAM to spare.

openai/gpt-oss-20b¶

The smaller sibling of GPT-OSS-120B, designed for lower latency and local use cases. At ~11 GB it fits entirely in GPU memory on many consumer GPUs (no CPU MoE split needed), delivering ~50 tok/s generation. Uses the same harmony response format as the 120B model.

Detail	Value
Parameters	21B total, 3.6B active (Mixture-of-Experts)
Quantization	mxfp4 (expert layers), BF16 (attention layers)
Format	GGUF (single file)
Download size	~11 GB
Min VRAM	~14 GB (fits entirely on GPU)

# Download the GGUF model
python utils/download_gpt_oss_20b.py

Response format: Harmony¶

Same as GPT-OSS-120B (see above).

Run¶

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--ggml-org--gpt-oss-20b-GGUF/snapshots/*/gpt-oss-20b-mxfp4.gguf \
    --n-gpu-layers 999 \
    -c 8192 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

-c 8192 sets the context length to 8,192 tokens (~6,000 words). Increase this to -c 32768 for longer conversations (~24,000 words — enough for an entire technical manual or codebase in a single prompt), at the cost of more VRAM. Use -c 0 to let llama.cpp use the model’s full supported context length automatically.

No --n-cpu-moe needed - the model fits entirely in GPU memory.

Qwen3.5-35B-A3B¶

A 35B parameter Mixture-of-Experts vision-language model from Alibaba’s Qwen team, with only 3B active parameters per token. Smaller and faster than GPT-OSS-120B, making it a good choice when serving multiple concurrent users. We use the mxfp4-quantized GGUF version by noctrex.

Detail	Value
Parameters	35B total, 3B active (Mixture-of-Experts)
Quantization	mxfp4 (expert layers), BF16 (attention layers)
Format	GGUF (single file)
Download size	~22 GB
Vision support	Yes (with mmproj-BF16.gguf projection file)
Min system RAM	32 GB
Min VRAM	~3 GB (attention layers only, with `--n-cpu-moe`)

# Download the GGUF model
python utils/download_qwen35_35b.py

Response format¶

Qwen3.5 uses <think>...</think> tags for chain-of-thought reasoning (the same convention as DeepSeek). llama.cpp auto-detects this and separates thinking into reasoning_content. To disable thinking and get direct responses, add /no_think to the end of your user message.

Run¶

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--noctrex--Qwen3.5-35B-A3B-MXFP4_MOE-GGUF/snapshots/*/Qwen3.5-35B-A3B-MXFP4_MOE_BF16.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 40 \
    -c 0 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

The model has 40 MoE blocks. --n-cpu-moe 40 keeps all expert layers on CPU. This model is much smaller (~22 GB) and faster than GPT-OSS-120B, making it a good choice for consumer hardware.