Models

openai/gpt-oss-120b

A 120B parameter Mixture-of-Experts (MoE) model released by OpenAI. We use the mxfp4-quantized GGUF version published by ggml-org - the organization behind llama.cpp, the GGML tensor library, and the GGUF model format.

| Detail | Value |
| --- | --- |
| Parameters | 120B total, 5.1B active (Mixture-of-Experts) |
| Quantization | mxfp4 (expert layers), BF16 (attention layers) |
| Format | GGUF (3 shards) |
| Download size | ~60 GB |
| Min system RAM | 64 GB (96 GB recommended) |
| Min VRAM | ~5 GB (attention layers only, with --n-cpu-moe 36) |

# Download the GGUF model
python utils/download_gpt_oss_120b.py

Response format: Harmony

GPT-OSS was trained on OpenAI’s harmony response format. The model uses internal “channels” (e.g. analysis for chain-of-thought, final for the actual response). llama.cpp auto-detects and parses harmony, separating thinking into reasoning_content and the clean response into content.

You can control reasoning effort by adding one of these lines at the top of the system prompt:

  • Reasoning: low - fast responses, minimal thinking

  • Reasoning: medium - balanced speed and detail

  • Reasoning: high - deep, detailed analysis
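As a sketch, here is what a chat request to the server started in the Run section below might look like with the reasoning-effort line in place. The port and API key match the Run command; the helper function itself is hypothetical, shown only to illustrate the request shape:

```python
import json

def build_chat_request(user_message, effort="high",
                       base_url="http://localhost:8502"):
    """Hypothetical helper: build an OpenAI-style chat completion
    request with a 'Reasoning: <effort>' line at the top of the
    system prompt."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "headers": {
            "Authorization": "Bearer dummy",  # matches --api-key "dummy"
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "messages": [
                # The effort line goes at the top of the system prompt
                {"role": "system", "content": f"Reasoning: {effort}"},
                {"role": "user", "content": user_message},
            ],
        }),
    }

req = build_chat_request("Explain MoE routing in two sentences.",
                         effort="low")
print(req["url"])
```

With llama.cpp's harmony parsing, each assistant message in the response then carries the chain-of-thought in `reasoning_content` and the final answer in `content`.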

Run

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--ggml-org--gpt-oss-120b-GGUF/snapshots/*/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 36 \
    -c 0 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

The model has 36 MoE blocks. --n-cpu-moe 36 keeps all expert layers on CPU (lowest VRAM, ~5 GB). Reduce the value to move MoE blocks to GPU if you have VRAM to spare.
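The VRAM trade-off can be sketched with rough arithmetic. The ~1.5 GB-per-block figure below is an assumption derived from spreading the ~60 GB file over 36 expert blocks, not a measured number; actual usage also depends on context length and KV cache:

```python
def estimate_vram_gb(n_cpu_moe, n_blocks=36,
                     attention_gb=5.0, per_block_gb=1.5):
    """Rough sketch: VRAM = attention layers (always on GPU) plus
    every MoE block NOT kept on CPU. per_block_gb is an assumed
    average, not a measured value."""
    gpu_blocks = n_blocks - n_cpu_moe
    return attention_gb + gpu_blocks * per_block_gb

print(estimate_vram_gb(36))  # all experts on CPU -> 5.0
print(estimate_vram_gb(24))  # 12 expert blocks on GPU -> 23.0
```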


openai/gpt-oss-20b

The smaller sibling of GPT-OSS-120B, designed for lower latency and local use cases. At ~11 GB it fits entirely in GPU memory on many consumer GPUs (no CPU MoE split needed), delivering ~50 tok/s generation. Uses the same harmony response format as the 120B model.

| Detail | Value |
| --- | --- |
| Parameters | 21B total, 3.6B active (Mixture-of-Experts) |
| Quantization | mxfp4 (expert layers), BF16 (attention layers) |
| Format | GGUF (single file) |
| Download size | ~11 GB |
| Min VRAM | ~14 GB (fits entirely on GPU) |

# Download the GGUF model
python utils/download_gpt_oss_20b.py

Response format: Harmony

Same as GPT-OSS-120B (see above).

Run

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--ggml-org--gpt-oss-20b-GGUF/snapshots/*/gpt-oss-20b-mxfp4.gguf \
    --n-gpu-layers 999 \
    -c 8192 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

-c 8192 sets the context length to 8,192 tokens (~6,000 words). Increase this to -c 32768 for longer conversations (~24,000 words, enough for a lengthy technical document in a single prompt), at the cost of more VRAM for the KV cache. Use -c 0 to let llama.cpp use the model’s full supported context length automatically.
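The word estimates above follow a rough rule of thumb of about 0.75 English words per token; a quick sketch of the arithmetic:

```python
def approx_words(n_tokens, words_per_token=0.75):
    """Rough rule of thumb for English text; the actual ratio
    varies with language and tokenizer."""
    return int(n_tokens * words_per_token)

print(approx_words(8192))   # 6144 -> "~6,000 words"
print(approx_words(32768))  # 24576 -> "~24,000 words"
```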

No --n-cpu-moe needed - the model fits entirely in GPU memory.


Qwen3.5-35B-A3B

A 35B parameter Mixture-of-Experts vision-language model from Alibaba’s Qwen team, with only 3B active parameters per token. Smaller and faster than GPT-OSS-120B, making it a good choice when serving multiple concurrent users. We use the mxfp4-quantized GGUF version by noctrex.

| Detail | Value |
| --- | --- |
| Parameters | 35B total, 3B active (Mixture-of-Experts) |
| Quantization | mxfp4 (expert layers), BF16 (attention layers) |
| Format | GGUF (single file) |
| Download size | ~22 GB |
| Vision support | Yes (with mmproj-BF16.gguf projection file) |
| Min system RAM | 32 GB |
| Min VRAM | ~3 GB (attention layers only, with --n-cpu-moe) |

# Download the GGUF model
python utils/download_qwen35_35b.py

Response format

Qwen3.5 uses <think>...</think> tags for chain-of-thought reasoning (the same convention as DeepSeek). llama.cpp auto-detects this and separates thinking into reasoning_content. To disable thinking and get direct responses, add /no_think to the end of your user message.
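If you consume the raw model output instead of llama.cpp's parsed reasoning_content/content fields, the tags can be split with a small sketch like the one below. This is illustrative only; llama.cpp's own parsing is more robust:

```python
import re

def split_think(raw):
    """Split Qwen/DeepSeek-style output into (reasoning, answer).
    Returns empty reasoning if no <think> block is present."""
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", raw, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", raw.strip()

reasoning, answer = split_think("<think>2+2 is 4.</think>The answer is 4.")
print(answer)  # The answer is 4.
```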

Run

llama.cpp/build/bin/llama-server \
    -m models/hugging_face/hub/models--noctrex--Qwen3.5-35B-A3B-MXFP4_MOE-GGUF/snapshots/*/Qwen3.5-35B-A3B-MXFP4_MOE_BF16.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 40 \
    -c 0 --flash-attn on \
    --jinja \
    --host 0.0.0.0 --port 8502 --api-key "dummy"

The model has 40 MoE blocks. --n-cpu-moe 40 keeps all expert layers on CPU. At ~22 GB this model is much smaller and faster than GPT-OSS-120B, making it a good choice for consumer hardware. To enable image input, also pass the mmproj-BF16.gguf projection file via --mmproj.