# Systemd deployment

This guide covers deploying llama.cpp server as a systemd service on the host OS for production use.

## Prerequisites

- Linux host with systemd
- NVIDIA GPU with CUDA drivers installed
- Root/sudo access

## Quick start

### 1. Build llama.cpp on host

```bash
cd /opt
sudo git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
sudo cmake -B build -DGGML_CUDA=ON
sudo cmake --build build --config Release -j$(nproc)
```

> **Note:** Building takes several minutes and compiles CUDA kernels for your GPU.

### 2. Set up directories and models

```bash
# Create model directory
sudo mkdir -p /opt/models
```

Download models using one of these methods:

**Option A: Download on host with HF_HOME set**

```bash
# Set HF_HOME to download directly to /opt/models
sudo HF_HOME=/opt/models python3 /path/to/llms-demo/utils/download_gpt_oss_20b.py
```

**Option B: Copy from dev container**

If you already downloaded models in the dev container:

```bash
# From your host OS (outside container)
sudo cp -r /path/to/llms-demo/models/hugging_face /opt/models/
```

**Option C: Manual download**

Download GGUF files directly from HuggingFace:

```bash
# Example for GPT-OSS-20B
cd /opt/models
sudo wget https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
```

> **Note:** The download scripts respect the `HF_HOME` environment variable. Without setting it, models download to `~/.cache/huggingface/` by default.

### 3. Create service user

```bash
sudo useradd -r -s /bin/false -d /opt/llama.cpp llama
sudo chown -R llama:llama /opt/llama.cpp
sudo chown -R llama:llama /opt/models
```

### 4. Generate API key

```bash
API_KEY=$(openssl rand -base64 32)
echo "Your API key: $API_KEY"
# Save this key securely!
```

### 5. Install and configure service

```bash
# Copy unit file to systemd
sudo cp utils/llamacpp.service /etc/systemd/system/

# Edit the service file with your API key and model path
sudo nano /etc/systemd/system/llamacpp.service

# Update these lines:
# - Replace YOUR_API_KEY_HERE with your generated key
# - Replace the * in model path with actual snapshot hash
# - Modify --n-cpu-moe if using MoE model
```

> **Important:** Systemd doesn't expand shell wildcards (`*`) in ExecStart. You must replace `snapshots/*/` with the actual hash directory.

### 6. Enable and start

```bash
# Reload systemd
sudo systemctl daemon-reload

# Enable service to start on boot
sudo systemctl enable llamacpp.service

# Start the service
sudo systemctl start llamacpp.service

# Check status
sudo systemctl status llamacpp.service
```

## Monitoring

### View logs

No log file is needed - systemd automatically captures all stdout/stderr output and forwards it to **journald** (the system journal). This is preferable to a log file: journald handles rotation automatically, logs survive if the service crashes before flushing, and you get structured querying by time, boot, and priority.

```bash
# Follow logs in real-time
sudo journalctl -u llamacpp.service -f

# View last 100 lines
sudo journalctl -u llamacpp.service -n 100

# View logs since boot
sudo journalctl -u llamacpp.service -b
```

### Check metrics

```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  http://localhost:8502/metrics
```

### Service management

```bash
# Stop service
sudo systemctl stop llamacpp.service

# Restart service
sudo systemctl restart llamacpp.service

# Enable service (will start on boot)
sudo systemctl enable llamacpp.service

# Disable service (don't start on boot)
sudo systemctl disable llamacpp.service

# View service status
sudo systemctl status llamacpp.service
```

## Configuration

### Model-specific settings

> **Note:** Replace `*` with the actual snapshot hash from your `/opt/models/hub/models--*/snapshots/` directory. Systemd doesn't expand wildcards.

**GPT-OSS-120B** (120B MoE):

```bash
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m PATH_TO_MODEL \
  --n-gpu-layers 999 \
  --n-cpu-moe 36 \
  -c 8192 \
  --flash-attn on \
  --jinja \
  --host 0.0.0.0 \
  --port 8502 \
  --api-key YOUR_API_KEY \
  --metrics \
  --log-timestamps
```

**GPT-OSS-20B** (21B):

```bash
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m PATH_TO_MODEL \
  --n-gpu-layers 999 \
  -c 8192 \
  --flash-attn on \
  --jinja \
  --host 0.0.0.0 \
  --port 8502 \
  --api-key YOUR_API_KEY \
  --metrics \
  --log-timestamps
```

**Qwen3.5-35B-A3B** (35B MoE):

```bash
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m PATH_TO_MODEL \
  --n-gpu-layers 999 \
  --n-cpu-moe 40 \
  -c 8192 \
  --flash-attn on \
  --jinja \
  --host 0.0.0.0 \
  --port 8502 \
  --api-key YOUR_API_KEY \
  --metrics \
  --log-timestamps
```

### Context length (`-c`)

The `-c` flag sets the maximum context length in tokens (combined prompt + response).

- `-c 8192` - 8K tokens (~6,000 words), suitable for most interactive use cases (~2-4 GB KV cache depending on model)
- `-c 32768` - 32K tokens (~24,000 words), enough to fit an entire technical manual or codebase in a single context window (uses significantly more VRAM)
- `-c 0` - use the model's maximum supported context length (often 128K+); **avoid this unless you have substantial free VRAM** - it will allocate a KV cache for the full context window at startup, which can exhaust GPU memory and force inference to fall back to CPU

If the server starts but inference is unexpectedly slow with high CPU usage, an oversized context is the most likely cause. Start conservative (`-c 8192`) and increase only if needed.
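
To get a rough feel for how the KV cache scales with `-c`, the sketch below applies the usual estimate for a standard attention model: 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. It assumes an unquantized f16 cache (2 bytes per element) and full attention in every layer. The function name `kv_cache_mib` and the layer count, KV head count, and head dimension passed to it are hypothetical placeholders, not values for any model listed in this guide; models with sliding-window attention or a quantized KV cache will need less.

```bash
#!/usr/bin/env bash
# Rough KV cache estimate in MiB.
# Arguments: context length, layers, KV heads, head dimension, bytes per element (default 2 = f16).
# The example values used below are hypothetical, not taken from a real model.
kv_cache_mib() {
  local ctx=$1 n_layers=$2 n_kv_heads=$3 head_dim=$4 bytes_per_elem=${5:-2}
  # Factor of 2 accounts for storing both keys and values.
  echo $(( 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024 / 1024 ))
}

kv_cache_mib 8192 64 8 128    # prints 2048
kv_cache_mib 32768 64 8 128   # prints 8192
```

The estimate grows linearly with context length, so quadrupling `-c` quadruples the cache. Treat the result as an order-of-magnitude check and confirm actual usage with `nvidia-smi` after the service starts.
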
## Troubleshooting

### OpenSSL warning during cmake

If cmake prints the following warning, HTTPS support will be disabled:

```
CMake Warning at vendor/cpp-httplib/CMakeLists.txt:150 (message):
  OpenSSL not found, HTTPS support disabled
```

Install the OpenSSL development libraries and re-run cmake:

```bash
# Debian/Ubuntu
sudo apt install -y libssl-dev

# RHEL/Fedora
sudo dnf install -y openssl-devel

# Then re-run cmake
cd /opt/llama.cpp
sudo cmake -B build -DGGML_CUDA=ON
sudo cmake --build build --config Release -j$(nproc)
```

### Service won't start

Check logs for errors:

```bash
sudo journalctl -u llamacpp.service -n 50
```

Common issues:

- **Model file not found**: Systemd doesn't expand wildcards (`*`). Find the actual snapshot hash:
  ```bash
  sudo ls /opt/models/hub/models--ggml-org--gpt-oss-20b-GGUF/snapshots/
  ```
  Then replace `snapshots/*/` with `snapshots/ACTUAL_HASH/` in the ExecStart line.
- **Permission denied**: Check ownership with `ls -la /opt/llama.cpp`
- **CUDA errors**: Ensure NVIDIA drivers are installed (`nvidia-smi`)
- **Port already in use**: Check if another process is using port 8502

### Performance issues

Check resource usage:

```bash
# CPU/memory (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, llama-server)"

# GPU
nvidia-smi -l 1

# Detailed GPU stats
nvidia-smi dmon -s u
```

### Update llama.cpp

```bash
# Stop service
sudo systemctl stop llamacpp.service

# Update and rebuild
cd /opt/llama.cpp
sudo git pull
sudo cmake --build build --config Release -j$(nproc)

# Start service
sudo systemctl start llamacpp.service
```

## Security considerations

1. **API key**: Use a strong random key (32+ characters)
2. **User isolation**: Run as a dedicated `llama` user (not root)
3. **File permissions**: Ensure models are readable only by the `llama` user
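
One way to apply the file-permissions point is sketched below. It assumes the model tree is already owned by `llama` (as set up with `chown` in the quick start); the helper name `restrict_model_dir` is hypothetical and not part of llama.cpp or this repository. It removes all group and other access, then lists anything that is still accessible to them.

```bash
#!/usr/bin/env bash
# Restrict a directory tree so only its owner can read it.
# Hypothetical helper for illustration; assumes the tree is already owned by the service user.
restrict_model_dir() {
  local dir=$1
  # u=rwX: owner keeps read/write, plus execute on directories (needed to traverse them).
  # go=:   group and others lose all access.
  chmod -R u=rwX,go= "$dir"
  # Print any path that group or others can still access; no output means the tree is locked down.
  find "$dir" -perm /077
}

# Example (run as root, since the files belong to the llama user):
#   restrict_model_dir /opt/models
```

Because shell functions are not visible to `sudo`, the equivalent one-off command on the host is `sudo chmod -R u=rwX,go= /opt/models`.
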