# Home

Example code demonstrating local LLM inference with various backends and libraries.
## Overview

This repository contains chatbot demos and hands-on activities for learning prompt engineering and local LLM deployment.
## Demos

### Chatbots (`demos/chatbots/`)

- **HuggingFace chatbot**: Direct model loading with Transformers — no inference server needed
- **Ollama chatbot**: Terminal chatbot using LangChain + a local Ollama server
- **llama.cpp chatbot**: Terminal chatbot using the OpenAI-compatible llama.cpp API
- **Gradio chatbot**: Web UI with switchable Ollama / llama.cpp backends and a customizable system prompt
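The Ollama, llama.cpp, and Gradio chatbots all talk to a local server that exposes an OpenAI-compatible chat endpoint. A minimal sketch of that pattern using only the standard library is shown below; the URL, port, and model name are placeholder assumptions, not values taken from this repository.

```python
import json
import urllib.request

# Assumed endpoint: llama.cpp's server and Ollama both expose an
# OpenAI-compatible /v1/chat/completions route; adjust host/port to yours.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_request(history, user_message, model="local-model"):
    """Append the user turn and build an OpenAI-style request payload."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages, "temperature": 0.7}

def chat(history, user_message):
    """Send one turn to the local server and return the assistant reply."""
    payload = build_request(history, user_message)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    # Keep both turns so the next call sees the full conversation
    history += [{"role": "user", "content": user_message},
                {"role": "assistant", "content": reply}]
    return reply
```

Because the request format is shared, switching between Ollama and llama.cpp backends (as the Gradio demo does) amounts to changing `API_URL` and the model name.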
### LangChain patterns (`demos/langchain_patterns/`)

- **LangChain demo**: Prompt templates, output parsers, LCEL chains, and few-shot learning
- **ReAct agent**: LangChain agent with custom tools and multi-step reasoning (two versions: framework and manual)
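The "manual" ReAct variant boils down to a loop: the model emits a Thought and an Action, the harness runs the named tool and feeds the Observation back, and the loop ends at a Final Answer. A self-contained sketch of that loop follows; the tool name, the step format, and the stubbed `fake_llm` are illustrative assumptions standing in for the demo's real local model and tools.

```python
import re

# Hypothetical tool registry; the demo's actual tools may differ.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def fake_llm(prompt):
    """Stand-in for a local LLM: emits ReAct-style steps for one question."""
    if "Observation:" not in prompt:
        return "Thought: I should compute this.\nAction: calculator[2 + 3]"
    return "Thought: I have the result.\nFinal Answer: 5"

def react_loop(question, llm=fake_llm, max_steps=5):
    """Alternate model steps and tool calls until a Final Answer appears."""
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(prompt)
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.+?)\]", step)
        if match:
            tool, arg = match.groups()
            observation = TOOLS[tool](arg)
            # Append the step and its observation so the model sees history
            prompt += f"\n{step}\nObservation: {observation}"
    return None

print(react_loop("What is 2 + 3?"))  # → 5
```

The framework version delegates this loop (and the step parsing) to LangChain's agent machinery; the control flow is the same.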
### RAG system (`demos/rag_system/`)

- **RAG demo**: Ingest Wikipedia articles into a pgvector knowledge base and query them with a grounded LLM
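At query time a RAG system retrieves the nearest chunks from pgvector and packs them into a grounded prompt. The sketch below assumes a hypothetical `chunks` table with `content` and `embedding` columns (the demo's actual schema and prompt wording may differ):

```python
# Assumed schema: chunks(content TEXT, embedding VECTOR(...)), as produced
# by the ingestion step. `<=>` is pgvector's cosine-distance operator.
TOP_K_SQL = """
    SELECT content
    FROM chunks
    ORDER BY embedding <=> %s::vector
    LIMIT %s;
"""

def build_grounded_prompt(question, retrieved_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The "grounded" part is the instruction to answer only from the retrieved context, which is what keeps the LLM's answer tied to the ingested Wikipedia text.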
## Infrastructure

- **Inference servers**: Ollama (lightweight), llama.cpp (high-performance MoE serving)
- **Libraries**: Transformers, LangChain, Gradio
- **Models**: GPT-OSS-120B (120B MoE), GPT-OSS-20B (21B), Qwen3.5-35B-A3B (35B MoE with vision) — any GGUF-compatible model can be added by downloading the weights and pointing the server at them
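Adding a GGUF model, as the last bullet describes, can be sketched as a few commands. The repository name, file name, and port below are placeholders rather than values from this project; `llama-server` is the HTTP server bundled with llama.cpp, and Ollama imports GGUF weights through a Modelfile.

```shell
# Download GGUF weights (placeholder repo and file names)
huggingface-cli download some-org/some-model-GGUF some-model.Q4_K_M.gguf --local-dir models

# Option A: serve with llama.cpp's OpenAI-compatible server
llama-server -m models/some-model.Q4_K_M.gguf --port 8080

# Option B: register the weights with Ollama via a Modelfile
echo 'FROM ./models/some-model.Q4_K_M.gguf' > Modelfile
ollama create some-model -f Modelfile
ollama run some-model
```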
## Get started

See the Quickstart guide for installation and setup, then explore the Demos to learn about different inference approaches.