Demos

This repository includes nine demos that demonstrate different approaches to local LLM inference and application patterns. Each demo covers specific concepts and tools.

Demo 1: HuggingFace chatbot

File: demos/chatbots/huggingface_chatbot.py

Concepts covered:

  • Direct model loading (no inference server)

  • Chat templates and tokenization

  • Generation parameters (temperature, max tokens)

  • Decoding and response formatting

Tools used:

  • HuggingFace Transformers - model loading, chat templates, and text generation

Running the demo:

# 1. Run the chatbot (downloads model on first run)
python demos/chatbots/huggingface_chatbot.py

# Note: This loads the model directly into memory (no inference server needed).
# First run will download approximately 6GB of model files to models/hugging_face/
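In outline, the direct-loading flow looks like the sketch below; the model name, prompts, and generation settings are illustrative assumptions, not the demo's exact values:

```python
def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Assemble the message list that the tokenizer's chat template expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]

def chat_once(user_text: str) -> str:
    """Load the model, apply the chat template, generate, and decode."""
    # Third-party import kept local so build_messages stays dependency-free
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed model choice
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    messages = build_messages("You are a helpful assistant.", user_text)
    # apply_chat_template inserts the model-specific role/special tokens
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(
        input_ids, max_new_tokens=256, temperature=0.7, do_sample=True
    )
    # Decode only the newly generated tokens, skipping the echoed prompt
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```

Calling chat_once() downloads the model weights on first run, just like the demo script.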

Demo 2: Ollama chatbot

File: demos/chatbots/ollama_chatbot.py

Concepts covered:

  • Using a local inference server (Ollama)

  • Structured message types (SystemMessage, HumanMessage, AIMessage)

  • Conversation history management

  • Terminal-based interaction

Tools used:

  • Ollama - local inference server

  • LangChain - chat model wrapper and structured message types

Running the demo:

# 1. Start the Ollama server in a terminal
ollama serve

# 2. Pull a model (in another terminal)
ollama pull qwen2.5:3b

# 3. Run the chatbot
python demos/chatbots/ollama_chatbot.py
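A minimal sketch of this pattern, assuming a running Ollama server and the qwen2.5:3b model pulled as above; the history-trimming policy is an illustrative addition, not necessarily what the demo does:

```python
def trim_history(history: list, max_turns: int) -> list:
    """Keep the system message plus the most recent max_turns exchanges."""
    if len(history) <= 1 + 2 * max_turns:
        return history
    return [history[0]] + history[-2 * max_turns:]

def chat_loop() -> None:
    """Terminal chat loop that carries the conversation history each turn."""
    # Third-party imports kept local so trim_history stays dependency-free
    from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
    from langchain_ollama import ChatOllama  # needs `ollama serve` running

    llm = ChatOllama(model="qwen2.5:3b")
    history = [SystemMessage(content="You are a concise assistant.")]
    while True:
        user_text = input("You: ")
        if user_text.strip().lower() in {"exit", "quit"}:
            break
        history.append(HumanMessage(content=user_text))
        reply = llm.invoke(history)  # the model sees the whole history
        print("Bot:", reply.content)
        history.append(AIMessage(content=reply.content))
        history = trim_history(history, max_turns=8)
```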

Demo 3: llama.cpp chatbot

File: demos/chatbots/llamacpp_chatbot.py

Concepts covered:

  • Running large MoE models (120B+ parameters) on consumer hardware

  • CPU/GPU memory split for expert layers

  • OpenAI-compatible API usage

  • Remote vs. local inference servers

Tools used:

  • llama.cpp - High-performance C++ inference engine

  • OpenAI Python client - Standard API interface

Running the demo:

You have two choices: use the hosted model at gpt.perdrizet.org, or run llama.cpp locally.

Option 1: Using the remote server (recommended for quick start)

# 1. Create a .env file with your API credentials
cp .env.example .env

# 2. Edit .env and set:
#    PERDRIZET_URL=gpt.perdrizet.org
#    PERDRIZET_API_KEY=your-api-key-here

# 3. Run the chatbot
python demos/chatbots/llamacpp_chatbot.py

Option 2: Running llama.cpp locally

# 1. Download a GGUF model (e.g., GPT-OSS-20B)
python utils/download_gpt_oss_20b.py

# 2. Build llama.cpp with CUDA support
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
cd ..

# 3. Start the llama-server (see model-specific commands in the Models section)
llama.cpp/build/bin/llama-server -m <model.gguf> <flags...>

# 4. Run the chatbot (in another terminal)
python demos/chatbots/llamacpp_chatbot.py

Note: For localhost, the defaults work automatically (localhost:8502 with “dummy” API key). For remote servers, configure PERDRIZET_URL and PERDRIZET_API_KEY in your .env file.

Demo 4: Gradio chatbot

File: demos/chatbots/gradio_chatbot.py

Concepts covered:

  • Web-based chat interfaces

  • Multi-backend architecture (switching between Ollama/llama.cpp)

  • System prompt customization

  • Error handling and user feedback

Tools used:

  • Gradio - web chat interface

  • Ollama / llama.cpp - switchable inference backends

Running the demo:

# 1. Start the Ollama server in a terminal
ollama serve

# 2. Pull a model (in another terminal)
ollama pull qwen2.5:3b

# 3. Run the Gradio chatbot
python demos/chatbots/gradio_chatbot.py

# 4. Open the URL shown in the terminal (usually http://127.0.0.1:7860)
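The core of a multi-backend Gradio chat UI can be sketched as below; the echo backend is a stand-in so the sketch runs without any server, and real backends would dispatch to Ollama or llama.cpp instead:

```python
def respond(message: str, history: list, backend: str) -> str:
    """Route the user message to the selected backend."""
    if backend == "echo":  # placeholder backend for illustration
        return f"[echo] {message}"
    raise ValueError(f"Unknown backend: {backend}")

def launch() -> None:
    """Build and serve the chat interface."""
    import gradio as gr  # third-party; pip install gradio

    demo = gr.ChatInterface(
        fn=respond,  # called as respond(message, history, *additional inputs)
        additional_inputs=[gr.Dropdown(["echo"], value="echo", label="Backend")],
        title="Multi-backend chatbot (sketch)",
    )
    demo.launch()  # prints the local URL, usually http://127.0.0.1:7860
```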

Demo 5: LangChain basics

File: demos/langchain_patterns/langchain_demo.py

Concepts covered:

  • Chat models and LLM wrappers

  • Prompt templates with variable substitution

  • Structured output parsing with Pydantic schemas

  • Basic chains and composition with LCEL

  • Few-shot learning patterns

Tools used:

  • LangChain - prompt templates, LCEL chains, and output parsers

  • Pydantic - structured output schemas

  • Ollama - local inference server

Running the demo:

# 1. Start the Ollama server in a terminal
ollama serve

# 2. Pull a model (in another terminal)
ollama pull qwen2.5:3b

# 3. Run the LangChain demo
python demos/langchain_patterns/langchain_demo.py

# 4. Open the URL shown in the terminal (usually http://127.0.0.1:7860)

Four interactive examples:

  1. Simple chain: Prompt template → LLM → String output

    • Try: “machine learning”, “photosynthesis”, “blockchain”

  2. Sentiment analysis: Structured JSON output with Pydantic schema

    • Try: Product reviews, comments, social media posts

    • See how the parser extracts sentiment, confidence, and key phrases

  3. Entity extraction: Different schemas for different entity types

    • Person: name, age, occupation, location

    • Recipe: name, cuisine, ingredients, difficulty

    • Switch schemas to see how the same chain extracts different information

  4. Few-shot learning: Style classification with examples

    • The model learns from 4 in-prompt examples

    • Try: Technical, casual, formal, or creative writing styles

What to observe:

  • Reusability: Same chain works for multiple inputs

  • Type safety: Pydantic schemas ensure structured outputs

  • Composability: Chains combine prompt, model, and parser seamlessly

  • Format instructions: See how Pydantic schemas generate parsing guidance
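The sentiment example's prompt → model → parser composition can be sketched like this; the schema fields and model name are assumptions based on the description above:

```python
from pydantic import BaseModel, Field

class Sentiment(BaseModel):
    """Assumed output schema for the sentiment chain."""
    label: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(description="confidence between 0.0 and 1.0")

def build_chain():
    """Compose prompt | model | parser with LCEL."""
    from langchain_core.output_parsers import PydanticOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_ollama import ChatOllama  # needs `ollama serve` running

    parser = PydanticOutputParser(pydantic_object=Sentiment)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Classify the sentiment of the text. {format_instructions}"),
        ("human", "{text}"),
    ]).partial(format_instructions=parser.get_format_instructions())

    # The pipe operator chains prompt, model, and parser into one runnable
    return prompt | ChatOllama(model="qwen2.5:3b") | parser
```

build_chain().invoke({"text": "I love this product!"}) would return a validated Sentiment instance, assuming the model follows the format instructions.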

Demo 6: ReAct agent chatbot

Files:

  • demos/langchain_patterns/react_agent_chatbot.py - Uses LangChain’s agent framework

  • demos/langchain_patterns/react_agent_chatbot_manual.py - Manual implementation from scratch

Concepts covered:

  • ReAct (Reasoning + Acting) agent pattern

  • Multi-step reasoning with tool use

  • Tool selection and execution

  • Agent iteration loops and error handling

  • Comparing high-level frameworks vs. manual implementation

Tools used:

  • LangChain - agent framework and tool definitions

  • Ollama - local inference server

  • Gradio - web interface

Two versions available:

This demo includes both a production-ready implementation and an educational version that reveals the inner workings:

  1. Built-in agent (react_agent_chatbot.py): Uses LangChain’s create_agent() API for automatic ReAct pattern handling. This is the recommended approach for real applications.

  2. Manual implementation (react_agent_chatbot_manual.py): A hand-coded ReAct loop with regex parsing that shows exactly what LangChain does behind the scenes. This version demonstrates:

    • How to prompt the LLM to follow the ReAct pattern

    • Parsing LLM responses to extract actions and answers

    • Manual tool execution and observation injection

    • The explicit iteration loop that drives the agent

    Use this version to understand the mechanics of agent frameworks before relying on them.
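The parsing step at the heart of the manual loop can be sketched as follows; the exact labels (Action, Action Input, Final Answer) follow the classic ReAct prompt format and are an assumption about what the demo's prompt requests:

```python
import re

ACTION_RE = re.compile(r"Action:\s*(?P<tool>\w+)\s*\nAction Input:\s*(?P<arg>.+)")
FINAL_RE = re.compile(r"Final Answer:\s*(?P<answer>.+)", re.DOTALL)

def parse_step(llm_text: str) -> tuple[str, str]:
    """Classify one LLM turn as a final answer or a tool call.

    Returns ("final", answer) or ("tool:<name>", tool_input).
    """
    final = FINAL_RE.search(llm_text)
    if final:  # check for a final answer first so it wins over stray actions
        return "final", final.group("answer").strip()
    action = ACTION_RE.search(llm_text)
    if action:
        return f"tool:{action.group('tool')}", action.group("arg").strip()
    raise ValueError("LLM output matched neither an Action nor a Final Answer")
```

The driving loop then executes the named tool, appends an Observation: line to the transcript, and calls the LLM again until parse_step returns "final".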

Running the demo:

Version 1: Built-in agent (recommended for beginners)

# 1. Start the Ollama server in a terminal
ollama serve

# 2. Pull a model (in another terminal)
ollama pull qwen2.5:3b

# 3. Run the ReAct agent chatbot
python demos/langchain_patterns/react_agent_chatbot.py

# 4. Open the URL shown in the terminal (usually http://127.0.0.1:7860)

Version 2: Manual implementation (educational)

# Same setup as Version 1, but run:
python demos/langchain_patterns/react_agent_chatbot_manual.py

# This version shows explicit Thought → Action → Observation cycles

Try these example questions:

  • “How many days until Christmas from today?”

  • “Calculate 15% tip on a $47.50 bill”

  • “I was born on March 15, 1990. How old am I in days?”

  • “What’s 25% of 360, divided by 3?”

  • “How many weeks between today and New Year’s Day 2027?”

What to observe:

  • Watch the Reasoning Process panel (right side) to see how the agent thinks

  • Notice when it decides to use tools vs. when it can answer directly

  • See the Thought → Action → Observation loop in action

  • Try asking multi-step questions that require multiple tool calls

  • Compare both versions: Run the same question through both demos to see how the manual implementation exposes the mechanics that LangChain handles automatically

Demo 7: RAG system

File: demos/rag_system/rag_demo.py

Concepts covered:

  • Retrieval-Augmented Generation (RAG) pipeline

  • Document ingestion, chunking, and embedding

  • Vector similarity search with pgvector

  • Grounded LLM responses with source citations

  • Modular ingestor pattern (BaseIngestor)

Tools used:

  • LangChain - RAG chain composition and retriever

  • HuggingFace - Local embedding model (all-MiniLM-L6-v2)

  • PostgreSQL + pgvector - Vector store

  • Ollama or llama.cpp - Backend LLM

  • Gradio - Web interface with Ingest / Query / Settings tabs

Running the demo:

# 1. Ensure PostgreSQL with pgvector is accessible and .env is configured
#    (DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME)

# 2. Start your LLM backend (llama.cpp is the default; Ollama also works)
ollama serve   # or start llama-server as in Demo 3
# 2. Start your LLM backend (llama.cpp is the default; Ollama also works)

# 3. Run the RAG demo
python demos/rag_system/rag_demo.py

# 4. Open the URL shown in the terminal (usually http://127.0.0.1:7860)

Three tabs:

  1. Ingest: Choose a source (Wikipedia), enter a topic, and click Ingest to embed and store chunks in the knowledge base

  2. Query: Ask questions - the retriever finds the most relevant chunks and passes them as context to the LLM

  3. Settings: Switch between Ollama and llama.cpp backends; clear the vector store collection

What to observe:

  • The Sources panel shows which document chunks were retrieved for each answer

  • Ingest the same topic twice to see deduplication behaviour

  • Ask a question about something not ingested - notice how the grounded answer differs from a hallucinated one

  • Switch backends (Ollama vs. llama.cpp) to compare answer quality
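The ingestion side of the pipeline can be sketched as below. The fixed-size chunker is a minimal stdlib stand-in (the demo likely uses a LangChain text splitter), and the embedding call assumes the sentence-transformers package for the all-MiniLM-L6-v2 model named above; the pgvector write is omitted:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap, so context
    spanning a boundary appears in both neighbouring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]):
    """Turn chunks into vectors ready to insert into the vector store."""
    from sentence_transformers import SentenceTransformer  # third-party

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
    return model.encode(chunks)
```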

Demo 8: Fine-tuning and alignment

File: demos/finetuning/finetuning_demo.py

Concepts covered:

  • Behavioral difference between a base model and its instruction-tuned counterpart

  • How the chat template links fine-tuning format to inference format

  • What SFT and DPO training data actually looks like (Alpaca JSON, ChatML, DPO preference pairs)

Tools used:

  • HuggingFace Transformers - model loading and chat templates

  • Gradio - web interface

Running the demo:

# Models are downloaded from HuggingFace on first run (~500 MB each).
# Set HF_HOME to control the cache directory.

python demos/finetuning/finetuning_demo.py

# Open the URL shown in the terminal (usually http://127.0.0.1:7860)

Two tabs:

  1. Model comparison: The same prompt is sent simultaneously to Qwen/Qwen2.5-0.5B (base, raw text completion) and Qwen/Qwen2.5-0.5B-Instruct (instruction-tuned, chat template); the responses are shown side by side

  2. Dataset formatter: Enter an instruction and ideal output; see it formatted as Alpaca JSON, ChatML, and DPO preference pairs

What to observe:

  • On the completion-trap prompt (“Things I need from the grocery store: 1. Milk 2. Eggs 3.”), the base model continues the list while the instruct model responds to the intent

  • The Sources column in the model table shows the actual HuggingFace checkpoint IDs - these are genuinely different weight files, not aliases

  • The chat template in Tab 2 shows exactly the format the instruct model was trained on - the <|im_start|>system ... <|im_end|> tokens are what distinguish the two checkpoints at the data level
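To make those special tokens concrete, here is a hand-rolled version of the ChatML layout; the demo would obtain this from tokenizer.apply_chat_template, but the structure is the same:

```python
def to_chatml(messages: list[dict]) -> str:
    """Render messages in the ChatML format Qwen instruct models train on,
    ending with an open assistant turn for the model to complete."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"
```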

Demo 9: LLM evaluation

File: demos/evaluation/evaluation_demo.py

Concepts covered:

  • Automated text metrics: ROUGE-1/2/L, BLEU, and BERTScore

  • Standardised multiple-choice benchmarking (MMLU-style)

  • LLM-as-judge rubric scoring with structured JSON output

  • Limitations of each approach and when to use each one

Tools used:

  • HuggingFace evaluate + bert-score - ROUGE, BLEU, and BERTScore metrics

  • Ollama - benchmark and judge model

  • Gradio - web interface

Running the demo:

# 1. Install evaluation dependencies
pip install evaluate bert-score

# 2. Start the Ollama server in a terminal
ollama serve

# 3. Pull the benchmark/judge model (in another terminal)
ollama pull qwen2.5:3b

# 4. Run the evaluation demo
python demos/evaluation/evaluation_demo.py

# 5. Open the URL shown in the terminal (usually http://127.0.0.1:7860)

Note: BERTScore downloads a ~400 MB BERT model on first use; subsequent runs load it from the local cache.

Three tabs:

  1. Metric calculator: Enter a reference and candidate text; compute ROUGE-1/2/L, BLEU, and BERTScore F1 in one click. Pre-filled example illustrates the paraphrase problem.

  2. Mini benchmark: Run qwen2.5:3b against 10 MMLU-style questions (Science, History, Math, Coding); filter by category; see per-question pass/fail and a category breakdown.

  3. LLM-as-judge: Score a candidate answer on a 1-5 rubric (factual accuracy, relevance, completeness); the judge returns structured JSON parsed into a formatted score table.

What to observe:

  • In Tab 1: compare exact match, paraphrase, and factual error pairs — ROUGE and BERTScore diverge on the paraphrase (same meaning, different words)

  • In Tab 2: the model is instructed to reply with a single letter; see how it handles this constraint

  • In Tab 3: try the pre-filled “seasons misconception” answer — does the judge detect the factual error?

  • In Tab 3: compare the concise and padded candidate answers from the activity — does the judge exhibit verbosity bias?
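To demystify what Tab 1 computes, here is a hand-rolled ROUGE-1 F1 (unigram overlap); the demo itself uses the evaluate library, which also handles ROUGE-2/L, stemming, and tokenisation details this sketch ignores:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """F1 over unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped matching-token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because it only counts word overlap, a faithful paraphrase scores poorly here - exactly the weakness that BERTScore's embedding-based matching addresses.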