Dockerfile guide

A Dockerfile contains instructions executed in order to build an image. Understanding these instructions is key to creating effective containerized applications.

Essential instructions

FROM

Specifies the base image to build upon.

FROM python:3.11-slim

Best practice: Pin a specific tag (such as 3.11-slim) instead of latest, which can point to a different image from one build to the next.

WORKDIR

Sets the working directory inside the image. Subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions resolve relative paths against this directory.

WORKDIR /app

Tip: WORKDIR creates the directory if it doesn’t exist.
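As a minimal sketch of how WORKDIR affects later instructions (the src subdirectory is a hypothetical name chosen for illustration):

```dockerfile
FROM python:3.11-slim

# Creates /app if it is missing and makes it the current directory
WORKDIR /app

# The relative destination "." now resolves to /app
COPY requirements.txt .

# RUN commands also execute in /app
RUN pwd

# A relative WORKDIR is resolved against the previous one, giving /app/src
WORKDIR src
```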

COPY

Copies files from the build context on your host machine into the image.

COPY requirements.txt /app/
COPY . /app

Pattern: Copy the dependency list first and the rest of the code afterwards, so that code changes don’t invalidate the cached dependency-install layer.

RUN

Executes commands during the image build process. Commonly used to install dependencies.

RUN pip install --no-cache-dir -r requirements.txt

Best practice: Combine related commands with && to reduce layers.

CMD

Specifies the default command to run when a container starts.

CMD ["python", "app.py"]

Note: Only the last CMD in a Dockerfile takes effect. It is easily overridden at runtime by passing a command after the image name in docker run.

Important: A container runs only as long as its command is running. When the command exits, the container stops. If you don’t specify a CMD, the image inherits the one from its base image. If that command is an interactive program such as a shell or the Python interpreter and you run the container without -it, the container will start and immediately exit. For long-running services like web apps, the CMD should start a server that keeps running. For batch jobs, the container exits when processing completes.
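The difference between a long-running service and a batch job can be sketched as follows. This is an illustration only: it uses Python's built-in http.server module as a stand-in for a real web app, and the one-line batch command is hypothetical.

```dockerfile
FROM python:3.11-slim

# Long-running service: the HTTP server blocks until it is stopped,
# so the container keeps running
CMD ["python", "-m", "http.server", "8000"]

# Batch-style alternative: prints once and exits, so the container stops
# as soon as the command finishes
# CMD ["python", "-c", "print('processing complete')"]
```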

EXPOSE

Documents which ports the container listens on.

EXPOSE 8080

Important: EXPOSE doesn’t actually publish the port. Publish it at run time with -p, for example docker run -p 8080:8080 my-image.
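A short sketch of the EXPOSE forms, assuming a hypothetical service that also listens on UDP port 8125:

```dockerfile
# TCP is the default protocol
EXPOSE 8080

# UDP has to be stated explicitly (8125 is an illustrative port)
EXPOSE 8125/udp

# Neither line publishes anything; mapping to the host happens at run time:
#   docker run -p 8080:8080 -p 8125:8125/udp my-image
```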

ENV

Sets environment variables.

ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/models
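Variables set with ENV are available both to later build instructions and to the running container. A minimal sketch, assuming a model.pkl file exists in the build context and using /mnt/models as a hypothetical override path:

```dockerfile
FROM python:3.11-slim

ENV MODEL_PATH=/models

# Dockerfile instructions such as COPY and WORKDIR expand the variable at build time
COPY model.pkl ${MODEL_PATH}/

# RUN commands see it as an ordinary shell environment variable
RUN ls -l "$MODEL_PATH"

# The value is also set in the running container and can be overridden there:
#   docker run -e MODEL_PATH=/mnt/models my-image
```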

USER

Sets the user that subsequent RUN instructions and the container’s main command run as.

RUN useradd -m appuser
USER appuser

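Security: Avoid running as root in production. If the application is compromised, a non-root user limits what an attacker can do inside the container.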
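USER does not change who owns files added by COPY; they belong to root unless told otherwise. One way to handle this, sketched here with the appuser account from above and an assumed app.py entry point, is the --chown flag of COPY:

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Create the account while still running as root
RUN useradd -m appuser

# Hand the application files to that user as they are copied
COPY --chown=appuser:appuser . .

# Everything below, including CMD, runs as appuser
USER appuser
CMD ["python", "app.py"]
```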

Complete example

Here’s a well-structured Dockerfile for an ML application:

# Use specific Python version
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create non-root user
RUN useradd -m -s /bin/bash mluser && \
    chown -R mluser:mluser /app
USER mluser

# Set environment variables
ENV PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8000

# Run application
CMD ["python", "app.py"]

Best practices

Layer caching

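Docker caches each layer. When an instruction or the files it copies change, that layer and every layer after it are rebuilt. Order instructions from least to most frequently changing: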

# Good - dependencies change less often than code
FROM python:3.11-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

# Bad - code changes invalidate dependency cache
FROM python:3.11-slim
COPY . .
RUN pip install -r requirements.txt

Minimize layers

Combine related RUN commands:

# Good - one layer
RUN apt-get update && \
    apt-get install -y git curl && \
    rm -rf /var/lib/apt/lists/*

# Bad - three layers; the rm in the last layer cannot shrink the earlier ones
RUN apt-get update
RUN apt-get install -y git curl
RUN rm -rf /var/lib/apt/lists/*

Use .dockerignore

Exclude unnecessary files from the build context:

# .dockerignore
__pycache__/
*.pyc
.git/
.venv/
*.md
.DS_Store

Keep images small

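# Use slim/alpine variants
# (Dockerfile comments must sit on their own line, not after an instruction)
# Slim variant: roughly 120 MB
FROM python:3.11-slim
# vs. the full image: roughly 900 MB
FROM python:3.11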

# Clean up in same layer
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*

# Use --no-cache-dir with pip
RUN pip install --no-cache-dir pandas
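When dependencies need compilers or other build tools, a multi-stage build is another way to keep the final image small. The following is a sketch under assumptions, not a definitive setup: it assumes some packages in requirements.txt need build-essential to compile, and the /opt/venv location is an illustrative choice.

```dockerfile
# Stage 1 (builder): compilers and headers live only in this stage
FROM python:3.11-slim AS builder
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential && \
    rm -rf /var/lib/apt/lists/*
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2 (final image): start clean and copy in only the installed packages
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["python", "app.py"]
```

Only the last stage ends up in the tagged image, so the build tools installed in the builder stage add nothing to its size.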

Specific tags

# Good - reproducible
FROM python:3.11.5-slim

# Bad - breaks when "latest" updates
FROM python:latest

Security

# Create and use non-root user
RUN groupadd -r appgroup && \
    useradd -r -g appgroup appuser
USER appuser

# Don't expose unnecessary ports
# Only EXPOSE what's needed

Common patterns

ML training container

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
WORKDIR /workspace

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY train.py .
COPY data/ data/

CMD ["python", "train.py"]

Model serving container

FROM python:3.11-slim
WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.pkl .
COPY serve.py .

EXPOSE 8000
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]

Jupyter notebook container

# latest is used here for brevity; pin a specific tag for reproducible builds
FROM jupyter/scipy-notebook:latest

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

EXPOSE 8888
# The base image already runs as a non-root user, so --allow-root is not needed
CMD ["jupyter", "lab", "--ip=0.0.0.0"]

Debugging tips

Build with verbose output

docker build --progress=plain --no-cache -t my-image .

Inspect intermediate layers

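# List the layers and the instruction that created each one
docker history my-image

# Run a shell in a specific layer
# (works with the legacy builder; BuildKit builds show <missing> instead of intermediate IDs)
docker run -it <layer-id> /bin/bash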

Check build context size

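# See how much data is being sent to the Docker daemon
# BuildKit reports it as "transferring context"
docker build --progress=plain --no-cache -t test . 2>&1 | grep "transferring context"
# The legacy builder prints "Sending build context to Docker daemon" instead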

Next steps

Now that you understand Dockerfiles, try the hands-on labs:

Lab 1: Data cleaner container - Basic Dockerfile
Lab 2: Streamlit dashboard container - Web app with port mapping
Lab 3: ML development container - Dev container configuration