Lab 1: Data cleaner container

Learning objectives

  • Understand the basic structure of a Dockerfile

  • Build your first Docker image

  • Run a containerized Python application

  • Use volume mounts to share data between host and container

  • See how containers enable modular ML pipeline components

What’s included

  • clean_data.py: A pandas-based data cleaning script

  • Dockerfile: Instructions to build the container image

  • sample_data.csv: Sample dataset with duplicates and missing values

  • .dockerignore: Files to exclude from the Docker build context

The scenario

Imagine you’re building a production ML pipeline. The data cleaning step is containerized as a separate component that can:

  • Run independently of other pipeline stages

  • Scale horizontally to process multiple datasets in parallel

  • Be updated without affecting training or serving containers

  • Run consistently across development, staging, and production environments

Step-by-step instructions

1. Examine the Dockerfile

Navigate to the example directory and open the Dockerfile:

cd 01-data-cleaner

Notice the structure:

FROM python:3.11-slim            # Start with Python base image
WORKDIR /app                     # Set working directory
RUN pip install pandas           # Install dependencies
COPY clean_data.py .             # Copy our script
CMD ["python", "clean_data.py"]  # Default command

Each instruction creates a layer in the image. Docker caches layers, so rebuilds are fast when only later layers change.
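Because each instruction creates a cached layer, Dockerfiles conventionally install dependencies before copying application code, so editing the script does not re-run pip install. A sketch of a cache-friendlier variant (the requirements.txt file is a hypothetical addition, not part of this lab):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .               # hypothetical: dependencies pinned in a file
RUN pip install -r requirements.txt   # cached unless requirements.txt changes
COPY clean_data.py .                  # script edits invalidate only this layer
CMD ["python", "clean_data.py"]
```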

2. Build the Docker image

From the 01-data-cleaner directory, run:

docker build -t data-cleaner .

What’s happening:

  • -t data-cleaner: Tags the image with the name “data-cleaner”

  • .: Build context is the current directory

Watch Docker execute each Dockerfile instruction and create layers.

3. Verify the image

List your Docker images, filtering by repository name (this works on any OS, unlike piping to grep):

docker images data-cleaner

You should see your newly created image with its size and creation time.

4. Prepare data directory

Create directories for input and output:

macOS/Linux:

mkdir -p data/input data/output
cp sample_data.csv data/input/raw_data.csv

Windows PowerShell:

mkdir data/input, data/output
cp sample_data.csv data/input/raw_data.csv

Windows Command Prompt:

mkdir data\input data\output
copy sample_data.csv data\input\raw_data.csv

5. Run the container

macOS/Linux:

docker run --rm -v "$(pwd)/data:/data" data-cleaner

Windows PowerShell:

docker run --rm -v "${PWD}/data:/data" data-cleaner

Windows Command Prompt:

docker run --rm -v "%cd%/data:/data" data-cleaner

What’s happening:

  • --rm: Automatically remove the container when it exits

  • -v: Mount your local data/ directory to /data in the container (syntax varies by OS/shell)

  • data-cleaner: The image to run

6. Examine the output

The script displays:

  • Original dataset statistics

  • Missing values found

  • Cleaning operations performed

  • Cleaned dataset summary

Check the cleaned file:

macOS/Linux:

cat data/output/cleaned_data.csv

Windows:

type data\output\cleaned_data.csv

Notice:

  • Duplicates removed (Alice and Bob appeared twice)

  • Rows with missing values removed (David had no age, Eve had no score)

  • Whitespace trimmed (Frank’s city “ Austin ” became “Austin”)
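The cleaning steps listed above can be sketched in pandas. This is a hypothetical mirror of what clean_data.py likely does, not the lab's actual script, and the column names are assumptions:

```python
import pandas as pd

# Hypothetical sketch of the cleaning described above; the real
# clean_data.py may differ (column names here are assumed).
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Frank", "David"],
    "age": [30, 30, 45, None],
    "city": ["Boston", "Boston", "  Austin  ", "Denver"],
})

df = df.drop_duplicates()            # remove exact duplicate rows
df = df.dropna()                     # drop rows with missing values
df["city"] = df["city"].str.strip()  # trim surrounding whitespace

print(df.to_string(index=False))
```

Running this leaves only the rows that are unique and complete, with city values trimmed.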

Key concepts

  • Dockerfile basics: FROM, WORKDIR, RUN, COPY, CMD

  • Image building: Creating reusable templates for containers

  • Volume mounts: Sharing data between host and container

  • Container isolation: The script runs in its own environment with only pandas installed

  • Modularity: This component could be part of a larger pipeline, processing data before training

Experiment further

Try these modifications:

  1. Change the cleaning logic: Edit clean_data.py to fill missing values instead of dropping them

  2. Rebuild and rerun: Notice Docker’s layer caching makes rebuilds fast

  3. Process different data: Replace sample_data.csv with your own CSV file

  4. Check container cleanup: Run docker ps -a to verify --rm removed the container
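For modification 1, one common approach is to fill numeric gaps with a column statistic instead of dropping the row. A minimal sketch, assuming a score column and a mean-fill strategy (neither is prescribed by the lab):

```python
import pandas as pd

# Sketch for modification 1: fill missing values instead of dropping rows.
# The column name and fill strategy are assumptions, not the lab's code.
df = pd.DataFrame({"name": ["Eve", "Bob"], "score": [None, 80.0]})

# Replace missing scores with the column mean rather than calling dropna()
df["score"] = df["score"].fillna(df["score"].mean())
print(df)
```

After a change like this, rebuilding with docker build -t data-cleaner . reuses the cached base-image and pip layers, so only the COPY and later layers are redone.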

What’s next?

In Lab 2: Streamlit dashboard container, you’ll containerize an interactive web application and learn about port mapping to access apps running inside containers.

Troubleshooting: If you’re stuck, check that:

  • Docker Desktop is running

  • You’re in the 01-data-cleaner/ directory

  • The data directories were created correctly