Lab 1: Data cleaner container

Learning objectives

  • Understand the basic structure of a Dockerfile

  • Build your first Docker image

  • Run a containerized Python application

  • Use volume mounts to share data between host and container

  • See how containers enable modular ML pipeline components

What’s included

  • clean_data.py: A pandas-based data cleaning script

  • Dockerfile: Instructions to build the container image

  • sample_data.csv: Sample dataset with duplicates and missing values

  • .dockerignore: Files to exclude from the Docker build context

The scenario

Imagine you’re building a production ML pipeline. The data cleaning step is containerized as a separate component that can:

  • Run independently of other pipeline stages

  • Scale horizontally to process multiple datasets in parallel

  • Be updated without affecting training or serving containers

  • Run consistently across development, staging, and production environments

Step-by-step instructions

1. Examine the Dockerfile

Navigate to the example directory and open the Dockerfile:

cd 01-data-cleaner

Notice the structure:

FROM python:3.11-slim            # Start with Python base image
WORKDIR /app                     # Set working directory
RUN pip install pandas           # Install dependencies
COPY clean_data.py .             # Copy our script
CMD ["python", "clean_data.py"]  # Default command

Each instruction creates a layer in the image. Docker caches layers, so rebuilds are fast when only later layers change.
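Because each instruction creates a cached layer, Dockerfiles conventionally install dependencies before copying application code, so editing the script does not re-run pip install. A sketch of a cache-friendlier variant (the requirements.txt file is a hypothetical addition, not part of this lab):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .               # hypothetical: dependencies pinned in a file
RUN pip install -r requirements.txt   # cached unless requirements.txt changes
COPY clean_data.py .                  # script edits invalidate only this layer
CMD ["python", "clean_data.py"]
```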

2. Build the Docker image

From the 01-data-cleaner directory, run:

docker build -t data-cleaner .

What’s happening:

  • -t data-cleaner: Tags the image with the name “data-cleaner”

  • .: Build context is the current directory

Watch Docker execute each Dockerfile instruction and create layers.

3. Verify the image

List your Docker images, filtering by repository name (this works on any OS, unlike piping to grep):

docker images data-cleaner

You should see your newly created image with its size and creation time.

4. Prepare data directory

Create directories for input and output:

macOS/Linux:

mkdir -p data/input data/output
cp sample_data.csv data/input/raw_data.csv

Windows PowerShell:

mkdir data/input, data/output
cp sample_data.csv data/input/raw_data.csv

Windows Command Prompt:

mkdir data\input data\output
copy sample_data.csv data\input\raw_data.csv

5. Run the container

macOS/Linux:

docker run --rm -v "$(pwd)/data:/data" data-cleaner

Windows PowerShell:

docker run --rm -v "${PWD}/data:/data" data-cleaner

Windows Command Prompt:

docker run --rm -v "%cd%/data:/data" data-cleaner

What’s happening:

  • --rm: Automatically remove the container when it exits

  • -v: Mount your local data/ directory to /data in the container (syntax varies by OS/shell)

  • data-cleaner: The image to run

6. Examine the output

The script displays:

  • Original dataset statistics

  • Missing values found

  • Cleaning operations performed

  • Cleaned dataset summary

Check the cleaned file:

macOS/Linux:

cat data/output/cleaned_data.csv

Windows:

type data\output\cleaned_data.csv

Notice:

  • Duplicates removed (Alice and Bob appeared twice)

  • Rows with missing values removed (David had no age, Eve had no score)

  • Whitespace trimmed (Frank’s city “ Austin ” became “Austin”)
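The cleaning steps listed above can be sketched in pandas. This is a hypothetical mirror of what clean_data.py likely does, not the lab's actual script, and the column names are assumptions:

```python
import pandas as pd

# Hypothetical sketch of the cleaning described above; the real
# clean_data.py may differ (column names here are assumed).
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Frank", "David"],
    "age": [30, 30, 45, None],
    "city": ["Boston", "Boston", "  Austin  ", "Denver"],
})

df = df.drop_duplicates()            # remove exact duplicate rows
df = df.dropna()                     # drop rows with missing values
df["city"] = df["city"].str.strip()  # trim surrounding whitespace

print(df.to_string(index=False))
```

Running this leaves only the rows that are unique and complete, with city values trimmed.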

Key concepts

  • Dockerfile basics: FROM, WORKDIR, RUN, COPY, CMD

  • Image building: Creating reusable templates for containers

  • Volume mounts: Sharing data between host and container

  • Container isolation: The script runs in its own environment with only pandas installed

  • Modularity: This component could be part of a larger pipeline, processing data before training

Experiment further

Try these modifications:

  1. Change the cleaning logic: Edit clean_data.py to fill missing values instead of dropping them

  2. Rebuild and rerun: Notice Docker’s layer caching makes rebuilds fast

  3. Process different data: Replace sample_data.csv with your own CSV file

  4. Check container cleanup: Run docker ps -a to verify --rm removed the container
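For modification 1, one common approach is to fill numeric gaps with a column statistic instead of dropping the row. A minimal sketch, assuming a score column and a mean-fill strategy (neither is prescribed by the lab):

```python
import pandas as pd

# Sketch for modification 1: fill missing values instead of dropping rows.
# The column name and fill strategy are assumptions, not the lab's code.
df = pd.DataFrame({"name": ["Eve", "Bob"], "score": [None, 80.0]})

# Replace missing scores with the column mean rather than calling dropna()
df["score"] = df["score"].fillna(df["score"].mean())
print(df)
```

After a change like this, rebuilding with docker build -t data-cleaner . reuses the cached base-image and pip layers, so only the COPY and later layers are redone.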

What’s next?

In Lab 2: Streamlit dashboard container, you’ll containerize an interactive web application and learn about port mapping to access apps running inside containers.

Troubleshooting: If you’re stuck, check that:

  • Docker Desktop is running

  • You’re in the 01-data-cleaner/ directory

  • The data directories were created correctly