# Lab 1: Data cleaner container
## Learning objectives

- Understand the basic structure of a Dockerfile
- Build your first Docker image
- Run a containerized Python application
- Use volume mounts to share data between host and container
- See how containers enable modular ML pipeline components
## What’s included

- `clean_data.py`: A pandas-based data cleaning script
- `Dockerfile`: Instructions to build the container image
- `sample_data.csv`: Sample dataset with duplicates and missing values
- `.dockerignore`: Files to exclude from the Docker build context
## The scenario

Imagine you’re building a production ML pipeline. The data cleaning step is containerized as a separate component that can:

- Run independently of other pipeline stages
- Scale horizontally to process multiple datasets in parallel
- Be updated without affecting training or serving containers
- Run consistently across development, staging, and production environments
## Step-by-step instructions

### 1. Examine the Dockerfile

Navigate to the example directory and open the Dockerfile:

```bash
cd 01-data-cleaner
```
Notice the structure:

```dockerfile
FROM python:3.11-slim            # Start with a Python base image
WORKDIR /app                     # Set the working directory
RUN pip install pandas           # Install dependencies
COPY clean_data.py .             # Copy our script
CMD ["python", "clean_data.py"]  # Default command
```
Each instruction creates a layer in the image. Docker caches layers, so rebuilds are fast when only later layers change.
### 2. Build the Docker image

From the `01-data-cleaner` directory, run:

```bash
docker build -t data-cleaner .
```
What’s happening:

- `-t data-cleaner`: Tags the image with the name “data-cleaner”
- `.`: Build context is the current directory
Watch Docker execute each Dockerfile instruction and create layers.
### 3. Verify the image

List your Docker images:

```bash
docker images | grep data-cleaner
```
You should see your newly created image with its size and creation time.
### 4. Prepare data directory

Create directories for input and output.

macOS/Linux:

```bash
mkdir -p data/input data/output
cp sample_data.csv data/input/raw_data.csv
```

Windows (PowerShell):

```powershell
mkdir data\input, data\output
cp sample_data.csv data/input/raw_data.csv
```

Windows (CMD):

```cmd
mkdir data\input data\output
copy sample_data.csv data\input\raw_data.csv
```
### 5. Run the container

macOS/Linux:

```bash
docker run --rm -v "$(pwd)/data:/data" data-cleaner
```

Windows (PowerShell):

```powershell
docker run --rm -v "${PWD}/data:/data" data-cleaner
```

Windows (CMD):

```cmd
docker run --rm -v "%cd%/data:/data" data-cleaner
```
What’s happening:

- `--rm`: Automatically remove the container when it exits
- `-v`: Mount your local `data/` directory to `/data` in the container (syntax varies by OS/shell)
- `data-cleaner`: The image to run
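Because of the bind mount, the script inside the container can hard-code container-side paths (`/data/...`) no matter where the host data actually lives. A minimal sketch of that pattern — the function name and default paths here are illustrative, not taken from the actual `clean_data.py`:

```python
import pandas as pd

def clean(input_path="/data/input/raw_data.csv",
          output_path="/data/output/cleaned_data.csv"):
    """Read a CSV from the mounted input path, clean it, and write the result.

    The default paths assume the -v mount shown above; the real
    clean_data.py may use different paths or cleaning logic.
    """
    df = pd.read_csv(input_path)
    df = df.drop_duplicates().dropna()  # simple cleaning, for illustration
    df.to_csv(output_path, index=False)
    return len(df)  # number of rows that survived cleaning
```

This is why volume mounts matter: the same image processes any dataset you place under `data/input/` on the host, with no change to the container.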
### 6. Examine the output

The script displays:

- Original dataset statistics
- Missing values found
- Cleaning operations performed
- Cleaned dataset summary

Check the cleaned file:

macOS/Linux:

```bash
cat data/output/cleaned_data.csv
```

Windows (CMD):

```cmd
type data\output\cleaned_data.csv
```
Notice:

- Duplicates removed (Alice and Bob appeared twice)
- Rows with missing values removed (David had no age, Eve had no score)
- Whitespace trimmed (Frank’s city `" Austin "` became `"Austin"`)
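The cleaning operations above can be sketched in a few lines of pandas. This is a minimal illustration with made-up rows, not the actual `clean_data.py` (which lives in the repo and may differ in detail):

```python
import pandas as pd

# Toy data with the same kinds of problems as sample_data.csv:
# a duplicate row, a missing value, and stray whitespace.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "David", "Frank"],
    "age":  [30,      30,      None,    41],
    "city": ["NYC",   "NYC",   "LA",    " Austin "],
})

df = df.drop_duplicates()            # remove exact duplicate rows
df = df.dropna()                     # drop rows with missing values
df["city"] = df["city"].str.strip()  # trim leading/trailing whitespace

print(df)
```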
## Key concepts

- **Dockerfile basics**: `FROM`, `WORKDIR`, `RUN`, `COPY`, `CMD`
- **Image building**: Creating reusable templates for containers
- **Volume mounts**: Sharing data between host and container
- **Container isolation**: The script runs in its own environment with only pandas installed
- **Modularity**: This component could be part of a larger pipeline, processing data before training
## Experiment further

Try these modifications:

- **Change the cleaning logic**: Edit `clean_data.py` to fill missing values instead of dropping them
- **Rebuild and rerun**: Notice Docker’s layer caching makes rebuilds fast
- **Process different data**: Replace `sample_data.csv` with your own CSV file
- **Check container cleanup**: Run `docker ps -a` to verify `--rm` removed the container
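For the first experiment, one way to fill missing values instead of dropping rows is `fillna`. A minimal sketch — the column names are illustrative, and the median is just one reasonable fill strategy:

```python
import pandas as pd

df = pd.DataFrame({"age": [30, None, 25], "score": [0.9, 0.8, None]})

# Fill numeric gaps with the column median instead of dropping the row.
df["age"] = df["age"].fillna(df["age"].median())
df["score"] = df["score"].fillna(df["score"].median())
```

After swapping `dropna()` for something like this in `clean_data.py`, rebuild with `docker build -t data-cleaner .` and rerun; only the `COPY` layer and later change, so the rebuild reuses cached layers.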
## What’s next?
In Lab 2: Streamlit dashboard container, you’ll containerize an interactive web application and learn about port mapping to access apps running inside containers.
---
Troubleshooting: If you’re stuck, check that:

- Docker Desktop is running
- You’re in the `01-data-cleaner/` directory
- The data directories were created correctly