This repository contains the full project used in the Amazon ETL with Apache Airflow and Docker tutorial series. It demonstrates a complete end-to-end ETL workflow using Apache Airflow, Docker, and MySQL, from local development through to production-like deployment patterns.
amazon-docker-tutorial/: The primary project folder containing the full Airflow setup used in Part One and Part Two of the tutorial.
You’ll find three main subdirectories:
- part-one/: The complete DAGs and configuration for the first part of the tutorial.
- part-two/: Git-based DAG syncing, CI validation, and lightweight deployment automation.
- amazon-etl/: The Amazon scraping + transformation code used for the real-world ETL example.
Inside amazon-docker-tutorial/ you’ll find:
- docker-compose.yaml: A fully configured Airflow environment (api-server, scheduler, triggerer, DAG processor, metadata DB, logs).
The project walks through:
- Running Airflow inside Docker Compose
- Scraping real Amazon book data
- Transforming messy HTML output into clean analytics data
- Loading data into MySQL
- Syncing DAGs from GitHub using git-sync
- Lightweight CI for validating DAGs on every push
Clone the repository:
git clone git@github.com:dataquestio/tutorials.git
cd amazon-docker-tutorial
Create required folders:
mkdir -p ./dags ./logs ./plugins ./config
Initialize Airflow:
docker compose up airflow-init
Start all services:
docker compose up -d
Access the Airflow UI:
http://localhost:8080
Credentials:
Username: airflow
Password: airflow
The project builds a real-world ETL pipeline that:
- Extracts book listings from Amazon (title, author, price, rating)
- Transforms the raw HTML data into numeric fields
- Loads the cleaned dataset into a MySQL table
- Runs automatically on a daily schedule
Files are saved to /opt/airflow/tmp/ inside the container.
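As a rough illustration of the transform step described above, the sketch below turns scraped strings into numeric fields. It is not the code from this repository: the function names (`parse_price`, `parse_rating`, `transform`), the dictionary keys, and the example input formats such as "$12.99" and "4.5 out of 5 stars" are assumptions made for the example.

```python
import re
from typing import Optional


def parse_price(raw: str) -> Optional[float]:
    """Pull the first decimal number out of a price string such as '$1,234.50'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw or "")
    if match is None:
        return None
    # Drop thousands separators before converting to float.
    return float(match.group().replace(",", ""))


def parse_rating(raw: str) -> Optional[float]:
    """Pull the leading score out of a rating string such as '4.7 out of 5 stars'."""
    match = re.search(r"\d+(?:\.\d+)?", raw or "")
    return float(match.group()) if match else None


def transform(rows):
    """Convert scraped rows to numeric fields, skipping rows without a usable price."""
    cleaned = []
    for row in rows:
        price = parse_price(row.get("price", ""))
        if price is None:
            continue
        cleaned.append({
            "title": row.get("title", "").strip(),
            "author": row.get("author", "").strip(),
            "price": price,
            "rating": parse_rating(row.get("rating", "")),
        })
    return cleaned
```

Rows with no parseable price are dropped here so that the load step only sees numeric values; a real pipeline might prefer to keep them and store NULL instead.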
You’ll also learn how to:
- Sync DAGs automatically from GitHub into Airflow using git-sync
- Validate DAG syntax on every push using a lightweight GitHub Actions workflow
These are production-like patterns for managing Airflow safely and collaboratively.
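One lightweight way to validate DAG syntax in CI is to parse every Python file in the DAG folder without importing it. The sketch below does this with the standard library's `ast` module. It is an assumed approach, not necessarily what the tutorial's GitHub Actions workflow runs, and the names `find_syntax_errors` and `main` and the default `dags` folder are hypothetical.

```python
import ast
from pathlib import Path


def find_syntax_errors(dag_dir):
    """Return (path, message) pairs for every .py file under dag_dir that fails to parse."""
    errors = []
    for path in sorted(Path(dag_dir).rglob("*.py")):
        try:
            # Parsing catches syntax errors without executing or importing the file.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors.append((str(path), f"line {exc.lineno}: {exc.msg}"))
    return errors


def main(argv):
    """Print any syntax errors and return an exit code suitable for a CI step."""
    dag_dir = argv[0] if argv else "dags"
    problems = find_syntax_errors(dag_dir)
    for path, message in problems:
        print(f"{path}: {message}")
    return 1 if problems else 0
```

A CI job could call `sys.exit(main(sys.argv[1:]))` from a small wrapper script so that a push with a broken DAG file fails the check. This only catches syntax errors; import errors or invalid DAG definitions would need a check that actually loads the DAGs with Airflow installed.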
Stop all running containers:
docker compose down
Reset the environment completely (including the metadata database):
docker compose down -v
- Explore the main DAG inside dags/amazon_books_etl.py
- Modify the extraction to use different Amazon categories or other sites
- Try connecting Airflow to cloud services (RDS, S3, ECS)
- Continue to the cloud deployment tutorial to run Airflow on Amazon ECS (Fargate)