Batch data pipeline

Batch data pipeline built with Airflow, DuckDB, Delta Lake, Trino, and Metabase, with built-in observability and data quality checks.

About this project

This project demonstrates a batch data pipeline following the Medallion Architecture (Bronze → Silver → Gold). It showcases how to ingest, clean, validate, aggregate, and visualize sales data using modern data engineering tools, all containerized with Docker for easy deployment.

🔎 Technologies Used:

  • Orchestration: Apache Airflow

  • Data Processing: DuckDB

  • Data Lake Format: Delta Lake

  • Object Storage: MinIO (S3-compatible)

  • Query Engine: Trino

  • Visualization: Metabase

  • Data Quality: Soda Core

  • Lineage: OpenLineage + Marquez

  • Observability: Prometheus + Grafana

Data Flow:

  1. Data Generator simulates realistic sales transactions with intentional data quality issues (~20% dirty data)

  2. Bronze Layer ingests raw JSON data to MinIO without validation

  3. Silver Layer cleans, validates, and deduplicates data into Delta Lake tables

  4. Quality Checks validate Silver data using Soda Core with 15+ checks

  5. Gold Layer creates business aggregations (daily sales, product performance, customer segments)

  6. Trino provides SQL interface for querying Delta Lake tables

  7. Metabase visualizes business metrics through interactive dashboards

  8. Marquez tracks data lineage across the entire pipeline

  9. Grafana monitors pipeline health and performance metrics
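
The Bronze-to-Silver step in the flow above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the record fields (`transaction_id`, `product_id`, `quantity`, `unit_price`, `ts`) and the validation rules are assumptions, and the real pipeline uses DuckDB and Delta Lake rather than in-memory lists.

```python
from datetime import datetime

# Hypothetical raw Bronze records; field names are illustrative assumptions.
RAW = [
    {"transaction_id": "t1", "product_id": "p1", "quantity": 2, "unit_price": 9.99, "ts": "2025-11-01T10:00:00"},
    {"transaction_id": "t1", "product_id": "p1", "quantity": 2, "unit_price": 9.99, "ts": "2025-11-01T10:00:00"},  # duplicate
    {"transaction_id": "t2", "product_id": None, "quantity": 1, "unit_price": 5.00, "ts": "2025-11-01T10:05:00"},  # missing product
    {"transaction_id": "t3", "product_id": "p2", "quantity": -1, "unit_price": 5.00, "ts": "2025-11-01T10:06:00"},  # negative quantity
    {"transaction_id": "t4", "product_id": "p2", "quantity": 1, "unit_price": 5.00, "ts": "not-a-date"},  # bad timestamp
    {"transaction_id": "t5", "product_id": "p3", "quantity": 1, "unit_price": 4.50, "ts": "2025-11-01T10:07:00"},
]


def is_valid(rec):
    """Return True if the record passes the (assumed) Silver validation rules."""
    if not rec.get("transaction_id") or not rec.get("product_id"):
        return False
    qty, price = rec.get("quantity"), rec.get("unit_price")
    if not isinstance(qty, int) or qty <= 0:
        return False
    if not isinstance(price, (int, float)) or price <= 0:
        return False
    try:
        datetime.fromisoformat(rec["ts"])
    except (KeyError, TypeError, ValueError):
        return False
    return True


def to_silver(raw):
    """Drop invalid rows, keep the first row per transaction_id, add an amount column."""
    seen = set()
    silver = []
    for rec in raw:
        if not is_valid(rec) or rec["transaction_id"] in seen:
            continue
        seen.add(rec["transaction_id"])
        silver.append({**rec, "amount": round(rec["quantity"] * rec["unit_price"], 2)})
    return silver


silver_rows = to_silver(RAW)
print([r["transaction_id"] for r in silver_rows])
```

Of the six sample records, only `t1` (once) and `t5` survive; the rest are rejected by validation or deduplication.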

πŸ† Key Features

βœ… Medallion Architecture

  • Bronze β†’ Silver β†’ Gold layers

  • Progressive data quality improvement

  • Clear separation of concerns

✅ Data Quality First

  • 15+ Soda Core validation rules

  • Automatic dirty data generation for testing

  • Quality metrics in Grafana
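
To give a feel for what rule-based validation of Silver data looks like, here is a small plain-Python stand-in. It does not use Soda Core itself; the check names are only loosely modeled on SodaCL-style expressions, and the column names are assumptions.

```python
def run_checks(rows):
    """Evaluate a few illustrative quality rules; returns {check_name: passed}."""
    ids = [r.get("transaction_id") for r in rows]
    quantities = [r.get("quantity") for r in rows]
    return {
        "row_count > 0": len(rows) > 0,
        "missing_count(transaction_id) = 0": all(i is not None for i in ids),
        "duplicate_count(transaction_id) = 0": len(ids) == len(set(ids)),
        "min(quantity) > 0": all(
            isinstance(q, (int, float)) and q > 0 for q in quantities
        ),
    }


def failed_checks(rows):
    """Return the names of the checks that did not pass."""
    return [name for name, passed in run_checks(rows).items() if not passed]


clean = [
    {"transaction_id": "t1", "quantity": 2},
    {"transaction_id": "t2", "quantity": 1},
]
dirty = clean + [{"transaction_id": "t1", "quantity": 0}]

print(failed_checks(clean))
print(failed_checks(dirty))
```

A pipeline task would typically fail, or at least raise an alert, when `failed_checks` returns a non-empty list.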

✅ ACID Transactions

  • Delta Lake for Silver/Gold layers

  • Time travel support

  • Schema evolution

✅ Idempotent Operations

  • Safe to re-run any task

  • Deduplication in Silver layer

  • Upsert operations in Gold layer
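
The idea behind an idempotent upsert can be shown with a keyed merge: writing the same batch twice leaves the table unchanged. This is a conceptual sketch using an in-memory dict, assuming a hypothetical `daily_sales` table keyed by `sale_date`; the project itself performs upserts against Delta Lake tables.

```python
def upsert(table, rows, key):
    """Merge rows into table (a dict keyed by the key columns); last write wins."""
    for row in rows:
        table[tuple(row[k] for k in key)] = dict(row)
    return table


# Hypothetical Gold table and batch; column names are assumptions.
daily_sales = {}
batch = [
    {"sale_date": "2025-11-01", "revenue": 25},
    {"sale_date": "2025-11-02", "revenue": 30},
]

upsert(daily_sales, batch, key=("sale_date",))
snapshot = dict(daily_sales)

# Re-running the same task with the same batch changes nothing.
upsert(daily_sales, batch, key=("sale_date",))
print(daily_sales == snapshot)

# A corrected value for an existing key replaces the old row instead of appending.
upsert(daily_sales, [{"sale_date": "2025-11-02", "revenue": 35}], key=("sale_date",))
print(len(daily_sales))
```

Because rows are addressed by key rather than appended, a retried or re-run task cannot create duplicates.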

✅ Parallel Processing

  • Gold aggregations run concurrently

  • ThreadPoolExecutor with 3 workers
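
A minimal sketch of running three Gold aggregations concurrently with `ThreadPoolExecutor(max_workers=3)`. The aggregation functions, the sample rows, and the spend threshold for segmentation are assumptions made for illustration; the real aggregations run against Delta Lake tables.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Silver rows; column names are assumptions.
SILVER = [
    {"date": "2025-11-01", "product_id": "p1", "customer_id": "c1", "amount": 20},
    {"date": "2025-11-01", "product_id": "p2", "customer_id": "c2", "amount": 5},
    {"date": "2025-11-02", "product_id": "p1", "customer_id": "c1", "amount": 30},
]


def sum_by(rows, column):
    """Sum the amount column grouped by the given column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row["amount"]
    return dict(totals)


def daily_sales(rows):
    return sum_by(rows, "date")


def product_performance(rows):
    return sum_by(rows, "product_id")


def customer_segments(rows, high_threshold=50):
    # The threshold is an arbitrary illustrative value.
    spend = sum_by(rows, "customer_id")
    return {c: ("high" if total >= high_threshold else "low") for c, total in spend.items()}


def build_gold(rows):
    """Run the three aggregations in parallel and collect results by name."""
    jobs = {
        "daily_sales": daily_sales,
        "product_performance": product_performance,
        "customer_segments": customer_segments,
    }
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in jobs.items()}
        # result() re-raises any exception from the worker, so failures are not silent.
        return {name: future.result() for name, future in futures.items()}


gold = build_gold(SILVER)
print(gold["daily_sales"])
```

Threads suit this kind of work when each aggregation spends most of its time waiting on I/O or inside a native query engine rather than in Python bytecode.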

✅ Full Observability

  • Custom Prometheus metrics

  • Grafana dashboards

  • Data lineage with Marquez

  • OpenLineage integration
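
For context on what a custom metric looks like when Prometheus scrapes it, the sketch below renders samples in the Prometheus text exposition format by hand. It is a stdlib-only stand-in: a real pipeline would more likely use a client library, and the metric name `pipeline_rows_processed_total` and its `layer` label are hypothetical.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter: rows processed per Medallion layer.
text = render_metric(
    "pipeline_rows_processed_total",
    "Rows processed per layer",
    "counter",
    [({"layer": "bronze"}, 1000), ({"layer": "silver"}, 800)],
)
print(text)
```

Grafana can then chart such a series, for example the ratio of Silver to Bronze rows as a rough indicator of how much dirty data was filtered out.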

✅ Best Practices

  • Error handling & retries

  • Comprehensive logging

  • Health checks for services

  • Docker containerization
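
Error handling with retries and logging can be sketched as a small helper with exponential backoff. This is a generic illustration, not the project's implementation: in an Airflow deployment, retries are usually configured on the task itself, and the helper name `run_with_retries` and its parameters are hypothetical.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure retry up to `retries` times with exponential backoff.

    `sleep` is injectable so the backoff can be observed without actually waiting.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                logger.error("giving up after %d attempts: %s", attempt + 1, exc)
                raise
            delay = base_delay * (2 ** attempt)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay
            )
            sleep(delay)


# Usage: a task that fails twice with a transient error, then succeeds.
state = {"calls": 0}


def flaky_task():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


recorded_delays = []
result = run_with_retries(flaky_task, retries=3, base_delay=1.0, sleep=recorded_delays.append)
print(result, state["calls"], recorded_delays)
```

Retrying is only safe because the tasks are idempotent; a non-idempotent task retried this way could write the same data twice.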

Stack: Apache Airflow · DuckDB · Delta Lake · Trino · Prometheus
Project Info

Published on Nov 25, 2025
View on GitHub