E2E Real-Time Data Pipeline
Real-time data pipeline with Kafka, Flink, Iceberg, Trino, and Superset.

About this project
Overview
This project demonstrates a real-time end-to-end (E2E) data pipeline designed to handle clickstream data. It shows how to ingest, process, store, query, and visualize streaming data using open-source tools, all containerized with Docker for easy deployment.
Technologies Used:
Data Ingestion: Apache Kafka
Stream Processing: Apache Flink
Object Storage: MinIO (S3-compatible)
Data Lake Table Format: Apache Iceberg
Query Engine: Trino
Visualization: Apache Superset
Flow
A clickstream data generator simulates real-time user events and pushes them to a Kafka topic.
Apache Flink processes Kafka streams and writes clean data to Iceberg tables stored on MinIO.
Trino connects to Iceberg for querying the processed data.
Apache Superset visualizes the data by connecting to Trino.
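
The first step of the flow can be sketched roughly as follows. This is a minimal illustration, not the project's actual generator: the event fields (`event_id`, `user_id`, `event_type`, `page`, `timestamp`), the value lists, and the `publish` helper are all assumptions. The `send` argument stands in for whatever a Kafka producer client would expose, so the sketch runs without a broker.

```python
import json
import random
import uuid
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical event vocabulary; the project's real schema is not shown here.
EVENT_TYPES = ["page_view", "click", "add_to_cart", "purchase"]
PAGES = ["/", "/products", "/cart", "/checkout"]


def generate_events(count: int, seed: Optional[int] = None) -> Iterator[dict]:
    """Yield `count` simulated clickstream events."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {
            "event_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "user_id": f"user_{rng.randint(1, 100)}",
            "event_type": rng.choice(EVENT_TYPES),
            "page": rng.choice(PAGES),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }


def publish(events: Iterable[dict], send: Callable[[bytes], None]) -> int:
    """Serialize each event as UTF-8 JSON and hand it to `send`.

    `send` is a placeholder for a Kafka producer call bound to a topic.
    Returns the number of events passed to `send`.
    """
    sent = 0
    for event in events:
        send(json.dumps(event).encode("utf-8"))
        sent += 1
    return sent


if __name__ == "__main__":
    # Print instead of producing to Kafka, so the sketch is runnable as-is.
    publish(generate_events(3, seed=42), lambda payload: print(payload.decode()))
```

Keeping the transport behind a plain callable means the same generator can be pointed at a real producer, a file, or a test buffer without changes.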
Key Features
Real-Time Data Processing
Stream processing with Apache Flink.
Clickstream events are transformed and filtered in real time.
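
To make "transformed and filtered" concrete, here is a plain-Python sketch of the kind of per-event logic a Flink job might apply. It does not use the Flink API, and the rules are assumptions for illustration: the required fields, the lowercase normalization of `event_type`, and the `"unknown"` default for `page` are hypothetical, not taken from the project.

```python
import json
from typing import Iterable, Iterator, Optional

# Assumed minimum schema; events missing any of these fields are dropped.
REQUIRED_FIELDS = ("user_id", "event_type", "timestamp")


def clean_event(raw: bytes) -> Optional[dict]:
    """Parse one raw message; return a normalized event, or None to drop it."""
    try:
        event = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None  # filter: malformed payload
    if not isinstance(event, dict):
        return None  # filter: valid JSON but not an object
    if any(not event.get(field) for field in REQUIRED_FIELDS):
        return None  # filter: missing or empty required field
    return {
        "user_id": str(event["user_id"]),
        "event_type": str(event["event_type"]).strip().lower(),
        "page": event.get("page") or "unknown",
        "timestamp": event["timestamp"],
    }


def process(stream: Iterable[bytes]) -> Iterator[dict]:
    """Apply clean_event to every message and keep only the valid ones."""
    for raw in stream:
        event = clean_event(raw)
        if event is not None:
            yield event
```

In a streaming job the same two steps would typically map onto a map-style operator followed by a filter, with the surviving records written to the Iceberg sink.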
Modern Data Lakehouse
Data is stored as Apache Iceberg tables on MinIO, an S3-compatible object store. Iceberg supports schema evolution and time travel.
Fast SQL Analytics
Trino provides fast, distributed SQL queries on Iceberg data.
Interactive Dashboards
Apache Superset delivers real-time visual analytics.
Fully Containerized Setup
All services run as Docker containers and are started together with Docker Compose, which simplifies deployment.