Building a Real-Time E-Commerce Pipeline with Kafka & ClickHouse
- LK Rajat
- Jun 22, 2025
- 4 min read
Updated: Jun 23, 2025
What if you could trace every click, every cart addition, and every order live across your entire platform? In today’s data-driven world, streaming analytics isn’t just a “nice to have” - it’s table stakes. In this post, I’ll walk through how I built a full end-to-end e-commerce order pipeline using Apache Kafka for real-time event streaming and ClickHouse for sub-second analytical queries. You’ll learn how to simulate order lifecycles, funnel them through Kafka, and land them in ClickHouse - unlocking insights as orders happen, not hours later.
You can check out the project on GitHub here: https://github.com/rajatmohan22/Ecommerce-Piepleline
Overview
I built a minimal e-commerce simulation to demonstrate:
Event generation (order lifecycle statuses) via Kafka
Stream consumption and storage in ClickHouse
Real-time analytics on orders
This pipeline can be extended to track inventory, payments, user behavior, and more.
🔍 1. Why This Architecture?
Decoupling & Scalability: Kafka decouples producers (your application emitting order events) from consumers (your analytics layer), letting each scale independently.
High-Velocity Ingestion: ClickHouse’s MergeTree engine can ingest millions of rows per second, making it ideal for time-series and event data.
Real-Time Visibility: By streaming orders into ClickHouse as they occur, dashboards and alerts can reflect live purchase behavior.
Flow of events
[sendEvent.js] ──► Kafka Topic “orders” ──► [consumeAndInsert.js] ──► ClickHouse ──► Analytics

Producer (sendEvent.js)
Simulates an order lifecycle: OrderPlaced → PaymentProcessed → Packed → Shipped → Delivered
Emits each status as a JSON message to Kafka
Kafka Cluster
Acts as a durable, distributed buffer
Decouples producers from consumers
Guarantees ordered, at-least-once delivery of events
Consumer (consumeAndInsert.js)
Listens for new messages on the orders topic
Parses JSON payloads
Inserts into ClickHouse via JSONEachRow for batch-style high throughput
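To make the JSONEachRow step concrete, here is a minimal sketch of the framing: ClickHouse’s JSONEachRow input format expects one JSON object per line, so a batch of parsed Kafka messages becomes a newline-joined string. The @clickhouse/client usage in the comment is an assumed shape, not copied from consumeAndInsert.js:

```javascript
// Turn a batch of parsed Kafka messages into the newline-delimited JSON body
// that ClickHouse's JSONEachRow input format expects.
function toJSONEachRow(rows) {
  return rows.map((row) => JSON.stringify(row)).join('\n');
}

// With @clickhouse/client, the same batch would be inserted roughly like
// (assumed usage; the client handles the JSONEachRow serialization itself):
//   await client.insert({
//     table: 'ecommerce.orders',
//     values: rows,
//     format: 'JSONEachRow',
//   });
```

Batching several messages per insert, rather than inserting one row at a time, is what lets MergeTree sustain high throughput.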
ClickHouse
Columnar storage engine optimized for OLAP
MergeTree engine for high-speed inserts and fast queries on time-series data
Supports SQL syntax; instantly available over HTTP
🛠️ Tech Stack
| Component | Technology | Role |
| --- | --- | --- |
| Event Broker | Apache Kafka | Buffer and distribute order events |
| Producer | Node.js + kafkajs | Simulate & publish order-lifecycle messages to Kafka |
| Consumer | Node.js + @clickhouse/client | Read Kafka events & insert into ClickHouse |
| Analytics DB | ClickHouse | Store, index, and query order data at high speed |
| Orchestration | Docker & Docker Compose | Stand up Kafka, ZooKeeper, and ClickHouse in one command |
Project Structure
Ecommerce-Pipeline/
├── clickhouse-init/
│ └── orders.sql ← Schema initialization for ClickHouse
├── docker-compose.yaml ← Docker: Kafka, ZooKeeper, ClickHouse
├── sendEvent.js ← Kafka producer: simulates order flow
├── consumeAndInsert.js ← Kafka consumer: writes to ClickHouse
└── README.md ← Project overview & setup instructions
ClickHouse Initialization
In the clickhouse-init/ directory we include a simple SQL script, orders.sql, that bootstraps our analytics schema on container startup:
```sql
CREATE DATABASE IF NOT EXISTS ecommerce;

CREATE TABLE IF NOT EXISTS ecommerce.orders (
    id UUID,
    user_id String,
    status String,
    created_at DateTime
) ENGINE = MergeTree
ORDER BY (created_at);
```

When Docker Compose brings up the ClickHouse service, this script runs automatically, ensuring our table is ready to receive streaming data.
2. Bringing Up the Stack
With your schema in place, start the entire pipeline in one command:
```shell
docker-compose up -d
```

ZooKeeper and Kafka form the backbone of our event bus.
ClickHouse launches with the ecommerce.orders table pre-created.
All containers are networked together so that producers, consumers, and ClickHouse communicate seamlessly.
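For reference, a minimal sketch of what such a compose file could contain. The service names, image tags, and ports here are assumptions for illustration, not the project’s actual docker-compose.yaml; the /docker-entrypoint-initdb.d mount is the standard ClickHouse image convention that triggers the orders.sql bootstrap on first startup:

```yaml
# Illustrative sketch only - images, ports, and env values are assumptions.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  clickhouse:
    image: clickhouse/clickhouse-server:24.3
    ports: ["8123:8123", "9000:9000"]
    volumes:
      # SQL scripts here run automatically on first startup,
      # creating the ecommerce.orders table from orders.sql.
      - ./clickhouse-init:/docker-entrypoint-initdb.d
```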
3. Simulating Real Order Traffic
Next, run the producer to emit a realistic order lifecycle:
```shell
node sendEvent.js
```

You’ll see logs like:

```
Kafka Connected.
Order status sent: OrderPlaced
Order status sent: PaymentProcessed
Order status sent: Delivered
...
```

Each status transition is a separate Kafka message, complete with timestamp and payload fields.
4. Streaming into ClickHouse
In a separate shell, fire up the consumer:
```shell
node consumeAndInsert.js
```

The consumer subscribes to the orders topic, parses each JSON message, and performs a JSONEachRow insert into ClickHouse:

```
Consumer connected to Kafka
Received order: {"id":"…","user_id":"user_123","status":"Packed",…}
Order inserted into ClickHouse: {"id":"…","user_id":"user_123",…}
```
Thanks to ClickHouse’s high-throughput MergeTree engine, even thousands of events per second can flow through without dropping a beat.
Querying in Near-Real Time
Now that data is streaming in, open a ClickHouse client or UI and run ad-hoc SQL:
```sql
-- Count orders by status in the last hour
SELECT
    status,
    count(*) AS total
FROM ecommerce.orders
WHERE created_at > now() - INTERVAL 1 HOUR
GROUP BY status;

-- Average time from OrderPlaced to Delivered
SELECT
    avg(toUnixTimestamp(delivered_at) - toUnixTimestamp(placed_at)) AS avg_delivery_seconds
FROM
(
    SELECT
        anyIf(created_at, status = 'OrderPlaced') AS placed_at,
        anyIf(created_at, status = 'Delivered') AS delivered_at
    FROM ecommerce.orders
    GROUP BY id
    -- skip orders that haven't been delivered yet
    HAVING delivered_at > placed_at
);
```

Both of these return results in milliseconds, powering dashboards that update the moment new events arrive.
Extending & Optimizing
Materialized Views: Pre-aggregate hot metrics (e.g., per-minute order counts) into dedicated tables for ultra-fast dashboards.
Partitioning by Date: Add PARTITION BY toYYYYMM(created_at) to the table definition to accelerate large-range scans.
Schema Evolution: Add columns like item_count or total_amount and backfill historical data via Kafka “replay” for richer analytics.
Real-Time Enrichment: Integrate with Kafka Streams or ksqlDB to join order events with user profiles or inventory data before ingestion.
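The first two of these extensions can be sketched in DDL. The column names follow orders.sql from earlier, but the table and view names (orders_v2, orders_per_minute) are illustrative, not part of the project:

```sql
-- Partition by month to accelerate large-range scans (a re-created variant
-- of the original table, since MergeTree partitioning is set at creation):
CREATE TABLE IF NOT EXISTS ecommerce.orders_v2 (
    id UUID,
    user_id String,
    status String,
    created_at DateTime
) ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at);

-- Pre-aggregate per-minute order counts for ultra-fast dashboards;
-- SummingMergeTree collapses rows with the same key by summing `total`:
CREATE MATERIALIZED VIEW IF NOT EXISTS ecommerce.orders_per_minute
ENGINE = SummingMergeTree
ORDER BY (minute, status)
AS SELECT
    toStartOfMinute(created_at) AS minute,
    status,
    count() AS total
FROM ecommerce.orders
GROUP BY minute, status;
```

Dashboards then query the small orders_per_minute table instead of scanning raw events.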
Why ClickHouse for Streaming Analytics?
Sub-Second Aggregations: Columnar reads and vectorized execution mean even complex GROUP BY queries finish in under a second.
Massive Write Scale: MergeTree can sustain millions of inserts per second, matching Kafka’s throughput for true end-to-end streaming.
Familiar SQL Interface: Analysts and data scientists dive straight in with standard SQL - no proprietary DSL or ETL frameworks required.
Cost Efficiency: High compression ratios and CPU-efficient execution keep infrastructure costs low compared to traditional OLAP systems.
Conclusion
By coupling Kafka’s battle-tested event streaming with ClickHouse’s blazing-fast analytics engine, you can build a real-time e-commerce pipeline capable of handling production-scale workloads. Whether you’re monitoring order health, detecting fraud, or driving live dashboards, this architecture delivers the immediacy and scale modern applications demand.