Building a Real-Time E-Commerce Pipeline with Kafka & ClickHouse
- LK Rajat
- Jun 22, 2025
- 4 min read
Updated: Jun 23, 2025
What if you could trace every click, every cart addition, and every order live across your entire platform? In today’s data-driven world, streaming analytics isn’t just a “nice to have” - it’s table stakes. In this post, I’ll walk through how I built a full end-to-end e-commerce order pipeline using Apache Kafka for real-time event streaming and ClickHouse for sub-second analytical queries. You’ll learn how to simulate order lifecycles, funnel them through Kafka, and land them in ClickHouse - unlocking insights as orders happen, not hours later.
You can check out the project on GitHub here: https://github.com/rajatmohan22/Ecommerce-Piepleline
Overview
I built a minimal e-commerce simulation to demonstrate:
Event generation (order lifecycle statuses) via Kafka
Stream consumption and storage in ClickHouse
Real-time analytics on orders
This pipeline can be extended to track inventory, payments, user behavior, and more.
🔍 1. Why This Architecture?
Decoupling & Scalability: Kafka decouples producers (your application emitting order events) from consumers (your analytics layer), letting each scale independently.
High-Velocity Ingestion: ClickHouse’s MergeTree engine can ingest millions of rows per second, making it ideal for time-series and event data.
Real-Time Visibility: By streaming orders into ClickHouse as they occur, dashboards and alerts can reflect live purchase behavior.
Flow of events
[sendEvent.js] ──► Kafka Topic “orders” ──► [consumeAndInsert.js] ──► ClickHouse ──► Analytics

Producer (sendEvent.js)
Simulates an order lifecycle: OrderPlaced → PaymentProcessed → Packed → Shipped → Delivered
Emits each status as a JSON message to Kafka
Kafka Cluster
Acts as a durable, distributed buffer
Decouples producers from consumers
Guarantees ordered, at-least-once delivery of events
Consumer (consumeAndInsert.js)
Listens for new messages on the orders topic
Parses JSON payloads
Inserts into ClickHouse via JSONEachRow for batch-style high throughput
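To make the JSONEachRow step concrete, here is a minimal sketch of the framing: ClickHouse’s JSONEachRow input format expects one JSON object per line, so a batch of parsed Kafka messages becomes a newline-joined string. The @clickhouse/client usage in the comment is an assumed shape, not copied from consumeAndInsert.js:

```javascript
// Turn a batch of parsed Kafka messages into the newline-delimited JSON body
// that ClickHouse's JSONEachRow input format expects.
function toJSONEachRow(rows) {
  return rows.map((row) => JSON.stringify(row)).join('\n');
}

// With @clickhouse/client, the same batch would be inserted roughly like
// (assumed usage; the client handles the JSONEachRow serialization itself):
//   await client.insert({
//     table: 'ecommerce.orders',
//     values: rows,
//     format: 'JSONEachRow',
//   });
```

Batching several messages per insert, rather than inserting one row at a time, is what lets MergeTree sustain high throughput.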
ClickHouse
Columnar storage engine optimized for OLAP
MergeTree engine for high-speed inserts and fast queries on time-series data
Supports SQL syntax; instantly available over HTTP
🛠️ Tech Stack
| Component | Technology | Role |
| --- | --- | --- |
| Event Broker | Apache Kafka | Buffer and distribute order events |
| Producer | Node.js + kafkajs | Simulate & publish order-lifecycle messages to Kafka |
| Consumer | Node.js + @clickhouse/client | Read Kafka events & insert into ClickHouse |
| Analytics DB | ClickHouse | Store, index, and query order data at high speed |
| Orchestration | Docker & Docker Compose | Stand up Kafka, ZooKeeper, and ClickHouse in one command |
Project Structure
Ecommerce-Pipeline/
├── clickhouse-init/
│ └── orders.sql ← Schema initialization for ClickHouse
├── docker-compose.yaml ← Docker: Kafka, ZooKeeper, ClickHouse
├── sendEvent.js ← Kafka producer: simulates order flow
├── consumeAndInsert.js ← Kafka consumer: writes to ClickHouse
└── README.md ← Project overview & setup instructions
ClickHouse Initialization
In the clickhouse-init/ directory we include a simple SQL script, orders.sql, that bootstraps our analytics schema on container startup:
```sql
CREATE DATABASE IF NOT EXISTS ecommerce;

CREATE TABLE IF NOT EXISTS ecommerce.orders (
    id UUID,
    user_id String,
    status String,
    created_at DateTime
) ENGINE = MergeTree
ORDER BY (created_at);
```

When Docker Compose brings up the ClickHouse service, this script runs automatically, ensuring our table is ready to receive streaming data.
2. Bringing Up the Stack
With your schema in place, start the entire pipeline in one command:
```shell
docker-compose up -d
```

ZooKeeper and Kafka form the backbone of our event bus.
ClickHouse launches with the ecommerce.orders table pre-created.
All containers are networked together so that producers, consumers, and ClickHouse communicate seamlessly.
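For reference, a minimal sketch of what such a compose file could contain. The service names, image tags, and ports here are assumptions for illustration, not the project’s actual docker-compose.yaml; the /docker-entrypoint-initdb.d mount is the standard ClickHouse image convention that triggers the orders.sql bootstrap on first startup:

```yaml
# Illustrative sketch only - images, ports, and env values are assumptions.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  clickhouse:
    image: clickhouse/clickhouse-server:24.3
    ports: ["8123:8123", "9000:9000"]
    volumes:
      # SQL scripts here run automatically on first startup,
      # creating the ecommerce.orders table from orders.sql.
      - ./clickhouse-init:/docker-entrypoint-initdb.d
```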
3. Simulating Real Order Traffic
Next, run the producer to emit a realistic order lifecycle:
```shell
node sendEvent.js
```

You’ll see logs like:

```
Kafka Connected.
Order status sent: OrderPlaced
Order status sent: PaymentProcessed
Order status sent: Delivered
...
```

Each status transition is a separate Kafka message, complete with timestamp and payload fields.
4. Streaming into ClickHouse
In a separate shell, fire up the consumer:
```shell
node consumeAndInsert.js
```

The consumer subscribes to the orders topic, parses each JSON message, and performs a JSONEachRow insert into ClickHouse:

```
Consumer connected to Kafka
Received order: {"id":"…","user_id":"user_123","status":"Packed",…}
Order inserted into ClickHouse: {"id":"…","user_id":"user_123",…}
```
Thanks to ClickHouse’s high-throughput MergeTree engine, even thousands of events per second can flow through without dropping a beat.
Querying in Near-Real Time
Now that data is streaming in, open a ClickHouse client or UI and run ad-hoc SQL:
```sql
-- Count orders by status in the last hour
SELECT
    status,
    count(*) AS total
FROM ecommerce.orders
WHERE created_at > now() - INTERVAL 1 HOUR
GROUP BY status;

-- Average time from OrderPlaced to Delivered
SELECT
    avg(toUnixTimestamp(delivered_at) - toUnixTimestamp(placed_at)) AS avg_delivery_seconds
FROM
(
    SELECT
        anyIf(created_at, status = 'OrderPlaced') AS placed_at,
        anyIf(created_at, status = 'Delivered') AS delivered_at
    FROM ecommerce.orders
    GROUP BY id
    -- skip orders that haven't been delivered yet
    HAVING delivered_at > placed_at
);
```

Both of these return results in milliseconds, powering dashboards that update the moment new events arrive.
Extending & Optimizing
Materialized Views: Pre-aggregate hot metrics (e.g., per-minute order counts) into dedicated tables for ultra-fast dashboards.
Partitioning by Date: Add PARTITION BY toYYYYMM(created_at) to the table definition to accelerate large-range scans.
Schema Evolution: Add columns like item_count or total_amount and backfill historical data via Kafka “replay” for richer analytics.
Real-Time Enrichment: Integrate with Kafka Streams or ksqlDB to join order events with user profiles or inventory data before ingestion.
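The first two of these extensions can be sketched in DDL. The column names follow orders.sql from earlier, but the table and view names (orders_v2, orders_per_minute) are illustrative, not part of the project:

```sql
-- Partition by month to accelerate large-range scans (a re-created variant
-- of the original table, since MergeTree partitioning is set at creation):
CREATE TABLE IF NOT EXISTS ecommerce.orders_v2 (
    id UUID,
    user_id String,
    status String,
    created_at DateTime
) ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at);

-- Pre-aggregate per-minute order counts for ultra-fast dashboards;
-- SummingMergeTree collapses rows with the same key by summing `total`:
CREATE MATERIALIZED VIEW IF NOT EXISTS ecommerce.orders_per_minute
ENGINE = SummingMergeTree
ORDER BY (minute, status)
AS SELECT
    toStartOfMinute(created_at) AS minute,
    status,
    count() AS total
FROM ecommerce.orders
GROUP BY minute, status;
```

Dashboards then query the small orders_per_minute table instead of scanning raw events.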
Why ClickHouse for Streaming Analytics?
Sub-Second Aggregations: Columnar reads and vectorized execution mean even complex GROUP BY queries finish in under a second.
Massive Write Scale: MergeTree can sustain millions of inserts per second, matching Kafka’s throughput for true end-to-end streaming.
Familiar SQL Interface: Analysts and data scientists dive straight in with standard SQL - no proprietary DSL or ETL frameworks required.
Cost Efficiency: High compression ratios and CPU-efficient execution keep infrastructure costs low compared to traditional OLAP systems.
Conclusion
By coupling Kafka’s battle-tested event streaming with ClickHouse’s blazing-fast analytics engine, you can build a real-time e-commerce pipeline capable of handling production-scale workloads. Whether you’re monitoring order health, detecting fraud, or driving live dashboards, this architecture delivers the immediacy and scale modern applications demand.