
Building a Real-Time E-Commerce Pipeline with Kafka & ClickHouse

  • Writer: LK Rajat
  • Jun 22, 2025
  • 4 min read

Updated: Jun 23, 2025

What if you could trace every click, every cart addition, and every order live across your entire platform? In today’s data-driven world, streaming analytics isn’t just a “nice to have” - it’s table stakes. In this post, I’ll show how I built a full end-to-end e-commerce order pipeline using Apache Kafka for real-time event streaming and ClickHouse for sub-second analytical queries. You’ll learn how to simulate order lifecycles, funnel them through Kafka, and land them in ClickHouse - unlocking insights as orders happen, not hours later.

You can check out the project on GitHub here: https://github.com/rajatmohan22/Ecommerce-Piepleline


Overview


I built a minimal e-commerce simulation to demonstrate:

  • Event generation (order lifecycle statuses) via Kafka

  • Stream consumption and storage in ClickHouse

  • Real-time analytics on orders

This pipeline can be extended to track inventory, payments, user behavior, and more.



🔍 1. Why This Architecture?


  • Decoupling & Scalability: Kafka decouples producers (your application emitting order events) from consumers (your analytics layer), letting each scale independently.

  • High-Velocity Ingestion: ClickHouse’s MergeTree engine can ingest millions of rows per second, making it ideal for time-series and event data.

  • Real-Time Visibility: By streaming orders into ClickHouse as they occur, dashboards and alerts can reflect live purchase behavior.


Flow of events


[sendEvent.js] ──► Kafka Topic “orders” ──► [consumeAndInsert.js] ──► ClickHouse ──► Analytics

  1. Producer (sendEvent.js)

    • Simulates an order lifecycle: OrderPlaced → PaymentProcessed → Packed → Shipped → Delivered

    • Emits each status as a JSON message to Kafka

  2. Kafka Cluster

    • Acts as a durable, distributed buffer

    • Decouples producers from consumers

    • Guarantees ordered (per-partition), at-least-once delivery of events

  3. Consumer (consumeAndInsert.js)

    • Listens for new messages on the orders topic

    • Parses JSON payloads

    • Inserts into ClickHouse via JSONEachRow for batch-style high throughput

  4. ClickHouse

    • Columnar storage engine optimized for OLAP

    • MergeTree engine for high-speed inserts and fast queries on time-series data

    • Supports standard SQL; queryable over its HTTP interface out of the box


🛠️ Tech Stack


Component      | Technology                   | Role
---------------|------------------------------|----------------------------------------------------------
Event Broker   | Apache Kafka                 | Buffer and distribute order events
Producer       | Node.js + kafkajs            | Simulate & publish order-lifecycle messages to Kafka
Consumer       | Node.js + @clickhouse/client | Read Kafka events & insert into ClickHouse
Analytics DB   | ClickHouse                   | Store, index, and query order data at high speed
Orchestration  | Docker & Docker Compose      | Stand up Kafka, ZooKeeper, and ClickHouse in one command

Project Structure


Ecommerce-Pipeline/
├── clickhouse-init/
│   └── orders.sql          ← Schema initialization for ClickHouse
├── docker-compose.yaml     ← Docker: Kafka, ZooKeeper, ClickHouse
├── sendEvent.js            ← Kafka producer: simulates order flow
├── consumeAndInsert.js     ← Kafka consumer: writes to ClickHouse
└── README.md               ← Project overview & setup instructions


ClickHouse Initialization


In the clickhouse-init/ directory we include a simple SQL script, orders.sql, that bootstraps our analytics schema on container startup:


CREATE DATABASE IF NOT EXISTS ecommerce;

CREATE TABLE IF NOT EXISTS ecommerce.orders (
    id UUID,
    user_id String,
    status String,
    created_at DateTime
) ENGINE = MergeTree
ORDER BY (created_at);

When Docker Compose brings up the ClickHouse service, this script runs automatically, ensuring our table is ready to receive streaming data.
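
This automatic bootstrap works because the official ClickHouse server image executes any .sql files mounted into /docker-entrypoint-initdb.d. A sketch of the relevant service definition (the exact compose file in the repo may differ):

```
clickhouse:
  image: clickhouse/clickhouse-server:latest
  ports:
    - "8123:8123"   # HTTP interface
    - "9000:9000"   # native TCP interface
  volumes:
    - ./clickhouse-init:/docker-entrypoint-initdb.d
```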


2. Bringing Up the Stack

With your schema in place, start the entire pipeline in one command:


docker-compose up -d

  • ZooKeeper and Kafka form the backbone of our event bus.

  • ClickHouse launches with the ecommerce.orders table pre-created.

  • All containers are networked together so that producers, consumers, and ClickHouse communicate seamlessly.


3. Simulating Real Order Traffic

Next, run the producer to emit a realistic order lifecycle:


node sendEvent.js

You’ll see logs like:

Kafka Connected.
Order status sent: OrderPlaced
Order status sent: PaymentProcessed
Order status sent: Delivered
...

Each status transition is a separate Kafka message, complete with timestamp and payload fields.



4. Streaming into ClickHouse

In a separate shell, fire up the consumer:


node consumeAndInsert.js

The consumer subscribes to the orders topic, parses each JSON message, and performs a JSONEachRow insert into ClickHouse:


Consumer connected to Kafka
Received order: {"id":"…","user_id":"user_123","status":"Packed",…}
Order inserted into ClickHouse: {"id":"…","user_id":"user_123",…}


Thanks to ClickHouse’s high-throughput MergeTree engine, even thousands of events per second can flow through without missing a beat.
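
JSONEachRow is simply newline-delimited JSON: one object per line, one line per row, which is what makes batched inserts cheap. A minimal sketch of that serialization step (the helper name is mine; with @clickhouse/client you would normally hand the row objects to the client’s insert method with format 'JSONEachRow' rather than serializing by hand):

```javascript
// Serialize a batch of rows into ClickHouse's JSONEachRow wire format:
// one JSON object per line, newline-separated.
function toJSONEachRow(rows) {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

const batch = [
  { id: "a1", user_id: "user_123", status: "Packed", created_at: "2025-06-22 10:00:00" },
  { id: "a1", user_id: "user_123", status: "Shipped", created_at: "2025-06-22 12:30:00" },
];

// This string could be POSTed straight to ClickHouse's HTTP interface as:
//   INSERT INTO ecommerce.orders FORMAT JSONEachRow
console.log(toJSONEachRow(batch));
```

Batching a few hundred rows per insert, rather than inserting one row per Kafka message, is what lets the consumer keep pace with the broker.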


Querying in Near-Real Time

Now that data is streaming in, open a ClickHouse client or UI and run ad-hoc SQL:


-- Count orders by status in the last hour
SELECT
    status,
    count(*) AS total
FROM ecommerce.orders
WHERE created_at > now() - INTERVAL 1 HOUR
GROUP BY status;

-- Average time from OrderPlaced to Delivered
SELECT
    avg(toUnixTimestampIfNull(delivered_at, now())
      - toUnixTimestampIfNull(placed_at, now())) AS avg_delivery_seconds
FROM (
    SELECT
        anyIf(created_at, status = 'OrderPlaced') AS placed_at,
        anyIf(created_at, status = 'Delivered') AS delivered_at
    FROM ecommerce.orders
    GROUP BY id
);

Both of these return results in milliseconds, powering dashboards that update the moment new events arrive.


Extending & Optimizing


  • Materialized Views: Pre-aggregate hot metrics (e.g., per-minute order counts) into dedicated tables for ultra-fast dashboards.

  • Partitioning by Date: Add PARTITION BY toYYYYMM(created_at) to the table definition to accelerate large-range scans.

  • Schema Evolution: Add columns like item_count or total_amount and backfill historical data via Kafka “replay” for richer analytics.

  • Real-Time Enrichment: Integrate with Kafka Streams or ksqlDB to join order events with user profiles or inventory data before ingestion.
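
For example, the materialized-view idea from the first bullet might look roughly like this in ClickHouse SQL (table and view names are illustrative; adapt them to your schema):

```
CREATE MATERIALIZED VIEW IF NOT EXISTS ecommerce.orders_per_minute
ENGINE = SummingMergeTree
ORDER BY (minute, status) AS
SELECT
    toStartOfMinute(created_at) AS minute,
    status,
    count() AS orders
FROM ecommerce.orders
GROUP BY minute, status;
```

Because the view is populated on insert, dashboard queries hit a tiny pre-aggregated table instead of scanning raw events.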


Why ClickHouse for Streaming Analytics?


  1. Sub-Second Aggregations: Columnar reads and vectorized execution mean even complex GROUP BY queries finish in under a second.

  2. Massive Write Scale: MergeTree can sustain millions of inserts per second, matching Kafka’s throughput for true end-to-end streaming.

  3. Familiar SQL Interface: Analysts and data scientists dive straight in with standard SQL - no proprietary DSL or ETL frameworks required.

  4. Cost Efficiency: High compression ratios and CPU-efficient execution keep infrastructure costs low compared to traditional OLAP systems.


Conclusion


By coupling Kafka’s battle-tested event streaming with ClickHouse’s blazing-fast analytics engine, you can build a real-time e-commerce pipeline capable of handling production-scale workloads. Whether you’re monitoring order health, detecting fraud, or driving live dashboards, this architecture delivers the immediacy and scale modern applications demand.

 
 
 
