Design a Real-Time Messaging System

Hard · 22 min · Backend System Design
Key Topics
real-time messaging, WebSockets, Cassandra, Kafka, Redis, chat, distributed systems

How to Design a Real-Time Messaging System

Designing a real-time messaging system is one of the most consistently asked questions across the entire system design interview landscape. Meta asks it for WhatsApp and Messenger. Google asks it for Chat. Amazon asks it. Microsoft asks it for Teams. Slack asks a version of it about their own platform. Discord asks it. It's everywhere — and for good reason.

On the surface it sounds like a simple problem. User A sends a message. User B receives it. But the moment you start asking how, the complexity floods in. How does a message sent to a server in Virginia reach a user connected to a server in Frankfurt in under 200ms? What happens if the recipient is offline? What if 500 people are in the same group chat? How do you guarantee a message is never lost — but also never delivered twice? How do you show who's online without querying a database for every user every second?

These are the questions that turn a straightforward-sounding problem into a genuine distributed systems challenge. This guide covers all of them, in the kind of back-and-forth you'd actually have in the interview room.


Step 1: Clarify the Scope

Interviewer: Design a real-time messaging system.

Candidate: Before I start — a few questions. Are we designing 1:1 messaging only, or group chats too? What's the scale — WhatsApp-scale with billions of users, or something more like a company internal tool with millions? Do we need media support — images, videos, documents — or text only to start? Do we need message delivery receipts (sent, delivered, read)? What about online/offline presence indicators? And is end-to-end encryption a hard requirement?

Interviewer: 1:1 and group chats up to 500 members. WhatsApp-scale — assume 2 billion registered users, 500 million daily active users. Text and media. Yes to delivery receipts and presence. End-to-end encryption is a nice-to-have — mention it but don't let it dominate the design.

Candidate: Perfect. That gives me a clear scope. Let me work through requirements and numbers, then walk through the architecture from the connection layer to the database.


Requirements

Functional

  • 1:1 messaging between any two users
  • Group messaging with up to 500 members per group
  • Message delivery receipts: sent (✓), delivered (✓✓), read (✓✓ blue)
  • Push notifications for offline users (mobile and desktop)
  • Online/offline presence with "last seen" timestamps
  • Message history — users can scroll back through conversation history
  • Media messages — images, video, documents
  • Multi-device sync — a user's messages are consistent across phone, tablet, and desktop

Non-Functional

  • Low latency — message delivery under 500ms for online users
  • High availability — the system must tolerate server failures without data loss
  • Durability — a sent message must never be permanently lost
  • At-least-once delivery with client-side deduplication (no message is silently dropped)
  • Eventual consistency for ordering — messages in a conversation appear in the same order for all participants

Back-of-the-Envelope Estimates

Interviewer: Walk me through the numbers.

Candidate: Let me work through it.

plaintext
Registered users:           2 billion
Daily Active Users (DAU):   500 million
Messages per DAU per day:   40 (mix of sending and receiving)
Total messages per day:     500M × 40 = 20 billion messages/day
 
Messages per second (average):
  20B / 86,400s ≈ 230,000 messages/second
 
Messages per second (peak, 3× average):
  ~700,000 messages/second
 
Average message size (text): 100 bytes
Storage per day (text):     20B × 100 bytes = ~2 TB/day
  With replication (3×):    ~6 TB/day
  Per year:                 ~2.2 PB
 
Concurrent WebSocket connections:
  Assume 20% of DAU online simultaneously
  = 100 million concurrent connections
 
  At 50,000 connections per chat server:
  = ~2,000 chat servers needed
 
Bandwidth:
  Incoming: 230K × 100B = ~23 MB/sec
  Outgoing: Higher due to fan-out (group chats, multi-device)
  Assume 5× fan-out on average: ~115 MB/sec outbound

Two things stand out from these numbers. First, 100 million concurrent WebSocket connections is a stateful problem — you can't just horizontally scale with round-robin load balancing and ignore which server a user is on. How you route a message to the right connection is the central architectural challenge. Second, 6 TB of storage per day with replication is significant but well within Cassandra's capabilities — the real challenge is write throughput at 230,000 messages per second, not raw capacity.
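The arithmetic above can be re-derived in a few lines. Every input here is one of the stated assumptions (500M DAU, 40 messages/user/day, 100-byte messages, 20% concurrency, 50K connections per server), not an independent fact.

```python
# Back-of-the-envelope check using the assumptions stated in the table above.
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY        # 20 billion
avg_mps = msgs_per_day // SECONDS_PER_DAY         # ~230K messages/second
peak_mps = 3 * avg_mps                            # ~700K messages/second

MSG_BYTES = 100
daily_tb = msgs_per_day * MSG_BYTES / 1e12        # ~2 TB/day raw
replicated_tb = 3 * daily_tb                      # ~6 TB/day with 3x replication

concurrent = DAU // 5                             # 20% online: 100M connections
chat_servers = concurrent // 50_000               # ~2,000 chat servers
```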


High-Level Architecture

plaintext
                     ┌──────────────────────────────┐
                     │   Clients                     │
                     │   (Web, iOS, Android)         │
                     └──────────┬───────────────────┘
                                │ WebSocket or HTTPS
                     ┌──────────▼───────────────────┐
                     │   Load Balancer               │
                     │   (L4 — sticky sessions       │
                     │    for WS, L7 for REST)       │
                     └──────────┬───────────────────┘

         ┌──────────────────────┼───────────────────────┐
         │                      │                       │
┌────────▼────────┐   ┌─────────▼───────┐   ┌──────────▼───────┐
│  Chat Server 1  │   │  Chat Server 2  │   │  Chat Server N   │
│  (stateful:     │   │  (stateful)     │   │  (stateful)      │
│   holds WS      │   │                 │   │                  │
│   connections)  │   │                 │   │                  │
└────────┬────────┘   └─────────┬───────┘   └──────────┬───────┘
         │                      │                       │
         └──────────────────────┼───────────────────────┘
                                │ publish / subscribe
                     ┌──────────▼───────────────────┐
                     │   Redis Pub/Sub Cluster       │
                     │   (cross-server message       │
                     │    routing)                   │
                     └──────────┬───────────────────┘

              ┌─────────────────┼──────────────────┐
              │                 │                  │
   ┌──────────▼──────┐  ┌───────▼──────┐  ┌───────▼──────────┐
   │   Message       │  │   Session    │  │   Notification   │
   │   Service       │  │   Service    │  │   Service        │
   │ (persist,       │  │ (who's on    │  │ (APNs, FCM,      │
   │  deliver)       │  │  which srv)  │  │  email)          │
   └──────────┬──────┘  └──────────────┘  └──────────────────┘

   ┌──────────▼──────────────────────────┐
   │         Data Layer                  │
   │  Cassandra (messages)               │
   │  PostgreSQL (users, groups)         │
   │  Redis (sessions, presence, cache)  │
   │  S3/GCS (media files)               │
   └─────────────────────────────────────┘

The Core Challenge: Routing Messages Across Servers

This is the question most candidates don't answer clearly, and it's the first thing an experienced interviewer will probe.

What it is: each chat server maintains persistent WebSocket connections with a subset of users. A user connected to Server 1 sends a message to a user connected to Server 7. Server 1 can't directly push to Server 7's connection — it doesn't know about it. How does the message get there?

Interviewer: Alice is connected to Chat Server 1 in Virginia. Bob is connected to Chat Server 5 in Frankfurt. Alice sends Bob a message. Walk me through exactly what happens.

Candidate: Here's the full sequence:

plaintext
1. Alice's client sends message via WebSocket to Chat Server 1
   Payload: { clientMsgId: "uuid-abc", to: "bob", content: "Hey!" }
 
2. Chat Server 1 → Message Service:
   a. Generate a Snowflake ID for the message (globally unique, time-ordered)
   b. Persist the message to Cassandra immediately
      → If the server crashes after this point, the message is safe
   c. ACK back to Alice's client: "Message received" (the single-tick ✓)
 
3. Message Service looks up Bob's location in the Session Service:
   Redis: GET session:bob → "Chat Server 5"
 
4. Message Service publishes to Redis Pub/Sub:
   Channel: "user:bob", Payload: { messageId, content, from: "alice", ... }
 
5. Chat Server 5 is subscribed to "user:bob"
   → It receives the published message
   → Pushes it to Bob's WebSocket connection
 
6. Bob's client receives the message and sends a DELIVERED ACK
   → Chat Server 5 → Message Service → updates message status in Cassandra
   → Message Service publishes DELIVERED event to Alice's channel
   → Chat Server 1 pushes the double-tick ✓✓ to Alice's client
 
7. Bob opens the conversation (READ event):
   → Similar path: Bob's client sends READ ACK
   → Message Service updates status
   → Blue double-tick pushed to Alice

The key design principle here: persist first, then route. If the server crashes between step 2b and step 3, the message is already in Cassandra. When Bob reconnects, the Message Service delivers all undelivered messages. Nothing is lost.
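The persist-first principle can be sketched with in-memory stand-ins for Cassandra, the Session Service, and Redis Pub/Sub. Every name and data shape here is illustrative, not a real client API — the point is the ordering: write, ACK, then route.

```python
import uuid

# In-memory stand-ins for the real stores.
cassandra = {}                         # message_id -> row (the messages table)
sessions = {"bob": "chat-server-5"}    # Session Service: user -> chat server
pubsub_log = []                        # records (channel, message_id) publishes

def next_snowflake_id(_counter=[0]):
    _counter[0] += 1
    return _counter[0]                 # stand-in for a time-ordered Snowflake ID

def handle_send(sender, payload):
    # Step 2: assign a server-side ID and persist BEFORE any routing.
    msg_id = next_snowflake_id()
    cassandra[msg_id] = {"from": sender, **payload, "status": "sent"}
    # The single-tick ACK to the sender would be pushed over their WebSocket here.
    # Steps 3-4: locate the recipient's server, publish on their channel.
    server = sessions.get(payload["to"])
    if server is not None:
        pubsub_log.append((f"user:{payload['to']}", msg_id))
    return msg_id  # offline recipients are picked up on reconnect sync

msg_id = handle_send("alice", {"clientMsgId": str(uuid.uuid4()),
                               "to": "bob", "content": "Hey!"})
```

If the process died right after the `cassandra[msg_id] = ...` line, nothing is lost: the row exists, and Bob's reconnect sync delivers it.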

Interviewer: Why Redis Pub/Sub specifically, rather than Kafka or direct server-to-server calls?

Candidate: Direct server-to-server calls would require every chat server to maintain a full mesh of connections to every other server — at 2,000 servers, that's nearly 2 million pairwise connections. That doesn't scale.

Kafka is a great option for durable, at-least-once delivery with replay. But for the real-time delivery path, Kafka's latency is higher than Redis Pub/Sub — we're talking milliseconds vs. sub-millisecond for Redis. For a system where sub-500ms delivery is the goal, that matters.

Redis Pub/Sub is the right tool for this specific pattern: route a message from one publisher to one subscriber (the recipient's chat server) with minimal latency. The durability concern is handled by Cassandra — if Redis drops a message (which it can, being fire-and-forget), the recipient gets it on reconnect from Cassandra. Redis handles the real-time path; Cassandra handles durability.

The cross-server routing question — "Alice is on Server 1, Bob is on Server 7, what happens?" — is almost guaranteed to come up at Meta, Slack, and Discord. Getting the full sequence fluent (persist → session lookup → Pub/Sub publish → push to connection → ACK back) takes a few dry runs out loud. Mockingly.ai has messaging system design simulations where this routing sequence is a standard opening probe.


WebSocket Connection Management

What WebSockets are: WebSockets provide a persistent, full-duplex TCP connection between client and server. Once established via an HTTP upgrade handshake, either side can send data at any time without the overhead of a new HTTP request. They're the foundation of any real-time messaging system.

Interviewer: Why WebSockets instead of long polling or SSE?

Candidate: Long polling is essentially a hack — the client sends an HTTP request and the server holds it open until data is available, then the client immediately sends another. The overhead of repeated HTTP connection setup at 100 million concurrent users is enormous, and message delivery latency is limited by the polling interval. Server-Sent Events are one-directional — the server can push, but the client must use separate HTTP requests to send. For a bidirectional messaging system, WebSockets are the right tool.

Interviewer: How do you load balance WebSocket connections?

Candidate: WebSocket connections are stateful and persistent — they stay connected to one server for the duration of the session. The load balancer only matters at connection time; after that, the connection is pinned to one server.

We use an L4 (TCP-level) load balancer with connection distribution at handshake time. Once connected, the client stays on that server. The Session Service in Redis records this mapping: session:{user_id} → chat_server_id, so any other component can find where a user is connected.

When a chat server dies, all its WebSocket connections drop. Clients detect this via the heartbeat timeout (we send periodic pings; if the pong doesn't come back, the connection is dead). They automatically reconnect — the load balancer routes them to a live server, and they re-register with the Session Service.

Interviewer: What's the heartbeat interval?

Candidate: Typically 30 seconds. This balances two concerns: too frequent and we're burning battery on mobile clients with unnecessary keep-alive traffic; too infrequent and we take too long to detect a dead connection. 30 seconds is a standard industry default. On mobile, platforms like iOS aggressively suspend background apps — so for native mobile clients, we rely more on FCM/APNs push notifications than persistent WebSocket connections when the app is backgrounded.
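The server side of the heartbeat scheme is simple bookkeeping: remember the last pong per connection and reap anything silent for two intervals. This is a minimal sketch assuming the 30-second interval above; the connection IDs and function names are invented for illustration.

```python
import time

HEARTBEAT_INTERVAL = 30                 # seconds, per the discussion above
DEAD_AFTER = 2 * HEARTBEAT_INTERVAL     # two missed heartbeats = dead

last_pong = {}                          # connection_id -> last pong timestamp

def on_pong(conn_id, now=None):
    last_pong[conn_id] = time.time() if now is None else now

def reap_dead(now):
    """Return and forget connections silent for more than DEAD_AFTER."""
    dead = [c for c, t in last_pong.items() if now - t > DEAD_AFTER]
    for c in dead:
        # In the real system: close the socket and remove the
        # Session Service mapping so routing stops targeting this server.
        del last_pong[c]
    return dead

on_pong("alice-iphone", now=0)
on_pong("bob-web", now=50)
dead = reap_dead(now=65)   # alice has been silent for 65s > 60s
```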


Database Design: Cassandra for Messages

This is where most candidates give a vague answer. Interviewers want to see schema design and reasoning.

Why Cassandra and not PostgreSQL for messages:

PostgreSQL is excellent for structured relational data — users, groups, permissions. But for messages, the access pattern is: write billions of rows per day, read them back in time-ordered batches per conversation. PostgreSQL's B-tree indexes were not designed for this. Every write to a high-volume table triggers B-tree updates that become increasingly expensive at scale. Cassandra's LSM-tree (Log-Structured Merge tree) storage is append-only — writes are sequential and blazingly fast.

Interviewer: Show me the Cassandra schema for messages.

Candidate: Here's the core message table:

sql
CREATE TABLE messages (
    chat_id     TEXT,          -- conversation identifier (hashed user IDs for 1:1)
    message_id  BIGINT,        -- Snowflake ID: globally unique, time-ordered
    sender_id   TEXT,
    content     TEXT,          -- null for media messages
    media_url   TEXT,          -- S3/GCS URL, null for text
    msg_type    TEXT,          -- 'text', 'image', 'video', 'document'
    status      TEXT,          -- 'sent', 'delivered', 'read'
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

The partition key is chat_id. All messages in a conversation are stored together on the same Cassandra node. Reading a conversation's history is a single partition scan — no cross-node joins needed, which is exactly the access pattern Cassandra was built for.

The clustering key is message_id in descending order, so the most recent messages come back first without an additional sort step.

Interviewer: Why message_id as a Snowflake ID instead of a timestamp?

Candidate: Two reasons. First, timestamp collisions. Even at millisecond granularity, in a high-volume system with multiple servers, two messages can have the same timestamp. In Cassandra, the primary key must be unique — a collision would silently overwrite a message. Second, Snowflake IDs are globally unique 64-bit integers that encode a timestamp in their most significant bits, making them naturally time-sortable without separate sort logic. Discord made exactly this choice for their Cassandra schema.

A Snowflake ID is a 64-bit integer composed of:

plaintext
| 41 bits (ms timestamp) | 10 bits (machine ID) | 12 bits (sequence) |

This gives us ~4,096 unique IDs per millisecond per machine — no collisions, naturally ordered.
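The bit layout above can be packed with plain shifts. This toy composer only shows the packing; a real generator also handles clock skew and per-millisecond sequence exhaustion.

```python
def snowflake(ms_since_epoch, machine_id, sequence):
    """Pack the three fields per the layout: 41 | 10 | 12 bits."""
    assert 0 <= machine_id < 1024 and 0 <= sequence < 4096
    return (ms_since_epoch << 22) | (machine_id << 12) | sequence

a = snowflake(1_000, machine_id=3, sequence=0)
b = snowflake(1_000, machine_id=3, sequence=1)  # same ms: sequence breaks the tie
c = snowflake(1_001, machine_id=7, sequence=0)  # later ms always sorts after
```

Because the timestamp occupies the most significant bits, plain integer comparison gives time order — which is exactly why the clustering key needs no separate sort column.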

Interviewer: What's the partition key for a 1:1 conversation between Alice and Bob?

Candidate: A deterministic hash of both user IDs, ensuring Alice's side and Bob's side always resolve to the same conversation: chat_id = hash(sort([alice_id, bob_id])). Sorting the IDs before hashing ensures the same result regardless of which user initiates. For group chats, chat_id is the group's UUID.
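One way to implement that derivation — SHA-256 and the `:` separator are assumptions here; the sort-before-hash step is the part that matters:

```python
import hashlib

def chat_id_for(user_a, user_b):
    """Deterministic 1:1 chat_id: same result regardless of who initiates."""
    lo, hi = sorted([user_a, user_b])
    return hashlib.sha256(f"{lo}:{hi}".encode()).hexdigest()
```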

Relational Data: PostgreSQL

User profiles, group memberships, friend relationships, and access control — this is relational data that benefits from ACID transactions and complex queries. PostgreSQL handles this well.

sql
-- Core tables (abbreviated for clarity)
users (user_id UUID PK, phone TEXT UNIQUE, display_name TEXT, last_seen TIMESTAMP)
conversations (conv_id UUID PK, type TEXT, created_at TIMESTAMP)
conv_members (conv_id UUID, user_id UUID, joined_at TIMESTAMP, last_read_msg_id BIGINT,
              PRIMARY KEY (conv_id, user_id))

last_read_msg_id on conv_members is how we compute unread counts: count messages where message_id > last_read_msg_id. No separate unread counter to update transactionally.
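The unread computation in miniature — no counter row to keep transactionally in sync, just a comparison against `last_read_msg_id` (values here are made up):

```python
conversation = [101, 102, 103, 104]    # message_ids in one conversation, ascending
last_read_msg_id = 102                 # from conv_members

# Unread = everything newer than the last message this user has read.
unread = sum(1 for m in conversation if m > last_read_msg_id)
```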

The Cassandra schema — why chat_id as partition key, why Snowflake over timestamp, how the 1:1 chat_id is derived — is one of the most probed sections in messaging system design interviews at Discord and Meta. The schema isn't hard once you understand the access pattern reasoning, but explaining why each decision was made under follow-up questioning is where preparation pays off. Mockingly.ai includes messaging schema follow-ups in its simulation library.


Message Delivery Guarantees

Interviewer: How do you guarantee messages are never lost, never delivered twice?

Candidate: We target at-least-once delivery with client-side idempotency — the same approach WhatsApp uses.

Never lost: the message is written to Cassandra before acknowledgement is sent to the sender. If the server crashes after the write but before delivery, the message is safe in Cassandra. When the recipient reconnects, the system delivers all messages where message_id > last_delivered_message_id.

At-least-once: we might deliver a message more than once in failure scenarios — for example, if the delivery confirmation is lost and we retry. This is acceptable; duplicates are better than lost messages.

Client-side deduplication: every message carries a client-generated client_msg_id UUID. The receiving client deduplicates on this ID before displaying. If the same message arrives twice, it's shown once. The client_msg_id is also what lets us handle the send-retry case — if Alice's client doesn't receive an ACK within 5 seconds, it retries the send with the same client_msg_id. The server checks whether it has already stored this ID (via a Redis cache of recent client IDs) and returns the existing result rather than creating a duplicate.
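The server-side idempotency check can be sketched as follows — the dict stands in for the Redis cache of recent client IDs, and the list for Cassandra rows; a retry with the same client_msg_id returns the original result instead of creating a second row.

```python
recent = {}      # client_msg_id -> server-assigned message_id (the Redis cache)
stored = []      # stands in for Cassandra rows

def store_message(client_msg_id, content):
    if client_msg_id in recent:          # duplicate send after a lost ACK
        return recent[client_msg_id]
    msg_id = len(stored) + 1             # stand-in for a Snowflake ID
    stored.append((msg_id, content))
    recent[client_msg_id] = msg_id
    return msg_id

first = store_message("uuid-abc", "Hey!")
retry = store_message("uuid-abc", "Hey!")   # client retried after a 5s timeout
```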

Message ordering: Snowflake IDs give us global time-ordering. Messages within a conversation are always displayed in Snowflake ID order, not in receive order. This means if Alice sends two messages quickly and the second arrives first due to network jitter, the UI still shows them in the correct order.


Group Messaging: The Fan-Out Problem

Interviewer: Alice sends a message to a group with 500 members. Walk me through what happens.

Candidate: Group messaging is where the architecture gets genuinely harder. A single message write needs to be delivered to potentially 500 different connections on potentially 500 different chat servers.

plaintext
1. Alice's client → Chat Server (via WebSocket)
2. Message Service:
   a. Persist message ONCE to Cassandra
      (chat_id = group_id, one row regardless of member count)
   b. ACK back to Alice → single tick ✓
3. Fan-out:
   a. Fetch all group member IDs from PostgreSQL (conv_members)
   b. For each member:
      - Look up their session in Redis: "Server X"
      - If online: publish to Redis channel user:{member_id}
      - If offline: enqueue for push notification
4. Each chat server receives its users' messages via Redis subscription
   → Pushes to individual WebSocket connections

Interviewer: For a 500-member group, that's 500 Redis publishes and 500 potential push notifications per message. If this group is active, how does the system not get overwhelmed?

Candidate: A few mitigations. First, we fan out asynchronously — the message is ACK'd to Alice immediately after Cassandra write. The fan-out happens in a background worker pool, so Alice's send latency is not affected by group size.

Second, for large groups we use Kafka as an intermediate fan-out queue. The Message Service publishes one event to Kafka with the group_id and message_id. A pool of Fan-Out Workers consume from Kafka, fetch member lists, and do the Redis publishes and push notifications in parallel. This decouples fan-out throughput from message ingestion throughput.

Third, for very large groups (say, broadcast channels with 10,000+ members), we switch strategies. Instead of fan-out-on-write (pushing to each member), we use fan-out-on-read: store one copy of the message, and have clients poll on open. This is the approach Telegram uses for very large channels.

Interviewer: What's the threshold between the two approaches?

Candidate: The exact threshold is configurable and depends on infrastructure cost vs. latency trade-off. A common heuristic: groups under 500 members use fan-out-on-write for low read latency; groups over 500 use fan-out-on-read. In practice, few groups actually reach 500 members, so the vast majority of groups use the simpler push model.


Presence System: Online/Offline Status

Presence seems like a small feature. At WhatsApp scale, it's a distributed systems problem.

What it needs to do: show each user whether their contacts are online, and when they were last seen. Updated in near-real-time.

Interviewer: How do you implement the presence system without hammering your database?

Candidate: Heartbeats and Redis TTLs. When a user is connected, their client sends a heartbeat ping every 30 seconds. The chat server updates a Redis key:

plaintext
SET presence:{user_id} "online" EX 60
SET last_seen:{user_id} {timestamp}

If no heartbeat arrives within 60 seconds (two missed heartbeats), the TTL expires and the key disappears. A query for presence:{user_id} returns nil — the user is offline. The last_seen key is updated on each heartbeat and persists (no TTL).
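The observable behaviour of the two keys can be simulated with explicit expiry timestamps instead of a live Redis — `SET ... EX 60` gives the same semantics from the reader's point of view.

```python
PRESENCE_TTL = 60

presence = {}    # user_id -> expiry timestamp (the TTL'd key)
last_seen = {}   # user_id -> last heartbeat timestamp (no TTL)

def heartbeat(user_id, now):
    presence[user_id] = now + PRESENCE_TTL
    last_seen[user_id] = now

def is_online(user_id, now):
    return presence.get(user_id, 0) > now   # an expired key reads as offline

heartbeat("bob", now=0)
```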

Interviewer: What happens when Alice opens a conversation with Bob? Does her client subscribe to Bob's presence?

Candidate: Yes, but selectively. The client subscribes to presence updates only for contacts currently visible on screen — not the full contact list. If Alice has 300 contacts, we're not fanning out presence changes to all 300 simultaneously. When she opens a conversation with Bob, her client tells the server "I want presence for Bob." The server subscribes her to presence:bob events. When she closes the conversation, she unsubscribes.

The naive approach — broadcasting every presence change to all contacts — doesn't scale. If Alice has 300 contacts and each of them changes status every few minutes, her connection receives a constant stream of updates that do nothing useful. Selective presence subscription keeps presence traffic proportional to what's actually visible on screen.

Presence design is a section where interviewers specifically probe "what if the user has 10,000 contacts?" to see if you've thought through the fan-out cost. The selective subscription answer — only subscribe to contacts currently on screen — is the right one, but delivering it crisply under that follow-up takes practice. Mockingly.ai surfaces exactly that kind of scaling follow-up in its simulations.


Offline Message Delivery

Interviewer: Bob's phone has been off for two days. He turns it on. How does he get his missed messages?

Candidate: Two paths.

Push notifications (FCM/APNs): when a message is sent to Bob while he's offline, the Notification Service sends a push notification via FCM (Android) or APNs (iOS). The push carries a lightweight payload — "You have N new messages from Alice" — not the message content itself. The notification wakes Bob's device and prompts him to open the app.

On-reconnect sync: when Bob's client reconnects and establishes a WebSocket connection, it sends its last_synced_message_id to the server. The Message Service queries Cassandra:

sql
SELECT * FROM messages
WHERE chat_id = :chat_id
  AND message_id > :last_synced_message_id
ORDER BY message_id ASC
LIMIT 50;

This runs for each conversation Bob participates in. The results are streamed back to the client. Bob's UI shows the missed messages in order.

For efficiency, we don't query every conversation on reconnect — only the ones with unread messages. The conv_members.last_read_msg_id field tells us whether a conversation has anything new.
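The reconnect sync can be sketched like this: skip conversations with nothing new, cap each at one page (50 messages, per the query above), and let older history load on scroll. Data shapes are illustrative.

```python
PAGE = 50

messages = {                      # chat_id -> ascending message_ids (Cassandra)
    "chat-1": list(range(1, 200)),
    "chat-2": [3, 4],
}

def sync(last_synced):            # chat_id -> last_synced_message_id per client
    out = {}
    for chat_id, ids in messages.items():
        missed = [m for m in ids if m > last_synced.get(chat_id, 0)]
        if missed:
            out[chat_id] = missed[-PAGE:]   # newest page only; rest on scroll
    return out

result = sync({"chat-1": 100, "chat-2": 4})   # chat-2 has nothing new
```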

Interviewer: What if Bob was offline for 30 days and has thousands of unread messages across hundreds of conversations?

Candidate: We paginate aggressively — fetch the last 50 messages per conversation first, load more on scroll. We also prioritise conversations with the most recent activity. The user doesn't need all 30 days loaded at once; they need the newest content immediately. Cold-loading the full history from Cassandra is deferred until the user explicitly scrolls back.


Multi-Device Sync

Interviewer: Alice has an iPhone, an iPad, and a MacBook. She reads a message on her iPhone. How do her other devices know?

Candidate: Multi-device is handled as a special case of fan-out. Each device is registered with a unique device ID, and the Session Service maps each device to its chat server connection. When Alice reads a message on her iPhone:

plaintext
iPhone client sends READ ACK { conv_id, message_id, device_id: "iphone" }
  → Message Service updates status in Cassandra
  → Publishes "read" event to all of Alice's devices:
      Publishes to channels: user:alice:ipad, user:alice:macbook
  → iPad and MacBook connections receive the event
  → Their UIs update the conversation as "read"

All of Alice's devices maintain separate WebSocket connections. The fan-out is to all her devices, not just one. The client-side deduplication on client_msg_id ensures that a message Alice sends on her iPhone doesn't appear twice when it syncs to her MacBook.

Interviewer: What if two of Alice's devices are both open and she sends a message from each simultaneously?

Candidate: Each message has a unique client_msg_id generated on-device, and the Snowflake ID assigned server-side provides canonical ordering. Both messages are persisted to Cassandra in the order they arrived. Both devices receive both messages via fan-out and display them in Snowflake ID order. There's no conflict — they're independent messages, even if sent within milliseconds of each other.


Media Messages

Interviewer: How do you handle image and video messages?

Candidate: Never route media through the chat server. The same pre-signed URL pattern used in any large-scale media system.

plaintext
1. Alice's client requests a pre-signed upload URL from the Media Service
   → Media Service returns: { uploadUrl (S3 pre-signed), mediaId }
2. Alice's client uploads directly to S3 (bypasses chat servers entirely)
3. Alice's client sends a message via WebSocket:
   { type: "image", mediaId: "s3-ref-xyz", thumbnail: "base64...", clientMsgId: "..." }
4. Server persists the message with media_url pointing to S3/CDN
5. Recipient receives the message and loads the image from CDN, not from chat server

The CDN serves media globally with low latency. Chat servers only handle small metadata messages, never binary payloads. A single 10MB video upload that routes through a chat server could block thousands of lightweight text messages; keeping media out-of-band prevents this.
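The shape of a pre-signed URL can be sketched as an HMAC over the path and expiry. Real S3 pre-signing uses SigV4 with more inputs; the secret, hostname, and query format here are all invented for illustration.

```python
import hashlib
import hmac

SECRET = b"media-service-signing-key"   # illustrative; a real key stays server-side

def presign(media_id, now, expires_in=900):
    """Return an upload URL valid until now + expires_in (seconds)."""
    expiry = now + expires_in
    payload = f"/media/{media_id}:{expiry}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"https://uploads.example.com/media/{media_id}?exp={expiry}&sig={sig}"

url = presign("s3-ref-xyz", now=0)
```

The storage service recomputes the HMAC on upload and rejects mismatched or expired signatures — so the client can upload directly without the chat tier ever seeing the bytes.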


End-to-End Encryption (Brief)

Interviewer: How would you add end-to-end encryption?

Candidate: The Signal Protocol is the industry standard — used by WhatsApp, Signal, and Google Messages. The key architectural implication is that the server becomes a dumb pipe: it routes ciphertext but never sees plaintext. Messages are encrypted on the sender's device using the recipient's public key and decrypted only on the recipient's device.

For the system design, the changes are: each device generates a key pair on registration. The server stores only public keys. The message content field in Cassandra stores opaque bytes (ciphertext). Server-side features like full-text search and content moderation become impossible, which is a real product trade-off. For a security-focused product, that's acceptable. For a business platform like Slack where admins need eDiscovery, E2EE is incompatible without significant additional complexity.


Scaling and Horizontal Architecture

Interviewer: How do you scale this to handle 100 million concurrent WebSocket connections?

Candidate: The chat server tier is the stateful component that doesn't scale with standard horizontal scaling. Here's the approach.

Chat servers: each handles 50,000 connections (a practical limit considering memory and file descriptors per server). At 100 million connections, we need ~2,000 chat servers. The load balancer distributes new connections evenly. Servers are added or removed without affecting existing connections until the connection is dropped.

Redis Pub/Sub cluster: sharded by user_id. All subscribe/publish operations for a given user always hit the same Redis shard. Consistent hashing distributes users across shards.

Message Service: stateless. Scales horizontally with no coordination. Each instance connects to Cassandra and Redis independently.

Cassandra cluster: partitioned by chat_id. With consistent hashing, conversations are distributed across nodes. Adding nodes redistributes load automatically.

The numbers: at 2,000 chat servers × 50,000 connections = 100M concurrent connections. Each server handles ~350 messages/sec at peak load (700K/sec ÷ 2,000 servers) — easily within a single server's capacity.


Common Interview Follow-ups

"How do you handle message ordering when two people in a group send simultaneously?"

Each message gets a Snowflake ID assigned by the server at write time. Snowflake IDs are time-ordered at millisecond granularity. All clients display messages sorted by Snowflake ID. There's no perfect global ordering for truly simultaneous messages — two messages within the same millisecond are ordered arbitrarily but consistently (the sequence number within the Snowflake ID breaks the tie). The same ordering is used by all clients, so everyone sees the same conversation sequence.

"What happens if the Redis Pub/Sub message is dropped before the recipient's server receives it?"

Redis Pub/Sub is fire-and-forget — it doesn't guarantee delivery. This is acceptable because Cassandra is the ground truth. When a recipient connects (or reconnects), they sync all messages with message_id > last_synced_id. Redis is the fast path for real-time delivery; Cassandra is the fallback. Any message not delivered via Redis is caught on the next sync. The "delivered" status isn't set until the client explicitly ACKs — so if a message is missed via Redis, it's not marked delivered until it's actually received.

"How do you handle a chat server crashing with 50,000 active connections?"

Clients detect the connection drop via heartbeat timeout (typically within 30–60 seconds). They reconnect to a new server via the load balancer and re-register in the Session Service. Messages sent to the old server's Pub/Sub channels are lost — but those messages were already persisted to Cassandra. On reconnect, the client syncs with its last_synced_id and recovers any missed messages. The worst case is a brief interruption and a slight delay in message delivery.

"How would you design read receipts for a 500-member group?"

Individual read receipts for 500 members per message creates a huge amount of data: 500 members × millions of messages = billions of receipt records. Most group chat apps simplify this. Store only a count on the message (delivered to N of 500, read by M of 500). Full "who exactly read it" is only fetched on demand when a user specifically taps to see it — a separate, lower-frequency API call. WhatsApp shows "read by 47 of 500" rather than individual avatars in the chat view for exactly this reason.

"What database would you use if you needed full-text search across message history?"

Cassandra can't do full-text search — it's a wide-column store with limited query flexibility. For search, index messages into Elasticsearch asynchronously. When a message is written to Cassandra, publish an event to Kafka; an Elasticsearch writer worker consumes it and indexes the message. Search queries go to Elasticsearch, returning message IDs, which are then fetched from Cassandra or served from the search result directly. The eventual-consistency gap between write and index (seconds to minutes) is acceptable for search.


Quick Interview Checklist

  • ✅ Clarified scope — 1:1 and group, scale, receipts, presence, media
  • ✅ Back-of-the-envelope — 230K messages/second, 100M concurrent connections, 2,000 chat servers
  • ✅ WebSockets justified over long polling and SSE — bidirectional, persistent, low overhead
  • ✅ L4 load balancer for sticky WebSocket connections
  • ✅ Session Service in Redis — maps user_id to chat server
  • ✅ Cross-server routing via Redis Pub/Sub — fast path, not Kafka
  • ✅ Persist-first architecture — Cassandra write before delivery, recovery on reconnect
  • ✅ Cassandra schema — partition by chat_id, cluster by Snowflake message_id DESC
  • ✅ Snowflake IDs over timestamps — collision-free, time-ordered
  • ✅ PostgreSQL for users, groups, memberships — relational data needs ACID
  • ✅ At-least-once delivery with client-side client_msg_id deduplication
  • ✅ Group fan-out — async worker pool, Kafka for large groups, fan-out-on-read threshold
  • ✅ Presence — Redis TTL heartbeat, selective subscription for visible contacts only
  • ✅ Offline delivery — FCM/APNs push + reconnect sync with last_synced_message_id
  • ✅ Multi-device sync — fan-out to all of a user's devices, Snowflake ordering
  • ✅ Media — pre-signed S3 URL, never through chat servers, CDN delivery
  • ✅ E2E encryption — Signal Protocol, server stores ciphertext, trade-offs acknowledged
  • ✅ Horizontal scaling — 2,000 chat servers, sharded Redis, Cassandra consistent hashing

Conclusion

Designing a real-time messaging system is the canonical distributed systems interview question because it touches almost every major challenge in the field simultaneously: stateful connections that don't scale like stateless services, message ordering in a distributed environment, delivery guarantees under network failures, fan-out at enormous scale, and presence tracking without database hammering.

The candidates who do best at Meta, Google, Amazon, and Slack aren't the ones who memorise the correct answer. They're the ones who reason clearly about why each decision is made: why Redis for routing instead of Kafka (latency), why Cassandra for messages instead of PostgreSQL (write throughput and horizontal scale), why persist-first before delivery (durability), why selective presence subscription instead of broadcasting to all contacts (fan-out cost).

The design pillars:

  1. Stateful WebSocket gateway, stateless everything else — chat servers hold connections; message and session services are stateless and scale independently
  2. Session Service in Redis — the single source of truth for which user is on which server
  3. Redis Pub/Sub for real-time routing, Cassandra for durability — fast path and ground truth serve different purposes
  4. Cassandra partitioned by chat_id, Snowflake message_id — all messages in a conversation co-located, no collision risk, natural time ordering
  5. Persist-first, then route — a message is durable before it's delivered; recovery on reconnect catches anything the real-time path missed
  6. At-least-once + client dedup — better to receive a message twice than to never receive it
  7. Async fan-out via Kafka for groups — decouples message ingestion latency from group size


Frequently Asked Questions

How does a message get from one chat server to another in real time?

Redis Pub/Sub routes messages between chat servers. Each chat server subscribes to a channel for each user it's hosting. When a message arrives for a user on a different server, the sending server publishes to that user's channel.

The full sequence:

  1. Alice (on Server 1) sends a message to Bob
  2. Server 1 writes the message to Cassandra — durability first
  3. Server 1 ACKs to Alice — the single-tick ✓
  4. Server 1 queries the Session Service: GET session:bob → "Server 7"
  5. Server 1 publishes to Redis: PUBLISH user:bob {message_payload}
  6. Server 7 is subscribed to user:bob — receives the event
  7. Server 7 pushes the message to Bob's WebSocket connection
  8. Bob's client sends a DELIVERED ACK — the double-tick ✓✓
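The eight steps can be condensed into a runnable sketch, with in-memory stand-ins for Cassandra, the Session Service, and Redis Pub/Sub; every store, channel, and server name here is illustrative:

```python
cassandra = []                      # durable message log (the ground truth)
sessions = {"bob": "server-7"}      # Session Service: user -> chat server
subscribers = {}                    # Redis Pub/Sub: channel -> callback

def subscribe(channel, callback):
    subscribers[channel] = callback

def publish(channel, payload):
    """Fire-and-forget, like Redis Pub/Sub: no delivery guarantee."""
    if channel in subscribers:
        subscribers[channel](payload)

def send_message(sender, recipient, text):
    msg = {"from": sender, "to": recipient, "text": text, "status": "sent"}
    cassandra.append(msg)                   # step 2: durability first
    # step 3: the single-tick ACK happens here, before delivery is attempted
    if sessions.get(recipient):             # step 4: session lookup
        publish(f"user:{recipient}", msg)   # steps 5-7: route and deliver
    return "sent"

# Server 7's side: Bob's WebSocket handler, registered as a subscriber.
bob_inbox = []
def bob_socket(msg):
    bob_inbox.append(msg)
    msg["status"] = "delivered"             # step 8: the DELIVERED ACK

subscribe("user:bob", bob_socket)
```

If the `publish` finds no subscriber, nothing happens on the fast path — which is fine, because the message is already in `cassandra` and will be picked up by the reconnect sync.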

Why Redis Pub/Sub and not Kafka for this path:

  1. Kafka's end-to-end latency is typically tens of milliseconds (broker persistence, consumer polling) — acceptable for durable async processing, but a needless tax on a path with a sub-500ms delivery budget
  2. Redis Pub/Sub delivers in sub-millisecond — purpose-built for this fan-out pattern
  3. Durability is already handled by Cassandra (step 2) — Redis doesn't need to be durable here

Why not direct server-to-server calls:

At 2,000 chat servers, a full mesh of direct connections would require roughly 2 million TCP connections (n(n−1)/2), with every server maintaining 1,999 peer connections. Not scalable.


Why is Cassandra used for message storage instead of PostgreSQL?

Cassandra's LSM-tree storage engine is designed for the exact write pattern that messaging generates — append-only inserts at high throughput, queried by partition in time order. PostgreSQL's B-tree storage degrades under sustained high-volume writes.

The messaging write pattern:

  1. 230,000 new message rows per second — all inserts, never updates
  2. Each insert goes to the "end" of a conversation partition — sequential writes
  3. Primary read: "give me the last 50 messages in conversation X" — a partition range scan

Why PostgreSQL fails for this at scale:

  1. Every insert updates a B-tree index — at 230K/second, index contention becomes a bottleneck
  2. A busy conversations table means frequent page splits and fragmentation
  3. At 2 TB of new messages per day, PostgreSQL's vacuum overhead and checkpoint I/O become significant

Why Cassandra fits:

  1. LSM-tree writes are sequential appends to a memtable — no index updates at write time
  2. chat_id as partition key collocates all messages in a conversation on one node
  3. message_id DESC clustering order means the most recent messages come back first with no sort
  4. Horizontal scaling via consistent hashing — add nodes, redistribute automatically

What PostgreSQL still handles: user profiles, group memberships, friend relationships — structured relational data with ACID requirements. The two databases serve different access patterns.


What is the Cassandra schema for storing chat messages?

The message table partitions by chat_id (conversation identifier) and clusters by message_id in descending order, so the most recent messages are returned first without sorting.

sql
CREATE TABLE messages (
    chat_id     TEXT,      -- partition key: conversation ID
    message_id  BIGINT,    -- clustering key: Snowflake ID (time-ordered)
    sender_id   TEXT,
    content     TEXT,
    media_url   TEXT,
    msg_type    TEXT,
    status      TEXT,
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Three schema decisions worth explaining in an interview:

  1. chat_id as partition key — collocates all messages in a conversation on one Cassandra node. "Fetch conversation history" is a single-node range scan, not a distributed query
  2. Snowflake ID, not a timestamp — timestamps have millisecond collisions in high-volume systems; Cassandra's primary key must be unique. Snowflake IDs are globally unique and time-sortable
  3. message_id DESC clustering — returns most recent messages first without an ORDER BY, reducing read latency
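A minimal Snowflake-style generator, assuming the common 41/10/12-bit split (millisecond timestamp, worker ID, per-millisecond sequence); the epoch constant is an arbitrary fixed value:

```python
import threading
import time

EPOCH_MS = 1288834974657  # any fixed custom epoch works

class Snowflake:
    """Globally unique, time-sortable 64-bit IDs: 41 bits of time,
    10 bits of worker ID, 12 bits of sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024          # must fit in 10 bits
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:            # guard against clock regression
                now = self.last_ms
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:             # 4096 IDs this ms: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts them by creation time — exactly the property the clustering key relies on.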

How the chat_id is derived for 1:1 conversations:

plaintext
chat_id = hash(sort([user_a_id, user_b_id]))

Sorting the IDs before hashing ensures the same chat_id regardless of which user initiates — Alice-Bob and Bob-Alice produce the same conversation.
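A minimal sketch of that derivation; the choice of SHA-256 and the 16-character truncation are illustrative, not prescribed:

```python
import hashlib

def chat_id_for_pair(user_a: str, user_b: str) -> str:
    """Deterministic 1:1 chat ID: sort the pair, then hash."""
    lo, hi = sorted([user_a, user_b])
    return hashlib.sha256(f"{lo}:{hi}".encode()).hexdigest()[:16]
```

Either participant can compute the same `chat_id` locally, with no coordination and no lookup table.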


What is the "persist-first" architecture and why does it matter?

Persist-first means writing a message to durable storage (Cassandra) before attempting delivery to the recipient and before acknowledging to the sender. It guarantees no message is ever lost, even if every server involved crashes immediately after.

The failure scenario that makes it necessary:

  1. Message arrives at Server 1
  2. Server 1 tries to route to Server 7 (where Bob is)
  3. Server 1 crashes before writing to Cassandra
  4. Result: the message is gone — neither Bob nor Cassandra has it

With persist-first:

  1. Message arrives at Server 1
  2. Server 1 writes to Cassandra → durable
  3. Server 1 ACKs to Alice (single tick ✓) — she knows the message is safe
  4. Server 1 attempts delivery to Bob
  5. Server 1 crashes — the message is already in Cassandra
  6. When Bob reconnects, he queries Cassandra for messages with message_id > last_synced_id and receives the message

The ACK timing matters: Alice receives the single-tick acknowledgement after the Cassandra write but before Bob receives the message. This is the correct semantics — "sent" means "the system has it", not "Bob received it."


How does the group chat fan-out work for 500 members?

Group fan-out writes one message to Cassandra, then delivers it asynchronously to each group member via Redis Pub/Sub (online members) or push notification (offline members).

The flow:

  1. Alice sends a message to a 500-member group
  2. Message Service writes one row to Cassandra — chat_id = group_id
  3. ACK to Alice immediately — fan-out does not block the send latency
  4. Async fan-out (via Kafka workers):
    • Fetch all 500 member IDs from PostgreSQL
    • For each member: check Redis session → if online, publish to user:{member_id}; if offline, enqueue push notification
  5. Each chat server delivers to its connected members

Why Kafka for group fan-out, not direct Redis publish:

  1. Publishing 500 Redis events synchronously would add ~50ms to Alice's send path — unacceptable
  2. Kafka absorbs the burst: Alice's message triggers one Kafka event; a pool of Fan-Out Workers process the 500 Redis publishes in parallel
  3. Fan-out throughput scales independently of message ingestion throughput
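The ingest/worker split can be sketched with a deque standing in for the Kafka topic and plain dicts for the PostgreSQL membership table and Redis sessions; all group, user, and server names here are illustrative:

```python
from collections import deque

fanout_topic = deque()                                   # stands in for Kafka
group_members = {"g1": [f"user{i}" for i in range(5)]}   # from PostgreSQL
online_sessions = {"user0": "s1", "user2": "s3"}         # Redis session map
published, push_queue = [], []

def ingest_group_message(group_id: str, payload: str) -> str:
    """Send path: enqueue ONE event, then ACK — no per-member work here."""
    fanout_topic.append((group_id, payload))
    return "sent"

def fanout_worker() -> None:
    """Fan-Out Worker: resolve members, route online vs offline."""
    while fanout_topic:
        group_id, payload = fanout_topic.popleft()
        for member in group_members[group_id]:
            if member in online_sessions:
                published.append((f"user:{member}", payload))  # Redis publish
            else:
                push_queue.append((member, payload))           # FCM/APNs
```

The sender's latency is one append regardless of group size; the per-member cost lands entirely on the worker pool, which scales independently.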

Fan-out-on-read for very large groups (10,000+ members):

For extremely large groups (broadcast channels), skip the per-member push entirely. Store one message copy; clients fetch on open. This is how Telegram handles large channels — fan-out cost is zero; slightly higher read latency per client.


How does the presence system work at WhatsApp scale?

Presence uses Redis TTLs driven by client heartbeats. A user is "online" if their presence key exists in Redis; "offline" if it has expired.

How it works:

  1. Connected clients send a heartbeat ping every 30 seconds
  2. On each heartbeat: SET presence:{user_id} online EX 60 (60-second TTL)
  3. A user is online if GET presence:{user_id} returns a value
  4. If no heartbeat arrives within 60 seconds, the key expires → user appears offline
  5. On each heartbeat the service also runs SET last_seen:{user_id} {timestamp} with no TTL, so the value persists as the user's "last seen" time
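The heartbeat/TTL mechanics can be sketched with explicit timestamps instead of real Redis key expiry; the TTL constant matches the EX 60 above, and the function names are illustrative:

```python
PRESENCE_TTL = 60                 # seconds, matching EX 60 above
presence = {}                     # user_id -> expiry time (the TTL'd key)
last_seen = {}                    # user_id -> last heartbeat time, no TTL

def heartbeat(user_id: str, now: float) -> None:
    presence[user_id] = now + PRESENCE_TTL   # SET presence:{id} online EX 60
    last_seen[user_id] = now                 # SET last_seen:{id} {timestamp}

def is_online(user_id: str, now: float) -> bool:
    """Online iff the presence key hasn't expired yet."""
    expiry = presence.get(user_id)
    return expiry is not None and expiry > now
```

In Redis the expiry check is implicit — the key simply disappears — but the observable behaviour is the same: one missed heartbeat window flips the user to offline, while `last_seen` survives.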

Why selective subscription instead of broadcasting:

At WhatsApp scale, broadcasting every user's presence change to all their contacts would generate enormous traffic. If Alice has 300 contacts and each changes status every few minutes, that's thousands of presence events per minute for Alice alone.

Instead:

  1. Clients subscribe to presence updates only for contacts currently visible on screen
  2. When Alice opens a conversation with Bob: subscribe to presence:bob
  3. When Alice navigates away: unsubscribe from presence:bob

Presence traffic scales proportionally to what's actually displayed, not to total contact count.


How are offline messages delivered when a user reconnects?

Offline message delivery uses two paths: push notifications to wake the device, and on-reconnect sync to fetch all missed messages.

Push notifications (while device is offline):

  1. Message arrives for Bob while he's offline
  2. Notification Service sends via FCM (Android) or APNs (iOS)
  3. Payload: "You have N new messages from Alice" — not the message content
  4. Push notification wakes the device and prompts app open

On-reconnect sync:

When Bob's client reconnects and establishes a WebSocket:

  1. Client sends last_synced_message_id for each conversation
  2. Message Service queries Cassandra:
    sql
    SELECT * FROM messages
    WHERE chat_id = :chat_id
      AND message_id > :last_synced_message_id
    ORDER BY message_id ASC
    LIMIT 50;
  3. Results streamed back to client for each conversation with undelivered messages
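The sync query reduces to a bounded range scan; here is a sketch with an in-memory, ascending-sorted partition standing in for Cassandra (chat ID and message IDs illustrative):

```python
# chat_id -> list of (message_id, text), sorted ascending by message_id,
# standing in for a Cassandra partition.
messages = {"chat42": [(100, "hi"), (101, "you there?"), (102, "ping")]}

def sync_since(chat_id: str, last_synced_id: int, limit: int = 50):
    """WHERE message_id > :last_synced_message_id ... LIMIT 50, in miniature."""
    newer = [m for m in messages.get(chat_id, []) if m[0] > last_synced_id]
    return newer[:limit]
```

Because Snowflake message IDs are time-ordered, "everything after my last synced ID" is exactly "everything I missed", with no separate delivered-flag bookkeeping needed on the server.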

For a user offline 30+ days:

  1. Paginate aggressively — fetch the last 50 messages per conversation first
  2. Prioritise conversations with most recent activity
  3. Load older history lazily as the user scrolls — Cassandra range scans are fast and bounded

How does multi-device sync work in a messaging system?

Multi-device sync treats each device as a separate recipient in the fan-out. Every device the user owns maintains an independent WebSocket connection and is registered separately in the Session Service.

How it works:

  1. Each device has a unique device_id stored alongside the user's session
  2. When a message arrives for Alice: fan-out delivers to all of Alice's active devices
    • user:alice:iphone channel → iPhone connection
    • user:alice:ipad channel → iPad connection
    • user:alice:macbook channel → MacBook connection
  3. Each device maintains its own last_synced_message_id
  4. Offline devices sync on reconnect using the per-device last-synced pointer

Read receipts across devices:

When Alice reads a conversation on her iPhone:

  1. iPhone sends a READ ACK with the message ID and device ID
  2. Message Service publishes a "read" event to Alice's other devices
  3. iPad and MacBook mark the conversation as read — no action needed from those devices

Deduplication for sent messages:

If Alice sends a message from her iPhone and it syncs to her MacBook, the MacBook receives it as a fan-out event (not as a "you sent this" — just a new message in the conversation). The client_msg_id deduplicates it if it somehow arrives twice.
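Client-side dedup is just a membership check on client_msg_id before applying a message; a minimal sketch (a real client would bound or expire the set rather than grow it forever):

```python
seen_ids = set()   # client_msg_ids already applied to this conversation
timeline = []      # the rendered message list

def deliver(client_msg_id: str, text: str) -> bool:
    """Apply a message exactly once; drop at-least-once duplicates."""
    if client_msg_id in seen_ids:
        return False               # duplicate: silently discard
    seen_ids.add(client_msg_id)
    timeline.append(text)
    return True
```

This is the piece that makes at-least-once delivery safe: the transport is free to retry, because the client renders each client_msg_id at most once.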


Which companies ask the real-time messaging system design question?

Meta, Google, Amazon, Microsoft, Slack, Discord, and Telegram ask variants of this question for senior software engineer roles.

Why it is the most commonly asked hard system design question:

  1. Touches every distributed systems concept — stateful WebSocket management, cross-server routing, message ordering, delivery guarantees, fan-out at scale, and presence tracking all appear in one question
  2. Scales to seniority — a mid-level answer describes "use WebSockets and a database"; a senior answer covers persist-first architecture, why Redis Pub/Sub beats Kafka for routing latency, Cassandra partition key reasoning, and selective presence subscription
  3. Directly maps to real products — every company on this list either builds or operates a real-time messaging system

What interviewers specifically listen for:

  1. Persist-first, then route — the message is in Cassandra before ACKing to the sender
  2. Redis Pub/Sub for routing, not Kafka — with the latency reasoning (sub-millisecond vs milliseconds)
  3. chat_id as Cassandra partition key — and why it collocates conversation messages on one node
  4. Snowflake ID over timestamp — and the specific collision failure mode timestamps create
  5. Selective presence subscription — not broadcasting to all contacts, and the fan-out cost reasoning

If any of those five feel uncertain when explaining them live, Mockingly.ai runs real-time messaging system design simulations — with follow-ups on group fan-out, Redis reliability, and Cassandra schema — built for engineers targeting roles at Meta, Google, Amazon, Slack, and Discord.


Designing a messaging system under 45 minutes of interview pressure — while explaining trade-offs, drawing components, and fielding follow-ups on group fan-out and Redis Pub/Sub reliability — is a genuinely different skill from understanding it on paper. If you want to close that gap before your actual interview, Mockingly.ai has backend system design simulations built for engineers preparing for senior roles at Meta, Google, Amazon, Slack, Discord, and beyond.
