Meta Interview Question

Design a Distributed Metrics Logging and Aggregation System — Meta Interview

Hard · 20 min · Backend System Design

How Meta Tests This

Meta (Facebook) interviews focus heavily on social graph systems, real-time messaging, content delivery at scale, and feed ranking algorithms. Their system design rounds test your ability to design products used by billions of people daily.

Interview focus: Social feeds, messaging systems, content delivery, real-time features, and collaborative tools.

Key Topics
metrics, logging, observability, time series, kafka, prometheus, influxdb

How to Design a Distributed Metrics, Logging and Aggregation System

This question shows up everywhere — Google, Amazon, Meta, LinkedIn, Stripe, and especially at observability-focused companies like Datadog. When it does, it tends to fool candidates who haven't prepared for it specifically.

The problem looks deceptively familiar. It's "just" a write-heavy data pipeline, right? Collect some numbers, store them, query them. But once you get into the details — push vs. pull collection, the cardinality explosion that breaks naive storage, the challenge of firing an alert at exactly the right time without thousands of false positives, the cost of storing raw 10-second data for two years — you're in much harder territory.

This guide covers the full design, including the parts that most articles skip: why time-series databases exist and how they store data differently from a regular database, what cardinality actually means and why it's the silent killer in metrics systems, how downsampling keeps storage costs from exploding, and what the alerting pipeline actually looks like end-to-end.


Clarify the Scope First

This question is broad enough that two people can answer it completely differently and both be right. Spend two minutes figuring out exactly what you're building.

Good questions to ask:

  • Are we designing primarily for metrics, logs, or both? They have different access patterns and storage needs.
  • Is this an internal tool for engineers and SREs, or a multi-tenant platform sold to external customers?
  • What scale? A company running 500 servers, or a platform monitoring 100,000 customer environments like Datadog?
  • Do we need real-time alerting, or is query-based analysis the primary use case?
  • What's the expected data retention — days, months, or years?
  • Do we need dashboards and visualisation, or just storage and querying?

For this guide: an internal observability platform serving a large-scale company with thousands of services. It needs to collect infrastructure and application metrics, store them efficiently, support querying over time ranges, and fire alerts when thresholds are crossed. Log storage is in scope but treated as a separate storage path.


Distinguishing Metrics from Logs

Before requirements, get this distinction clear. Metrics and logs are both observability data, but they're fundamentally different — and conflating them leads to a muddled design.

Metrics are numerical measurements sampled over time: CPU usage at 72%, HTTP request count at 1,440 per second, memory used at 4.2GB. They're small, regular, and high-frequency. A server might emit 200 metrics every 15 seconds. The interesting thing about metrics is their aggregated behaviour over time — the trend, the spike, the moving average.

Logs are structured or unstructured text events: "User 12345 logged in from IP 203.0.113.1 at 14:23:07." They're variable-size, irregular, and need full-text search. A single request might emit several log lines. The interesting thing about a log is its content — the error message, the stack trace, the request ID.

The storage and query patterns are completely different. A metrics system answers "what was the average CPU usage on this host group over the last hour?" A log system answers "show me all error logs from service X containing the string 'timeout' in the last 30 minutes."

Designing both in one answer is fine — but keep the storage paths separate and explain why.


Requirements

Functional

  • Collect metrics from thousands of hosts and services (CPU, memory, disk, custom app metrics)
  • Store metrics with millisecond or second-level granularity
  • Query metrics over arbitrary time ranges with aggregation (sum, avg, max, p99)
  • Collect, store, and search log data from all services
  • Fire alerts when metric values cross configured thresholds
  • Expose a dashboard/visualisation layer for engineers

Non-Functional

  • High write throughput — metrics arrive constantly from thousands of sources; the system must absorb spikes
  • Low query latency for recent data — dashboards should load in under a second for the last hour
  • Long retention with controlled cost — old data should be retained but at reduced resolution
  • High availability — incidents are exactly when engineers need their dashboards, so the system must degrade gracefully (stale dashboards, delayed ingestion) rather than go fully dark
  • Fault tolerance — losing a single node should not cause data loss

Back-of-the-Envelope Estimates

Let's work through the numbers.

plaintext
Scale assumptions:
  Monitored hosts/services:    10,000
  Metrics per host:            200 (CPU, memory, disk, network, app metrics)
  Collection interval:         10 seconds
 
Write throughput:
  10,000 hosts × 200 metrics / 10s = 200,000 data points per second
  Each data point: ~50 bytes (metric name + labels + timestamp + value)
  Ingestion rate: 200,000 × 50 bytes = ~10 MB/second
 
Daily storage (raw):
  200,000 points/sec × 86,400s × 50 bytes ≈ 864 GB/day
  After compression (time-series data compresses ~10:1): ~86 GB/day
  One year, compressed: ~31 TB
 
Log storage:
  10,000 services × 1 KB average log line × 100 lines/sec average
  = ~1 GB/second raw log throughput
  One day of logs: ~86 TB (compressed: ~20–30 TB)
 
Query throughput:
  Assuming 500 engineers running dashboards and alert checks
  Read QPS: ~5,000 queries/second (mix of recent and historical)

The numbers reveal two important things. First, the metrics write path is intense but manageable with the right storage engine. Second, log storage at raw retention would be catastrophically expensive — tiering and retention policies are essential, not optional.
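The arithmetic above is easy to get wrong under interview pressure, so it helps to see it laid out explicitly. This is just the same assumptions from the estimate, restated as a few lines of Python:

```python
# Sanity-check of the back-of-the-envelope numbers above.
# All figures are the assumptions stated in the text, nothing new.
HOSTS = 10_000
METRICS_PER_HOST = 200
INTERVAL_S = 10
BYTES_PER_POINT = 50

points_per_sec = HOSTS * METRICS_PER_HOST // INTERVAL_S            # 200,000
ingest_mb_per_sec = points_per_sec * BYTES_PER_POINT / 1e6         # ~10 MB/s
raw_gb_per_day = points_per_sec * 86_400 * BYTES_PER_POINT / 1e9   # ~864 GB/day

print(points_per_sec, ingest_mb_per_sec, round(raw_gb_per_day))
```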


High-Level Architecture

plaintext
┌────────────────────────────────────────────────────────────┐
│              Data Sources                                   │
│   Servers / Containers / Applications / Databases          │
│   (instrumented with agents: Telegraf / Prometheus         │
│    exporters / custom SDK)                                  │
└───────────────────────┬────────────────────────────────────┘
                        │  push or pull
┌───────────────────────▼────────────────────────────────────┐
│              Collection Layer                               │
│   Metric Collectors (Prometheus scrapers / StatsD)         │
│   Log Shippers (Fluentd / Fluent Bit / Logstash)           │
└───────────────────────┬────────────────────────────────────┘

┌───────────────────────▼────────────────────────────────────┐
│              Ingestion Buffer                               │
│              Apache Kafka                                   │
│   Topic: raw-metrics   Topic: raw-logs                     │
└─────────────┬──────────────────────────┬───────────────────┘
              │                          │
┌─────────────▼──────────┐   ┌──────────▼──────────────────┐
│   Metrics Processor     │   │   Log Processor              │
│   (stream aggregation,  │   │   (parsing, structuring,     │
│    downsampling rules,  │   │    indexing)                 │
│    label enrichment)    │   │                              │
└─────────────┬──────────┘   └──────────┬──────────────────┘
              │                          │
┌─────────────▼──────────┐   ┌──────────▼──────────────────┐
│   Time-Series DB        │   │   Log Storage                │
│   (InfluxDB / Prometheus│   │   (Elasticsearch)            │
│    / VictoriaMetrics)   │   │                              │
│   Hot tier: NVMe SSD    │   │   Hot: recent 7 days         │
│   Warm tier: HDD        │   │   Cold: S3 + Glacier         │
│   Cold tier: S3         │   │                              │
└─────────────┬──────────┘   └──────────┬──────────────────┘
              │                          │
┌─────────────▼─────────────────────────▼──────────────────┐
│              Query Layer                                   │
│   Query Service (caching, routing, access control)        │
└─────────────┬────────────────────────────────────────────┘

┌─────────────▼─────────────────────────────────────────────┐
│              Consuming Services                            │
│   Alert Engine    │    Dashboards    │    External APIs    │
│   (threshold,     │    (Grafana /    │                     │
│    anomaly)       │    custom UI)    │                     │
└────────────────────────────────────────────────────────────┘

The Collection Layer: Push vs. Pull

This is one of the most commonly examined trade-offs in this interview. You need to explain both models, their advantages, and when each breaks down.

The Pull Model (Prometheus-style)

The collection server initiates the connection. Each monitored service exposes an HTTP /metrics endpoint. The collector scrapes it at a fixed interval — typically every 15 or 30 seconds.

Advantages:

  • The collector is in control. You can immediately tell if a target is down — if the scrape fails, you know.
  • Debugging is trivial. Engineers can curl /metrics themselves to see what a service is currently reporting.
  • No configuration needed on services. Just expose the endpoint; the collector discovers targets via service discovery (Kubernetes, Consul, etc.).
  • Simple horizontal scaling. Run two Prometheus servers with the same config and you get HA for free.

Where it breaks down:

  • Short-lived jobs (batch jobs, cron tasks, serverless functions) might die before the collector gets a chance to scrape them. Prometheus solves this with a Pushgateway — a buffer the job pushes to, which Prometheus then scrapes. But Pushgateway becomes a single point of failure if overused.
  • Behind strict firewalls or NAT where the collector can't reach the service.
  • Very high-frequency events (millions per second). The pull model is for aggregated state snapshots, not individual event streams.
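To make the pull model concrete, this is roughly what a scraper sees when it hits a /metrics endpoint: plain text, one line per series, in the Prometheus exposition format. The metric names below are illustrative examples, and this renderer is a sketch, not the official client library:

```python
# Render metrics in the Prometheus text exposition format -- the payload a
# collector receives when it scrapes GET /metrics. Names are hypothetical.
def render_exposition(metrics):
    lines = []
    for name, (labels, value) in metrics.items():
        # Sort labels so the same series always renders identically.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

payload = render_exposition({
    "http_requests_total": ({"service": "auth", "method": "GET"}, 1440.0),
    "process_cpu_seconds_total": ({"service": "auth"}, 912.3),
})
print(payload)
```

The plain-text format is part of why debugging is trivial: a human can read the scrape payload directly.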

The Push Model (StatsD / Datadog agent / Telegraf)

Services actively send metrics to a collection endpoint. An agent on each host or a library in each service buffers metrics locally and pushes them periodically to a central collector.

Advantages:

  • Works for ephemeral services — they push before they die.
  • Easier to fan out to multiple destinations (push to Kafka, push to a backup system simultaneously).
  • Works behind firewalls — services initiate the outbound connection.
  • Natural fit for high-frequency counters. StatsD aggregates millions of events per second locally before sending a one-second summary.

Where it breaks down:

  • The collector can't easily detect if a service stopped sending — silence could mean the service is down or it just has nothing to report.
  • If the collector is slow or unavailable, agents need to buffer locally or drop data. Push creates backpressure that pull inherently avoids.
  • Services need to know where to send metrics. In a dynamic environment, this configuration has to be managed.
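The StatsD-style local aggregation mentioned above is the key trick that makes push viable at high event rates. A minimal sketch of the idea, with the metric name purely illustrative:

```python
# StatsD-style local aggregation: count individual events in memory and emit
# one summary per flush interval instead of one packet per event.
from collections import defaultdict

class CounterBuffer:
    def __init__(self):
        self.counts = defaultdict(int)

    def incr(self, name, n=1):
        self.counts[name] += n          # cheap in-process increment, no I/O

    def flush(self):
        # Swap in a fresh buffer and return the snapshot; a real agent would
        # send this over UDP or to Kafka on a timer.
        snapshot, self.counts = dict(self.counts), defaultdict(int)
        return snapshot

buf = CounterBuffer()
for _ in range(1_000_000):
    buf.incr("http.requests")           # a million events...
snapshot = buf.flush()
print(snapshot)                         # ...become one data point per flush
```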

The Hybrid Answer

The pragmatic approach is to use pull for long-lived services where it works elegantly, and push for ephemeral processes and network-restricted sources — let the architecture fit the problem.

In practice, most large systems use a hybrid: Prometheus-style pull for stable services in Kubernetes, push via agents (Telegraf, Datadog agent) for hosts and batch jobs. Kafka sits in the middle as a buffer regardless of which collection model is used.

The push vs pull question is one where interviewers don't just want the trade-offs listed — they want to see you reason about which fits a specific scenario. "What if the service is behind a firewall?" or "What about serverless functions?" are the follow-ups that reveal whether you've actually thought it through. Mockingly.ai has observability system design simulations where exactly this kind of probing happens.


The Ingestion Buffer: Kafka

Whatever collection model you use, Kafka sits between collectors and storage. This isn't optional at scale — it's what keeps a slow write to the time-series database from causing a collector to block or drop data.

Two topics:

  • raw-metrics: metric data points (name, labels, timestamp, value)
  • raw-logs: raw log lines from all services

Why Kafka specifically?

  • Decouples producers (collectors) from consumers (processors and storage writers). If the TSDB is slow for 30 seconds, metrics buffer in Kafka rather than being dropped.
  • Allows multiple consumers. The same metric stream can feed the TSDB writer, the alert engine, and a real-time anomaly detector simultaneously.
  • Partitioned by metric name or service, keeping all data for a given metric on the same partition. This makes downstream stream aggregation much simpler.
  • Durable. A week of metric data buffered on disk is a real safety net during an outage.

Partitioning strategy: partition raw-metrics by a hash of the metric name and its label set. All data points for cpu.usage{host="web-01", region="us-east-1"} land on the same partition, making stream-level aggregation (calculating a 1-minute average in the processor) straightforward.
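A sketch of that partitioner, assuming a stable hash over the metric name and sorted label set (sorting matters: the same series must map to the same partition regardless of label order):

```python
# Deterministic partition selection: hash the metric name plus its sorted
# label set so every data point for a given series lands on one partition.
import hashlib

def partition_for(name, labels, num_partitions):
    key = name + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

labels = {"host": "web-01", "region": "us-east-1"}
p1 = partition_for("cpu.usage", labels, 64)
p2 = partition_for("cpu.usage", {"region": "us-east-1", "host": "web-01"}, 64)
assert p1 == p2   # label order doesn't matter; the series always maps the same
```

Note the use of a content hash rather than Python's built-in `hash()`, which is randomised per process and would route the same series differently on different collectors.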


Time-Series Databases: What Makes Them Different

This is where most candidates give a vague "use InfluxDB or Prometheus" answer. Interviewers at Datadog or Google want to know why a time-series database exists and what it actually does differently.

Why Not Just Use PostgreSQL?

A naive schema would be:

sql
CREATE TABLE metrics (
    name      TEXT,
    labels    JSONB,
    timestamp TIMESTAMPTZ,
    value     FLOAT8
);

This works until you have 200,000 data points per second. Then it falls apart because:

  • Writes hit B-tree indexes on every insert, causing write amplification
  • Range queries (WHERE timestamp BETWEEN x AND y) on a row-oriented table read far more data than needed — each row contains all columns, even though queries only care about timestamp and value
  • The same metric name gets stored as a string millions of times

Time-series databases are engineered around one insight: time-series data is always written in timestamp order, queried in time ranges, and never updated. That access pattern allows specific optimisations that general-purpose databases can't make.

How InfluxDB's TSM Engine Works

InfluxDB is built on a custom storage engine, the Time-Structured Merge Tree (TSM): a variant of the log-structured merge tree with a write-ahead log, sharded by time and optimised specifically for time-series data.

The key ideas:

Time-based sharding. Data is organised into shards, each covering a fixed time window (e.g., one day). When you query a time range, the engine only opens the relevant shards. Data outside the query window is never touched.

Column-oriented storage within a shard. All values for the same metric are stored together in a column — not interleaved with other metrics. A query for the last hour of cpu.usage reads one continuous block from disk, not millions of scattered rows.

Delta and run-length encoding. Since timestamps are sequential and values often change slowly, delta encoding (storing the difference between consecutive timestamps rather than absolute values) achieves very high compression. Prometheus achieves approximately 1.3 bytes per sample through compression — a raw 64-bit float takes 8 bytes. This compression is only possible because the temporal ordering is guaranteed.
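The delta idea is simple enough to sketch in a few lines. This is an illustration of the principle, not InfluxDB's or Prometheus's actual encoder (both add further tricks like delta-of-delta and XOR float compression):

```python
# Delta encoding for regular timestamps: store one base value plus tiny
# per-sample differences instead of a full 8-byte integer per sample.
def delta_encode(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return timestamps[0], deltas

def delta_decode(base, deltas):
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1700000000, 1700000010, 1700000020, 1700000030]   # 10s scrape interval
base, deltas = delta_encode(ts)
assert deltas == [10, 10, 10]          # constant deltas run-length-encode well
assert delta_decode(base, deltas) == ts
```

With a fixed scrape interval the deltas are a constant, so run-length encoding collapses them to almost nothing. This only works because arrival order is guaranteed.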

Compaction. In-memory writes are periodically flushed to immutable files on disk. Background compaction merges small files into large ones, dropping expired data and recompressing. This is what prevents read amplification — queries always hit large, sorted, contiguous files rather than thousands of small ones.

Push-Based vs. Pull-Based TSDBs

InfluxDB excels in high-write scenarios, Prometheus is ideal for monitoring and alerting, and TimescaleDB offers the robustness of a relational database along with time-series capabilities.

A brief comparison for your interview:

| Database | Model | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Prometheus | Pull | Built-in alerting, PromQL, simple ops | Not designed for long-term storage; single-node by default |
| InfluxDB | Push | High write throughput, SQL-like queries, clustering | Clustering is commercial only |
| VictoriaMetrics | Push + Pull | Handles millions of metrics/second, excellent compression, Prometheus-compatible | Less ecosystem maturity |
| OpenTSDB | Push | Horizontally scalable via HBase/Cassandra | Operationally complex; storage sits on top of Hadoop/HBase, so you must accept the burden of running that cluster |

For a new system: Prometheus + remote write to VictoriaMetrics or Thanos for long-term storage is a modern, widely used stack that doesn't require managing a full HBase cluster.


The Cardinality Problem — The Silent Killer

This deserves its own section because it catches experienced engineers off-guard, and interviewers at Datadog explicitly ask about it.

What is cardinality? In a time-series context, cardinality is the number of unique time series in your database. Each unique combination of a metric name and its label values is a separate series.

plaintext
http.requests{service="auth", host="web-01", region="us-east"}    → series 1
http.requests{service="auth", host="web-02", region="us-east"}    → series 2
http.requests{service="api",  host="web-01", region="eu-west"}    → series 3

With 10 services × 100 hosts × 3 regions × 5 environments = 15,000 unique series for one metric. That's fine.

Now someone adds a user_id label to track per-user request counts. With 1 million users, every existing series can splinter into a million: the 15,000 series above become up to 15 billion. The index that maps label combinations to series data (held in memory in Prometheus) exhausts RAM on the monitoring server within hours. This is a cardinality explosion.

Common sources of cardinality explosion:

  • Labels containing unbounded values: user IDs, session IDs, request IDs, IP addresses
  • Labels containing high-cardinality enums: product SKUs, customer accounts
  • Accidentally emitting labels that vary per request rather than per service

How the system handles it:

At the collection layer, the Metrics Processor drops or aggregates high-cardinality labels before they reach storage:

plaintext
BEFORE:  http.requests{service="auth", user_id="u_82736452"}
AFTER:   http.requests{service="auth"}   (user_id label dropped)

The pipeline should have a configurable cardinality limit per metric. If a metric exceeds the limit (e.g., more than 10,000 unique series), it gets flagged and optionally dropped or aggregated. Alert on cardinality growth — a sudden spike in series count is an instrumentation bug, not a business event.
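A minimal sketch of such a guard, assuming an illustrative denylist of unbounded labels and a per-metric series budget (both would be configuration in a real pipeline):

```python
# Ingestion-side cardinality guard: strip known unbounded labels, then refuse
# new series once a metric exceeds its budget. Limits here are illustrative.
UNBOUNDED_LABELS = {"user_id", "session_id", "request_id", "ip"}
SERIES_LIMIT = 10_000

seen_series = {}   # metric name -> set of label tuples seen so far

def admit(name, labels):
    # Drop unbounded labels before they can create new series.
    labels = {k: v for k, v in labels.items() if k not in UNBOUNDED_LABELS}
    series_key = tuple(sorted(labels.items()))
    known = seen_series.setdefault(name, set())
    if series_key not in known and len(known) >= SERIES_LIMIT:
        return None                     # over budget: drop, and alert on it
    known.add(series_key)
    return labels

out = admit("http.requests", {"service": "auth", "user_id": "u_82736452"})
assert out == {"service": "auth"}   # the unbounded label never reaches storage
```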

Cardinality is one of those topics where interviewers at Datadog and Google specifically probe to separate candidates who've genuinely thought about production observability from those who've only read about it. Being able to explain the index exhaustion mechanism, not just name the problem, is what earns the follow-up questions. If that explanation isn't crisp yet, Mockingly.ai is a good place to pressure-test it.


Storage: Hot, Warm, Cold Tiering

Keeping all data on fast NVMe SSDs forever is economically unviable. A good metrics system uses tiered storage so that the cost of retaining a day of data falls steeply as it ages, instead of every retained day costing hot-tier prices.

The Tiering Model

Hot tier (last 24–72 hours)

  • Storage: NVMe SSD
  • Full resolution — every raw data point (e.g., 10-second intervals)
  • Sub-millisecond query latency
  • Most expensive per GB; kept small

Warm tier (3 days to 30 days)

  • Storage: HDD or standard object storage with SSD cache
  • Downsampled to 1-minute resolution (more on this below)
  • Query latency: tens to hundreds of milliseconds
  • ~6x cheaper than hot storage per GB

Cold tier (30 days to 2 years)

  • Storage: S3, GCS, or Glacier-class object storage
  • Downsampled to 1-hour resolution
  • Query latency: seconds to minutes for large ranges
  • ~50–100x cheaper than hot storage per GB

Practical example: 200,000 data points/second × 86,400 seconds/day × 50 bytes = ~864 GB/day raw. With tiering:

  • Hot tier holds 3 days: ~2.6 TB on NVMe
  • Warm tier holds 27 days: ~4 TB downsampled on HDD
  • Cold tier holds 335 days: ~1.5 TB heavily downsampled on S3

Without tiering you'd need 315 TB of NVMe per year. With tiering: under 10 TB. The cost difference is several orders of magnitude.

Downsampling: The Key to Tiering

Downsampling reduces the number of data points by computing aggregates over time windows. A week-old data point doesn't need 10-second granularity — a 1-minute average is fine for trend analysis and most alerting.

A background compaction job runs on a schedule:

plaintext
Every 5 minutes:
  For all metrics older than 72 hours:
    Group by (metric_name, labels, 1-minute window)
    Compute: min, max, sum, count, p99
    Store aggregated row, delete raw rows

The count is stored alongside the aggregate so you can correctly compute averages across windows when querying. If you just store avg, you can't combine two windows correctly — you need sum / count.
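A worked example makes this concrete. The numbers below are made up to show the failure mode: averaging two window averages is only correct when the windows hold the same number of points.

```python
# Why downsampling stores sum and count: averages of averages are wrong
# whenever windows contain different numbers of points.
w1 = {"sum": 300.0, "count": 3}     # three samples averaging 100
w2 = {"sum": 1800.0, "count": 9}    # nine samples averaging 200

naive = (w1["sum"] / w1["count"] + w2["sum"] / w2["count"]) / 2
correct = (w1["sum"] + w2["sum"]) / (w1["count"] + w2["count"])

assert naive == 150.0     # treats both windows as equal weight -- wrong
assert correct == 175.0   # weights each sample equally -- right
```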

A common policy keeps raw data in the hot tier for roughly three days, then downsamples to 1-minute and later 1-hour aggregates as the data ages. This is reasonable to present in an interview, with the caveat that the exact thresholds are configurable.


Log Storage: A Different Path

Logs share the Kafka ingestion buffer but diverge immediately after.

The log processor parses raw lines into structured events: extract timestamp, severity, service name, trace ID, and any other fields. This structured form is what gets indexed.

Elasticsearch is the standard storage engine for logs. Its inverted index makes full-text search ("show me all logs containing 'NullPointerException'") fast. Hot-tier nodes take the indexing load and hold the most recent, most frequently accessed data; warm-tier nodes hold older indices that are queried less often.

A typical log tiering policy:

  • Hot (last 7 days): full index on NVMe, sub-second search
  • Warm (7–30 days): compressed index on HDD, 1–5 second search
  • Cold (30–90 days): compressed snapshots on S3, minutes to restore and search
  • Delete at 90 days (or archive to Glacier for compliance)

The log retention period is a function of storage cost and compliance requirements. 7 days of hot storage for a 10,000-service company at 1 GB/second of logs is already 604 TB. This math forces you to either be selective about what you index (drop debug-level logs in production) or accept substantial infrastructure cost.


The Query Service

Raw storage isn't enough — you need a layer that makes queries fast and usable.

The Query Service sits between storage and consumers (dashboards, alert engine, API). It does four things:

1. Routing. A query for the last 30 minutes routes to the hot tier. A query for last month routes to cold tier. A query spanning both gets federated and merged. The caller doesn't think about tiers.

2. Caching. Dashboard queries for the last hour fire on every refresh. Cache these results in Redis with a 60-second TTL. A dashboard refreshing every 30 seconds doesn't need to re-query the TSDB for data that hasn't changed.

3. Query rewriting. Queries over large time ranges can be expensive. The Query Service rewrites them to use pre-computed downsampled data when possible. A query asking for daily averages over the last year should hit 1-hour-rollup data, not raw 10-second data.

4. Rate limiting. An engineer accidentally writing a query that joins 10 years of raw metrics can take down the query layer. Per-user query cost limits and timeouts protect everyone else.
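The routing step can be sketched as a function from a query window to the set of tiers it touches. The boundaries below mirror the tiering policy described earlier (3 days hot, 30 days warm, older data cold); the function itself is an illustration, not any particular product's router:

```python
# Route a query time window to storage tiers by data age. A query spanning
# a boundary is fanned out to every tier it touches; the caller merges results.
HOT_S, WARM_S = 3 * 86_400, 30 * 86_400   # hot: < 3 days old; warm: < 30 days

def tiers_for(now, start, end):
    tiers = []
    if end > now - HOT_S:
        tiers.append("hot")
    if start < now - HOT_S and end > now - WARM_S:
        tiers.append("warm")
    if start < now - WARM_S:
        tiers.append("cold")
    return tiers

now = 1_700_000_000
assert tiers_for(now, now - 1800, now) == ["hot"]                # last 30 min
assert tiers_for(now, now - 7 * 86_400, now) == ["hot", "warm"]  # last week
```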


The Alerting Pipeline

Alerting is where the system becomes operationally useful — and where subtle bugs have serious consequences. A false negative (missing a real incident) or a false positive (paging engineers at 3 AM for nothing) both erode trust in the system.

The Two Evaluation Approaches

Threshold alerting: fire when a metric crosses a fixed value. cpu.usage > 90% for 5 minutes. Simple, predictable, low latency.

Anomaly detection: fire when a metric deviates from its expected pattern. More sophisticated — accounts for daily and weekly seasonality, doesn't require manual threshold tuning — but harder to implement correctly and more prone to unexpected behaviour.

For the interview, describe threshold alerting in detail and mention anomaly detection as an extension.

The Alerting Pipeline End-to-End

plaintext
1. Alert rules stored in the database
   Rule: "cpu.usage{service="api"} > 90% for 5 minutes"
 
2. Alert Evaluator runs continuously (every 30 seconds)
   → Queries TSDB for the relevant metric
   → Checks if the condition has been true for the specified duration
   → If yes, transitions the alert from PENDING to FIRING
 
3. Firing alert published to Kafka topic: "alert-events"
 
4. Notification Router consumes from "alert-events"
   → Looks up on-call rotation (PagerDuty, OpsGenie)
   → Sends notification via configured channel:
      - Email for low-priority alerts
      - Slack/Teams for medium-priority
      - PagerDuty/phone for critical
   → Records the notification in the alert history
 
5. Alert resolves when condition is no longer true
   → Resolution notification sent to acknowledge the incident is over

The "for 5 minutes" duration requirement is important. Don't alert on a single data point breaching a threshold. A transient spike to 91% CPU for 10 seconds doesn't need to wake anyone up. The alert fires only after the condition has been continuously true for the specified window. This single requirement eliminates the majority of false positives.

Alert deduplication: if 50 hosts all breach their CPU threshold simultaneously (a deployment went wrong), the system should fire one grouped alert — not 50 separate pages. Group by service, environment, or datacenter before routing to on-call.

Dead man's switch: a special alert that fires if no data is received from a source for N minutes. If a host stops sending metrics entirely, it might be down — or the monitoring agent might have crashed. Dead man's switch alerts catch both.

The alerting pipeline section — especially the "for 5 minutes" requirement, deduplication, and dead man's switch — is where senior engineers distinguish themselves. Knowing the mechanism is different from being able to explain why each piece exists under interview conditions. Mockingly.ai has system design simulations built around exactly this question for engineers targeting Datadog, Google, Amazon, and Meta.


Scaling Discussion

Kafka Partitioning for Scale

At 200,000 data points per second, Kafka needs enough partitions to distribute write load. Partition by (metric_name + label_hash) % num_partitions. This ensures related metrics land on the same partition — good for stream aggregation — while distributing load across the cluster.

TSDB Horizontal Scaling

Single-node Prometheus handles roughly 10 million active series. Beyond that, you need federation.

Prometheus federation: a top-level Prometheus scrapes aggregated metrics from multiple child Prometheus servers. Each child covers a subset of targets. Works well but creates hierarchy complexity.

Remote write to VictoriaMetrics or Thanos: Prometheus instances write to a shared, scalable long-term storage backend. VictoriaMetrics ingests millions of samples per second with strong compression, and published benchmarks generally show it outperforming both single-node Prometheus and InfluxDB on write throughput. This is the modern recommendation for large-scale deployments.

Query Scaling

Cache aggressively. Rate limit expensive queries. Shard the Query Service by metric namespace (infrastructure metrics on one shard, application metrics on another). Pre-compute common aggregations as materialised views.


Common Interview Follow-ups

"How would you detect anomalies rather than just threshold breaches?"

Anomaly detection requires a baseline. For each metric, maintain a rolling statistical model of expected behaviour — typically a time-windowed mean and standard deviation, accounting for daily and weekly patterns. A metric is anomalous if it deviates more than N standard deviations from its expected value at that time of day. For more sophisticated detection, ARIMA or STL decomposition models handle seasonality better. Store the model parameters per metric; update them continuously as new data arrives. This is computationally non-trivial at 200,000 series but parallelises well.
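The simplest version of that baseline check fits in a few lines. This sketch omits the seasonality handling described above (a production model would keep separate baselines per time-of-day and day-of-week bucket); the sample values are made up:

```python
# Rolling z-score anomaly check: flag a point deviating more than N standard
# deviations from the recent window. Seasonality handling deliberately omitted.
import statistics

def is_anomalous(window, value, n_sigma=3.0):
    mean = statistics.fmean(window)
    stdev = statistics.pstdev(window)
    if stdev == 0:
        return value != mean          # flat baseline: any change is anomalous
    return abs(value - mean) > n_sigma * stdev

baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
assert not is_anomalous(baseline, 104)     # within normal variation
assert is_anomalous(baseline, 160)         # far outside 3 sigma
```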

"What happens if the TSDB goes down during a write spike?"

Kafka absorbs it. Collectors continue writing to Kafka. Messages accumulate in the topic. When the TSDB recovers, consumers resume from their last committed offset and catch up. No data is lost as long as the Kafka retention window (typically 7 days) hasn't been exceeded. This is the main reason Kafka is in the architecture, not an optimisation.

"How do you handle late-arriving data — a metric that shows up with a timestamp from 2 minutes ago?"

Late data is normal in distributed systems. The TSDB must handle out-of-order writes. InfluxDB and VictoriaMetrics both accept data with timestamps up to a configurable window in the past (typically minutes, not days). For the alert evaluator, this means a query window should look slightly further back than the strict alert duration to catch late-arriving data points. The trade-off: too long a lookback adds latency to alert firing; too short risks missing points.

"How would you design this as a multi-tenant SaaS (like Datadog)?"

Multi-tenancy adds namespace isolation — every metric, query, and alert must be scoped to a tenant. At the storage level, prefix all metric names and log indices with a tenant ID. At the Kafka level, use per-tenant topics or a single topic with tenant ID in the message. At the query level, inject the tenant filter on every request — the caller should never be able to query another tenant's data. At the billing level, meter per metric series per hour and per GB of logs indexed. Rate limit ingestion per tenant to prevent noisy neighbours from affecting shared infrastructure.

"Why would you choose InfluxDB over Prometheus for this design?"

Prometheus is optimised for the pull model and has a relatively constrained data model — Prometheus only collects aggregated time series data, not raw events, and a single Prometheus server can monitor over 10,000 machines. It's excellent for infrastructure monitoring at moderate scale. InfluxDB accepts pushed data and handles significantly higher ingest rates per node. For a system where services push metrics at high frequency — think StatsD-style counters from millions of events per second — InfluxDB's TSM engine and push model are a better fit. For a system built around Kubernetes where scraping is natural, Prometheus is simpler to operate.


Quick Interview Checklist

  • ✅ Distinguished metrics from logs — different patterns, different storage
  • ✅ Explained push vs. pull with real trade-offs, not just named them
  • ✅ Kafka as the ingestion buffer — absorbs write spikes, enables multiple consumers, no data loss on downstream failure
  • ✅ Kafka partitioned by metric name/label hash — same metric always on same partition
  • ✅ Explained why a TSDB exists — LSM-based storage, delta encoding, time-based sharding, column orientation
  • ✅ Compared Prometheus / InfluxDB / VictoriaMetrics / OpenTSDB — named a recommendation with justification
  • ✅ Cardinality problem — what it is, how it breaks systems, how to prevent it
  • ✅ Hot/warm/cold tiering — showed the math that makes tiering necessary
  • ✅ Downsampling with min/max/sum/count — explained why count must be stored
  • ✅ Log storage on Elasticsearch with ILM tiering
  • ✅ Query service — routing, caching, rewriting, rate limiting
  • ✅ Alert pipeline — evaluation interval, "for duration" requirement, deduplication, dead man's switch
  • ✅ Horizontal scaling — Prometheus federation or remote write to VictoriaMetrics/Thanos
  • ✅ Late-arriving data handling

Conclusion

This is one of the richest system design problems you can get in an interview. It touches almost every major concept: high-throughput ingestion, specialised storage engines, data tiering, stream processing, query optimisation, and real-time alerting. Companies like Datadog, Google, Amazon, and Meta ask it precisely because there's so much depth to explore.

The candidates who do well aren't the ones who know all the product names. They're the ones who can explain why a time-series database exists, why cardinality is dangerous, why you downsample rather than just delete old data, and why the alert rule needs a "for 5 minutes" qualifier. The reasoning behind the design is what the interviewer is testing, not the diagram.

The design pillars:

  1. Separate metrics and logs — different access patterns demand different storage engines from the start
  2. Kafka as the ingestion buffer — decouples collection from storage; no data loss when the TSDB is slow
  3. Time-series database with LSM/TSM storage — the only way to absorb 200K data points per second economically
  4. Cardinality discipline — one unbounded label can break the entire metrics system; enforce limits at ingestion
  5. Hot/warm/cold tiering — without it, storage costs grow linearly with time; with it, they plateau
  6. Downsampling stores min/max/sum/count — sum and count, not average, so you can combine windows correctly
  7. Alert deduplication and "for duration" — the two requirements that turn a noisy alerting system into a trusted one


Frequently Asked Questions

What is the difference between metrics and logs in a monitoring system?

Metrics are numerical measurements sampled at regular intervals. Logs are variable-length text events generated when something happens. They require different storage engines, different query patterns, and different retention strategies.

| | Metrics | Logs |
| --- | --- | --- |
| Format | Name + labels + timestamp + numeric value | Structured or unstructured text |
| Frequency | Regular intervals (every 10–60 seconds) | Event-driven (irregular) |
| Size per data point | ~50 bytes | 100 bytes – several KB |
| Primary query | "What was CPU over the last hour?" | "Show me all errors containing 'timeout'" |
| Storage engine | Time-series database (InfluxDB, Prometheus) | Inverted index (Elasticsearch) |
| Retention | 1–2 years with downsampling | 30–90 days, compliance-driven |

Why this distinction matters in an interview:

  1. Designing them with the same storage path is a mistake — the access patterns are incompatible
  2. They share the Kafka ingestion buffer but diverge immediately after into separate processing pipelines
  3. The cost implications are different: log storage at raw retention is catastrophically expensive; metrics storage with downsampling is manageable

Why use a time-series database instead of PostgreSQL for metrics?

A time-series database (TSDB) is optimised for the exact access pattern that metrics produce — append-only writes in timestamp order, time-range reads, and high compression. PostgreSQL handles this at small scale but degrades badly at 200,000+ data points per second.

Why PostgreSQL fails for metrics at scale:

  1. Write amplification — every insert updates a B-tree index. At 200,000 inserts/second, this generates enormous write I/O
  2. Row-oriented storage — a query for "CPU values over the last hour" reads every column in every row, even though only timestamp and value matter
  3. Repeated strings — the metric name cpu.usage is stored as a string in every single row, multiplied across millions of data points

How a TSDB fixes each problem:

  1. LSM-tree storage — writes are sequential appends to a log; no B-tree updates. Writes are fast by construction
  2. Column-oriented storage — all values for one metric are stored contiguously. A time-range query reads one continuous block from disk
  3. Dictionary encoding — metric names are stored once in a dictionary; data files reference the dictionary ID, not the string
  4. Delta encoding — since timestamps are sequential and values often change slowly, delta encoding achieves ~1.3 bytes per sample vs 12 bytes raw
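The timestamp side of that compression can be illustrated with delta-of-delta encoding, the technique popularised by Facebook's Gorilla system and used in the Prometheus TSDB. A minimal sketch — real implementations then bit-pack these small integers:

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Encode timestamps as: first value, first delta, then delta-of-deltas.

    Scrapes at a fixed interval (e.g. every 10s) produce long runs of
    zeros, which bit-packing then shrinks to about one bit per sample.
    """
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

# Regular 10-second scrapes collapse to almost nothing:
# delta_of_delta([1000, 1010, 1020, 1030, 1040]) → [1000, 10, 0, 0, 0]
```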

What is the cardinality problem in metrics monitoring?

Cardinality is the number of unique time series in a metrics system. Each unique combination of metric name and label values is a separate series. Too many series exhaust RAM and crash the monitoring system.

How cardinality explodes:

```plaintext
http.requests{service="auth", host="web-01", region="us-east"}  → 1 series
http.requests{service="auth", host="web-02", region="eu-west"}  → 1 series
...1,000 hosts × 3 regions × 5 services = 15,000 series → manageable

Add label: user_id="u_82736452"
1,000,000 users × 1,000 hosts × 3 regions × 5 services = 15 billion series → system crash
```

Why it's dangerous:

  1. Prometheus stores its series index in memory — one series per entry
  2. A billion series exhausts the RAM of the monitoring server within hours
  3. Query performance degrades as the index grows, even before RAM is exhausted
  4. Adding a single unbounded label can cause this overnight

How to prevent it:

  1. Block high-cardinality labels at ingestion — drop user_id, request_id, session_id, ip_address labels before they reach the TSDB
  2. Enforce cardinality limits per metric — if a metric exceeds 10,000 unique label combinations, flag and drop the excess
  3. Alert on cardinality growth — a sudden spike in series count is an instrumentation bug, not a business event
  4. Use logs for high-cardinality data — per-user event data belongs in logs, not metrics

What is the difference between push and pull metrics collection?

Pull collection (Prometheus-style) means the monitoring system scrapes metrics from service endpoints on a schedule. Push collection (StatsD/Datadog-style) means services send metrics to a collector proactively.

| | Pull (Prometheus) | Push (StatsD / Datadog agent) |
| --- | --- | --- |
| Who initiates | Collector scrapes the service | Service sends to collector |
| Failure detection | Immediate — scrape failure = service down | Delayed — silence may mean down or idle |
| Ephemeral services | Problem — job may die before scrape | Natural — job pushes before it exits |
| Firewall/NAT | Problematic — collector must reach service | Works — service initiates outbound connection |
| Debugging | Easy — curl /metrics shows current state | Harder — no local endpoint to inspect |
| High-frequency events | Not suitable — aggregated snapshots only | Natural — buffer locally, send summary |

The hybrid approach:

  1. Use pull for long-lived services in Kubernetes — elegant service discovery, immediate health detection
  2. Use push for batch jobs, serverless functions, and services behind firewalls
  3. Kafka sits in the middle as the buffer regardless of collection model
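For the push side, the StatsD wire format is simple enough to sketch in a few lines. The `"<metric>:<value>|c"` counter syntax is the real StatsD line protocol; the host and port defaults are illustrative:

```python
import socket

def statsd_line(name: str, value: int, metric_type: str = "c") -> str:
    # StatsD line protocol: "<metric>:<value>|<type>" — "c" means counter
    return f"{name}:{value}|{metric_type}"

def push_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP: the service never blocks on the collector,
    # which is exactly the trade-off push-based collection makes
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_line(name, value).encode(), (host, port))
    finally:
        sock.close()
```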

How does hot/warm/cold storage tiering work for metrics?

Tiered storage keeps recent high-resolution data on fast, expensive hardware and progressively downsamples older data onto cheaper storage. Without tiering, storage costs grow linearly with time at raw resolution — economically unviable at scale.

The three tiers:

| Tier | Age | Resolution | Storage | Latency |
| --- | --- | --- | --- | --- |
| Hot | Last 24–72 hours | Full resolution (10s) | NVMe SSD | Sub-millisecond |
| Warm | 3–30 days | 1-minute aggregates | HDD / standard object storage | 10–100ms |
| Cold | 30 days – 2 years | 1-hour aggregates | S3 / Glacier | Seconds – minutes |

The cost impact at scale:

Without tiering: 200,000 data points/sec × 365 days = ~315 TB of NVMe per year. With tiering: hot tier ~2.6 TB + warm tier ~4 TB + cold tier ~1.5 TB = under 10 TB total.

The cost difference is 30–50×. Tiering is not an optimisation — it's what makes the system economically viable.
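The numbers above can be reproduced directly. This sketch assumes ~50 bytes per stored data point and treats aggregate points as the same size, so the figures are order-of-magnitude, not exact:

```python
POINT_BYTES = 50          # rough size of one stored data point
RATE = 200_000            # data points ingested per second
DAY = 86_400              # seconds per day

def tb(n_bytes: float) -> float:
    return n_bytes / 1e12

# No tiering: raw 10s resolution kept on NVMe for a full year
raw_year = RATE * POINT_BYTES * DAY * 365                 # ≈ 315 TB

# With tiering (resolutions as in the hot/warm/cold breakdown)
hot  = RATE * POINT_BYTES * DAY * 3                       # 72h at full resolution
warm = (RATE / 6) * POINT_BYTES * DAY * 27                # 10s → 1-min = 6× fewer points
cold = (RATE / 360) * POINT_BYTES * DAY * 700             # 10s → 1-hour = 360× fewer

print(f"raw: {tb(raw_year):.0f} TB, tiered: {tb(hot + warm + cold):.1f} TB")
```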


Why store min, max, sum, and count when downsampling — not just the average?

Storing only the average is lossy in a way that corrupts subsequent aggregations. To correctly combine multiple downsampled windows, you need sum and count, not avg.

Why average alone breaks:

```plaintext
Window 1 (10 samples): avg = 80%, sum = 800, count = 10
Window 2 (90 samples): avg = 20%, sum = 1800, count = 90

Correct combined average: (800 + 1800) / (10 + 90) = 26%
Naive average of averages: (80 + 20) / 2 = 50%  ← wrong
```

The four values each serve a purpose:

  1. sum — required to compute correct averages across combined windows
  2. count — required to weight the average correctly (see above)
  3. min — preserves spike detection; a brief minimum value disappears in an average
  4. max — preserves peak detection; a brief maximum is hidden by averaging

If you store only avg, a CPU spike that lasted 30 seconds in an hour window is smoothed away. Storing max catches it even at 1-hour resolution.
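A sketch of the merge rule — combining two downsampled windows without ever touching a stored average (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Window:
    lo: float     # min observed in the window
    hi: float     # max observed in the window
    total: float  # sum of samples
    count: int    # number of samples

def merge(a: Window, b: Window) -> Window:
    # min and max survive merging exactly; a stored average would not
    return Window(min(a.lo, b.lo), max(a.hi, b.hi),
                  a.total + b.total, a.count + b.count)

def average(w: Window) -> float:
    return w.total / w.count
```

Re-running the worked example: merging (sum=800, count=10) with (sum=1800, count=90) gives 2600 / 100 = 26%, where averaging the two averages would have given 50%.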


How does the alerting pipeline work end-to-end?

The alerting pipeline evaluates metric conditions on a schedule, routes fired alerts to the right on-call channel, deduplicates related alerts, and confirms resolution when conditions clear.

The five stages:

  1. Rule storage — alert rules are persisted in a database: "cpu.usage{service='api'} > 90% for 5 minutes"
  2. Alert Evaluator — runs every 30 seconds, queries the TSDB, checks if the condition has been continuously true for the specified duration. Transitions from PENDING to FIRING only after the full duration passes
  3. Kafka routing — fired alerts are published to a Kafka alert-events topic, decoupling evaluation from notification
  4. Notification Router — consumes from alert-events, looks up on-call rotation, routes to the right channel (email / Slack / PagerDuty / phone) based on severity
  5. Resolution — when the condition is no longer true, a resolution notification is sent to close the incident

Three critical design decisions:

  1. "For duration" requirement — requiring 5 consecutive minutes at threshold before firing eliminates the majority of false positives from transient spikes
  2. Alert deduplication — 50 hosts all breaching CPU threshold simultaneously should produce one grouped alert, not 50 separate pages. Group by service, datacenter, or environment
  3. Dead man's switch — fires if no data is received from a source for N minutes. Catches silent failures where the monitoring agent itself has crashed
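Stage 2 — the PENDING → FIRING transition gated by a "for duration" — reduces to a small state machine per rule. A sketch with illustrative names:

```python
class ThresholdRule:
    """Fires only after the condition holds continuously for `for_seconds`."""

    def __init__(self, threshold: float, for_seconds: float):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.pending_since: float | None = None
        self.state = "OK"

    def evaluate(self, value: float, now: float) -> str:
        if value > self.threshold:
            if self.pending_since is None:
                self.pending_since = now        # first breach: start the clock
                self.state = "PENDING"
            elif now - self.pending_since >= self.for_seconds:
                self.state = "FIRING"           # breached for the full duration
        else:
            self.pending_since = None           # any dip below resets the clock
            self.state = "OK"                   # transition out of FIRING = resolution
        return self.state
```

A 30-second transient spike takes the rule to PENDING and straight back to OK — no page is ever sent.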

What is a dead man's switch alert and why is it important?

A dead man's switch is an alert that fires when it stops receiving data — the opposite of a threshold alert. It catches the failure mode where a host or service goes completely silent.

The problem it solves:

  1. A host loses its monitoring agent — metrics stop flowing
  2. A threshold alert never fires, because there are no data points to evaluate
  3. The host could be completely down, and the monitoring system says nothing

How it works:

  1. Each monitored source sends a "heartbeat" metric at regular intervals (e.g., every 60 seconds)
  2. The dead man's switch rule fires if no heartbeat arrives within 2–3× the expected interval
  3. The alert reads: "No data received from host=web-47 for the last 3 minutes"

Why this matters in system design:

A monitoring system that only alerts on threshold breaches has a blind spot: silent failures. The dead man's switch closes that gap. It's worth raising proactively in interviews — it shows you understand the difference between "metric is high" and "metric is missing."
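The core check is tiny — track the last heartbeat per source and flag anything silent beyond a grace multiple of the expected interval. Names and defaults here are illustrative:

```python
def silent_sources(last_heartbeat: dict[str, float], now: float,
                   interval: float = 60.0, grace: float = 3.0) -> list[str]:
    """Return sources whose last heartbeat is older than grace × interval."""
    return [src for src, ts in sorted(last_heartbeat.items())
            if now - ts > grace * interval]
```

With a 60-second heartbeat and a 3× grace period, a host silent for more than 3 minutes is reported — matching the example alert text above.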


Which companies ask the distributed metrics and logging system design question?

Datadog, Google, Amazon, Meta, LinkedIn, Stripe, and Uber ask variants of this question for senior software engineer and infrastructure roles.

Why it is a particularly probing interview question:

  1. Multiple hard sub-problems — cardinality, tiering, downsampling, and alerting correctness are each non-trivial on their own
  2. Reveals operational experience — candidates who've actually operated monitoring systems at scale know the cardinality problem, the "for duration" requirement, and dead man's switches from painful experience
  3. Scales to seniority — a junior answer describes "collect metrics, store in InfluxDB, build dashboards"; a senior answer explains LSM-tree storage, cardinality explosion, why sum/count beats avg, and alert deduplication

What interviewers specifically listen for:

  1. Distinguishing metrics from logs — and separating their storage paths immediately
  2. Cardinality named proactively — before the interviewer asks
  3. Downsampling stores sum and count, not avg — the specific reason, not just "we downsample old data"
  4. "For duration" on alerts — connecting this to false positive reduction
  5. Dead man's switch — shows depth beyond threshold-based alerting

The depth of this question means that reading about it and being able to talk through it under interview conditions are two very different skills. If you want to find out where your understanding actually breaks down before the real interview, Mockingly.ai has system design simulations built specifically for engineers preparing for senior roles at companies like Datadog, Google, Amazon, and Meta.
