Lesson 5

Lag Management and Performance Monitoring

Master Kafka lag management: consumer lag metrics, monitoring with Prometheus/Grafana, alerting strategies, and performance optimization.

2-3 hours advanced Module 2: Performance

What is Consumer Lag?

Consumer lag = How far behind your consumer is from the latest message in a partition.

Partition 0: [msg-0] [msg-1] [msg-2] [msg-3] [msg-4] [msg-5] [msg-6]
                                        ↑                       ↑
                                 consumer offset          latest offset
                                       (3)                     (6)

                                 Lag = 6 - 3 = 3 messages

Why lag matters:

  • High lag = Slow processing = Data staleness
  • Zero lag = Real-time processing = Good
  • Negative lag = Consumer ahead of producer = Impossible in a healthy system (usually a measurement race between the two offset reads, or a bug)

The 7 Common Lag Patterns

Here are the patterns I see in production and how to fix them:

1. The Steady Climb

Lag: 0 → 100 → 500 → 1000 → 5000 → 10000...
Cause: Consumer can't keep up with producer rate
Fix: Scale consumers or optimize processing

2. The Spiky Pattern

Lag: 0 → 0 → 0 → 5000 → 0 → 0 → 0...
Cause: Bursty traffic or slow processing spikes
Fix: Increase batch size, add buffering

3. The Staircase

Lag: 0 → 1000 → 1000 → 1000 → 2000 → 2000...
Cause: Consumer stuck on specific messages
Fix: Dead letter queue, skip problematic messages

4. The Roller Coaster

Lag: 0 → 5000 → 0 → 8000 → 0 → 12000...
Cause: Rebalancing storms
Fix: Static membership, cooperative rebalancing

5. The Plateau

Lag: 0 → 10000 → 10000 → 10000 → 10000...
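Cause: Consumer keeps pace with new messages but never drains the backlog, or it crashed while the producer is idle
Fix: Health checks, auto-restart, temporary extra capacity to drain the backlog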

6. The Sawtooth

Lag: 0 → 1000 → 0 → 1000 → 0 → 1000...
Cause: Normal batch processing
Fix: Nothing - this is healthy

7. The Explosion

Lag: 0 → 0 → 0 → 1000000 → 2000000...
Cause: Consumer completely stopped
Fix: Emergency scaling, circuit breakers
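
Some of these shapes can be told apart mechanically from a series of lag readings. The following is a minimal sketch, not a production detector: the function name classify_lag, the tolerance parameter, and the labels "flat", "draining", and "oscillating" are my own assumptions for illustration. It only separates lag that never recovers (steady climb, plateau, staircase) from lag that does; telling a sawtooth from a roller coaster or an explosion would also need magnitude thresholds.

```python
def classify_lag(samples, tolerance=0):
    """Give a coarse label to a time-ordered list of lag readings.

    tolerance: absolute change treated as "no change" (hypothetical knob
    for noisy real-world readings; 0 means exact comparison).
    """
    if len(samples) < 3:
        return "insufficient data"

    # Change between consecutive readings.
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    rises = sum(1 for d in deltas if d > tolerance)
    falls = sum(1 for d in deltas if d < -tolerance)
    flats = len(deltas) - rises - falls

    if rises == 0 and falls == 0:
        return "flat"
    if falls == 0:
        # Lag never recovers within the window.
        if flats == 0:
            return "steady climb"
        if rises == 1:
            return "plateau"
        return "staircase"
    if rises == 0:
        return "draining"
    # Lag both rises and falls: sawtooth, spiky, or roller coaster.
    return "oscillating"


print(classify_lag([0, 100, 500, 1000, 5000, 10000]))
print(classify_lag([0, 1000, 1000, 1000, 2000, 2000]))
```

Run against the example series from patterns 1 and 3, this should label them "steady climb" and "staircase".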

Measuring Lag: The Right Way

There are multiple ways to measure lag, each with trade-offs:

1. JMX Metrics (Most Common)

# Per partition
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*
records-lag-max
records-lag-avg

# Per consumer client (group coordinator metrics)
kafka.consumer:type=consumer-coordinator-metrics,client-id=*
rebalance-rate-per-hour

2. Kafka Admin API

from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Get consumer offsets: {TopicPartition: OffsetAndMetadata}
consumer_offsets = admin.list_consumer_group_offsets('my-group')

# Get latest (log end) offsets for the same partitions: {TopicPartition: int}
latest_offsets = consumer.end_offsets(list(consumer_offsets.keys()))

# Calculate lag
for tp, committed in consumer_offsets.items():
    lag = latest_offsets[tp] - committed.offset
    print(f"{tp.topic}[{tp.partition}]: {lag} messages behind")

3. Burrow (LinkedIn’s Tool)

# Install Burrow
go get github.com/linkedin/Burrow

# Start Burrow
./burrow --config-dir=./config

# Check lag via HTTP API
curl http://localhost:8000/v3/kafka/local/consumer/my-group/lag

Setting Up Prometheus + Grafana

The gold standard for Kafka monitoring:

1. JMX Exporter

# Attach jmx_prometheus_javaagent.jar to the broker JVM via KAFKA_OPTS
KAFKA_OPTS="-javaagent:jmx_prometheus_javaagent.jar=8080:config.yaml" \
  bin/kafka-server-start.sh config/server.properties

2. Prometheus Config

# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker1:8080', 'broker2:8080', 'broker3:8080']

  - job_name: 'kafka-consumers'
    static_configs:
      - targets: ['consumer1:8080', 'consumer2:8080']

3. Grafana Dashboard

# Key panels to include:
- Consumer lag by group and partition
- Consumer throughput (messages/sec)
- Broker metrics (CPU, memory, disk)
- Rebalance frequency
- Error rates

Alerting Strategies

Don’t just monitor - alert on the right things:

🚨 Critical Alerts

  • Lag > 10,000 messages - Consumer falling behind

  • Zero consumer throughput - Consumer stopped

  • Rebalance frequency > 5/hour - Instability

  • Consumer error rate > 1% - Processing issues

⚠️ Warning Alerts

  • Lag > 1,000 messages - Getting behind

  • Consumer throughput < 80% of baseline - Resource pressure
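
To make the thresholds concrete, here is a minimal sketch of how the lag, stopped-throughput, rebalance, and error-rate rules above could be evaluated in code. The ConsumerStats fields and the evaluate_alerts function are hypothetical names for illustration; in practice these rules would more likely live in your alerting system's rule files than in application code.

```python
from dataclasses import dataclass


@dataclass
class ConsumerStats:
    lag: int                    # total messages behind
    throughput: float           # messages per second
    rebalances_per_hour: float
    error_rate: float           # fraction of failed messages, 0.01 == 1%


def evaluate_alerts(stats):
    """Return (severity, reasons); severity is 'critical', 'warning' or 'ok'."""
    critical = []
    warning = []

    # Lag has two tiers; only the most severe one that matches is reported.
    if stats.lag > 10_000:
        critical.append("lag > 10,000 messages")
    elif stats.lag > 1_000:
        warning.append("lag > 1,000 messages")

    if stats.throughput == 0:
        critical.append("zero consumer throughput")
    if stats.rebalances_per_hour > 5:
        critical.append("rebalance frequency > 5/hour")
    if stats.error_rate > 0.01:
        critical.append("consumer error rate > 1%")

    if critical:
        return "critical", critical + warning
    if warning:
        return "warning", warning
    return "ok", []


print(evaluate_alerts(ConsumerStats(lag=2500, throughput=120.0,
                                    rebalances_per_hour=1, error_rate=0.001)))
```

Returning the list of reasons alongside the severity keeps the alert message actionable: the on-call engineer sees which rule fired, not just that something did.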

Lag-Based Autoscaling with KEDA

Automatically scale consumers based on lag:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: my-topic
        lagThreshold: '1000'
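
To build intuition for what lagThreshold does, here is a rough sketch of the arithmetic: the replica count is approximately the total lag divided by the threshold, rounded up, then clamped to the min/max replica counts. This is an approximation under stated assumptions, not KEDA's actual implementation. The function estimate_replicas and its partitions parameter are hypothetical; the partition cap reflects the fact that consumers beyond the partition count would sit idle. Real scaling also involves polling intervals and cooldown behavior that are ignored here.

```python
import math


def estimate_replicas(total_lag, lag_threshold, min_replicas, max_replicas, partitions):
    """Approximate the replica count a lag-based autoscaler would aim for."""
    # One replica per lag_threshold messages of backlog, rounded up.
    desired = math.ceil(total_lag / lag_threshold)
    # Extra consumers beyond the partition count cannot be assigned work.
    desired = min(desired, partitions, max_replicas)
    # Never go below the configured floor.
    return max(desired, min_replicas)


# Using the values from the manifest above and an assumed 6-partition topic:
print(estimate_replicas(4500, 1000, 1, 10, 6))
print(estimate_replicas(50000, 1000, 1, 10, 6))
```

With 4,500 messages of lag this gives 5 replicas; with 50,000 it stops at 6, the assumed partition count, even though maxReplicaCount is 10. If you want lag-based scaling to go further, the topic needs more partitions.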

End-to-End Latency Measurement

Lag tells you how far behind you are, but not how long a message takes to get from the producer to a processed result:

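import logging
import time
import uuid

logger = logging.getLogger(__name__)

# Add timestamps to your messages (payload comes from your application)
message = {
    'data': payload,
    'produced_at': time.time(),
    'message_id': str(uuid.uuid4()),  # str() keeps the message JSON-serializable
}

# In consumer (do_work is your own processing function)
def process_message(message):
    start_time = time.time()
    produced_at = message['produced_at']

    # Process message
    result = do_work(message['data'])

    # Calculate latencies; end-to-end includes the processing time.
    # Producer and consumer clocks must be in sync for this to be meaningful.
    end_time = time.time()
    end_to_end_latency = end_time - produced_at
    processing_latency = end_time - start_time

    # Log metrics
    logger.info(f"E2E: {end_to_end_latency:.2f}s, Processing: {processing_latency:.2f}s")
    return result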
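
One log line per message is hard to alert on, and averages hide the slow tail. A minimal sketch of summarizing collected latencies into percentiles follows; the helper names percentile and summarize are hypothetical, and it assumes you keep recent latency values in a list (a metrics library histogram would normally do this for you).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies):
    """Summarize a batch of end-to-end latencies (in seconds)."""
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": max(latencies),
    }


# Nine fast messages and one slow outlier (made-up numbers for illustration).
samples = [0.2, 0.3, 0.25, 0.4, 5.0, 0.35, 0.3, 0.28, 0.22, 0.31]
print(summarize(samples))
```

In this made-up batch the median stays at 0.3s while p95 jumps to the 5.0s outlier, which is the kind of tail behavior a mean would blur.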

Debugging Slow Consumers

When lag is high, here’s your debugging checklist:

1. Check Consumer Metrics

  • Are consumers actually running?

  • Are they polling frequently enough?

  • Are they committing offsets?

2. Check Processing Logic

  • Is processing taking too long?

  • Are there database bottlenecks?

  • Are there external API calls?

3. Check Resource Usage

  • CPU usage high?

  • Memory usage high?

  • Network I/O saturated?

4. Check Partition Distribution

  • Are all partitions being consumed?

  • Is one partition hot?

  • Are consumers evenly distributed?
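
The partition distribution questions can be partly automated once you have per-partition lag. Here is a minimal sketch under a few assumptions: the function name find_partition_issues, the input shapes, and the hot_factor of 5 are all hypothetical choices for illustration, and "hot" is defined here simply as lag far above the median partition.

```python
from statistics import median


def find_partition_issues(lag_by_partition, assigned_partitions, hot_factor=5):
    """Flag partitions with no consumer and partitions with outsized lag.

    lag_by_partition: {partition_id: lag in messages}
    assigned_partitions: set of partition ids currently owned by a consumer
    hot_factor: how many times the median lag counts as "hot" (assumed value)
    """
    if not lag_by_partition:
        return {"unconsumed": [], "hot": []}

    # Partitions that exist but have no consumer assigned.
    unconsumed = sorted(p for p in lag_by_partition if p not in assigned_partitions)

    # Compare each partition with the typical one; floor at 1 to avoid
    # flagging everything when the median lag is zero.
    typical = max(median(lag_by_partition.values()), 1)
    hot = sorted(p for p, lag in lag_by_partition.items() if lag > hot_factor * typical)

    return {"unconsumed": unconsumed, "hot": hot}


lag = {0: 100, 1: 120, 2: 9000, 3: 110}
print(find_partition_issues(lag, assigned_partitions={0, 1, 2}))
```

For this sample input, partition 2 is reported as hot and partition 3 as unconsumed. A hot partition often points at a skewed message key, which adding consumers will not fix.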

Production Monitoring Checklist

✅ Must Have

  • Consumer lag monitoring (per partition)

  • Consumer throughput tracking

  • Error rate monitoring

  • Rebalance frequency alerts

  • End-to-end latency tracking

✅ Nice to Have

  • Consumer group health dashboard

  • Partition distribution visualization

  • Historical lag trends

  • Automated scaling based on lag

  • Dead letter queue monitoring

Key Takeaways

  1. Monitor lag patterns, not just numbers - patterns tell the real story
  2. Set up proper alerting - don’t just monitor, alert on what matters
  3. Use Prometheus + Grafana - industry standard for a reason
  4. Implement autoscaling - let the system scale itself
  5. Measure end-to-end latency - lag is just one part of the picture

Next Steps

Ready to optimize Kafka storage? Check out our next lesson on Storage, Retention, and Log Management, where we’ll learn how Kafka stores data and how to tune it for performance.