Lag Management and Performance Monitoring
Master Kafka lag management: monitoring strategies, Burrow setup, Prometheus integration, and building production-ready dashboards for consumer performance.
What is Consumer Lag?
Consumer lag = How far behind your consumer is from the latest message in a partition.
```
Partition 0: [msg-0] [msg-1] [msg-2] [msg-3] [msg-4] [msg-5] [msg-6]
                                        ↑                       ↑
                                 consumer offset          latest offset
                                       (3)                     (6)

Lag = 6 - 3 = 3 messages
```
Why lag matters:
- High lag = Slow processing = Data staleness
- Zero lag = Real-time processing = Good
- Negative lag = Consumer ahead of producer = Shouldn't be possible (usually a metrics or offset-reporting bug)
The 7 Common Lag Patterns
Here are the patterns I see in production and how to fix them - plus a small detection sketch after the list:
1. The Steady Climb
Lag: 0 → 100 → 500 → 1000 → 5000 → 10000...
Cause: Consumer can't keep up with producer rate
Fix: Scale consumers or optimize processing
2. The Spiky Pattern
Lag: 0 → 0 → 0 → 5000 → 0 → 0 → 0...
Cause: Bursty traffic or slow processing spikes
Fix: Increase batch size, add buffering
3. The Staircase
Lag: 0 → 1000 → 1000 → 1000 → 2000 → 2000...
Cause: Consumer stuck on specific messages
Fix: Dead letter queue, skip problematic messages
4. The Roller Coaster
Lag: 0 → 5000 → 0 → 8000 → 0 → 12000...
Cause: Rebalancing storms
Fix: Static membership, cooperative rebalancing
5. The Plateau
Lag: 0 → 10000 → 10000 → 10000 → 10000...
Cause: Consumer crashed, not processing
Fix: Health checks, auto-restart
6. The Sawtooth
Lag: 0 → 1000 → 0 → 1000 → 0 → 1000...
Cause: Normal batch processing
Fix: Nothing - this is healthy
7. The Explosion
Lag: 0 → 0 → 0 → 1000000 → 2000000...
Cause: Consumer completely stopped
Fix: Emergency scaling, circuit breakers
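Spotting these patterns means looking at the shape of recent lag samples, not a single number. Here's a rough Python sketch of that idea - the rules and thresholds are illustrative assumptions, not production logic:
```python
def classify_lag_trend(samples):
    """Toy heuristic over recent lag samples for one consumer group.

    Rules are illustrative assumptions, not production logic.
    """
    if max(samples) == 0:
        return "healthy: zero lag"
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if all(d >= 0 for d in deltas) and samples[-1] > samples[0]:
        return "steady climb: consumers can't keep up"
    if len(set(samples)) == 1:
        return "plateau: consumer may be stuck or crashed"
    if samples[-1] == 0:
        return "spiky/sawtooth: bursty traffic or normal batch commits"
    return "mixed: inspect the dashboards"

# Example: the steady climb from pattern 1
print(classify_lag_trend([0, 100, 500, 1000, 5000, 10000]))
```
Measuring Lag: The Right Way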
There are multiple ways to measure lag, each with trade-offs:
1. JMX Metrics (Most Common)
```
# Per-partition lag metrics
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*
  records-lag-max
  records-lag-avg

# Per consumer group
kafka.consumer:type=consumer-coordinator-metrics,client-id=*
  rebalance-rate-per-hour
```
2. Kafka Admin API
```python
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Committed offsets for the group: {TopicPartition: OffsetAndMetadata}
committed = admin.list_consumer_group_offsets('my-group')

# Log-end (latest) offsets for the same partitions
latest_offsets = consumer.end_offsets(list(committed.keys()))

# Lag = latest offset - committed offset, per partition
for tp, meta in committed.items():
    lag = latest_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}]: {lag} messages behind")
```
3. Burrow (LinkedIn's Tool)
```bash
# Install Burrow
go get github.com/linkedin/Burrow

# Start Burrow
./burrow --config-dir=./config

# Check lag via the HTTP API
curl http://localhost:8000/v3/kafka/local/consumer/my-group/lag
```
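You can poll that endpoint programmatically, too. A minimal sketch using the third-party requests library, assuming the response carries a status object with a totallag field as described in Burrow's v3 API docs:
```python
import requests  # third-party: pip install requests

# Same cluster and group names as the curl example above
URL = "http://localhost:8000/v3/kafka/local/consumer/my-group/lag"

resp = requests.get(URL, timeout=5)
resp.raise_for_status()
status = resp.json()["status"]  # assumed shape per Burrow's v3 API

print(f"group status: {status['status']}, total lag: {status['totallag']}")
```
Setting Up Prometheus + Grafana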
The gold standard for Kafka monitoring:
1. JMX Exporter
```bash
# Attach the JMX exporter as a Java agent to the broker JVM.
# kafka-server-start.sh is a shell script, not a jar, so pass the
# agent via KAFKA_OPTS rather than java -jar.
export KAFKA_OPTS="-javaagent:jmx_prometheus_javaagent.jar=8080:config.yaml"
bin/kafka-server-start.sh config/server.properties
```
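The config.yaml referenced above tells the exporter which MBeans to expose. A minimal, permissive example (export everything; in production you'd whitelist only the metrics you need):
```yaml
# config.yaml for the JMX exporter: expose all MBeans
lowercaseOutputName: true
rules:
  - pattern: ".*"
```
2. Prometheus Config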
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker1:8080', 'broker2:8080', 'broker3:8080']
  - job_name: 'kafka-consumers'
    static_configs:
      - targets: ['consumer1:8080', 'consumer2:8080']
```
3. Grafana Dashboard
Key panels to include:
- Consumer lag by group and partition
- Consumer throughput (messages/sec)
- Broker metrics (CPU, memory, disk)
- Rebalance frequency
- Error rates
Alerting Strategies
Don't just monitor - alert on the right things:
🚨 Critical Alerts
- Lag > 10,000 messages - Consumer falling behind
- Zero consumer throughput - Consumer stopped
- Rebalance frequency > 5/hour - Instability
- Consumer error rate > 1% - Processing issues
⚠️ Warning Alerts
- Lag > 1,000 messages - Getting behind
- Consumer throughput < 50% of normal - Slow processing
- Memory usage > 80% - Resource pressure
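As a concrete starting point, here's what the two lag alerts could look like as Prometheus alerting rules. The metric name kafka_consumergroup_lag is an assumption (it's what the popular kafka_exporter emits); substitute whatever your exporter exposes:
```yaml
# alert-rules.yml - metric name is an assumption, adjust for your exporter
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: ConsumerLagCritical
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is falling behind"
      - alert: ConsumerLagWarning
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is getting behind"
```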
Lag-Based Autoscaling with KEDA
Automatically scale consumers based on lag:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: my-topic
        lagThreshold: '1000'
```
End-to-End Latency Measurement
Lag tells you how far behind you are, but not how long messages take to get through the pipeline:
```python
import logging
import time
import uuid

logger = logging.getLogger(__name__)

# Producer side: embed a timestamp in each message
# ('payload' and 'do_work' stand in for your own data and handler)
message = {
    'data': payload,
    'produced_at': time.time(),
    'message_id': str(uuid.uuid4()),
}

# Consumer side
def process_message(message):
    start_time = time.time()
    produced_at = message['produced_at']

    # Process the message
    result = do_work(message['data'])

    # Calculate latencies
    end_to_end_latency = start_time - produced_at   # time spent queued in Kafka
    processing_latency = time.time() - start_time   # time spent in do_work

    logger.info(f"E2E: {end_to_end_latency:.2f}s, Processing: {processing_latency:.2f}s")
```
When lag is high, here's your debugging checklist - with a one-command shortcut after the list:
1. Check Consumer Metrics
- Are consumers actually running?
- Are they polling frequently enough?
- Are they committing offsets?
2. Check Processing Logic
- Is processing taking too long?
- Are there database bottlenecks?
- Are there external API calls?
3. Check Resource Usage
- CPU usage high?
- Memory usage high?
- Network I/O saturated?
4. Check Partition Distribution
- Are all partitions being consumed?
- Is one partition hot?
- Are consumers evenly distributed?
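Several of these checks (committed offsets, per-partition lag, which consumer owns which partition) come straight out of the kafka-consumer-groups tool that ships with Kafka:
```bash
# Per-partition view: CURRENT-OFFSET, LOG-END-OFFSET, LAG, and the owning consumer
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-group
```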
Production Monitoring Checklist
✅ Must Have
- Consumer lag monitoring (per partition)
- Consumer throughput tracking
- Error rate monitoring
- Rebalance frequency alerts
- End-to-end latency tracking
✅ Nice to Have
- Consumer group health dashboard
- Partition distribution visualization
- Historical lag trends
- Automated scaling based on lag
- Dead letter queue monitoring
Key Takeaways
- Monitor lag patterns, not just numbers - patterns tell the real story
- Set up proper alerting - don't just monitor, alert on what matters
- Use Prometheus + Grafana - industry standard for a reason
- Implement autoscaling - let the system scale itself
- Measure end-to-end latency - lag is just one part of the picture
Next Steps
Ready to optimize Kafka storage? Check out our next lesson on Storage, Retention, and Log Management, where we'll learn how Kafka stores data and how to tune it for performance.