Kafka Monitoring Best Practices: Complete Observability Guide
Learn how to monitor Apache Kafka in production with Prometheus, Grafana, and custom metrics. Monitor performance, detect issues, and optimize your Kafka cluster.
Effective monitoring is crucial for running Apache Kafka in production. This comprehensive guide covers everything you need to know about monitoring Kafka clusters, from basic metrics to advanced observability patterns.
Why Monitor Kafka?
Key Benefits
- Proactive Issue Detection: Identify problems before they impact users
- Performance Optimization: Tune your cluster for better throughput
- Capacity Planning: Understand when to scale your infrastructure
- SLA Compliance: Ensure your system meets performance requirements
- Cost Optimization: Right-size your infrastructure based on actual usage
Monitoring Architecture
┌─────────────────────────────────────────────────────────────┐
│ Kafka Monitoring Stack │
├─────────────────────────────────────────────────────────────┤
│ Kafka Brokers (JMX Metrics) │
│ ├─ Broker 1 (Port 9999) │
│ ├─ Broker 2 (Port 9999) │
│ └─ Broker 3 (Port 9999) │
├─────────────────────────────────────────────────────────────┤
│ Metrics Collection │
│ ├─ JMX Exporter (Prometheus) │
│ ├─ Kafka Exporter (Custom Metrics) │
│ └─ Burrow (Consumer Lag) │
├─────────────────────────────────────────────────────────────┤
│ Time Series Database │
│ ├─ Prometheus (Metrics Storage) │
│ └─ InfluxDB (Optional) │
├─────────────────────────────────────────────────────────────┤
│ Visualization & Alerting │
│ ├─ Grafana (Dashboards) │
│ ├─ AlertManager (Alerts) │
│ └─ PagerDuty (Incident Management) │
└─────────────────────────────────────────────────────────────┘
Essential Metrics Categories
1. Broker Metrics
Throughput Metrics
# Messages per second
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
# Bytes per second
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
# Requests per second
kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce
kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Fetch
Latency Metrics
# Request latency
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Fetch
# Log flush rate and time
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
# Replication health
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.server:type=ReplicaManager,name=PartitionCount
Resource Metrics
# JVM metrics (exposed via the standard java.lang MBeans)
java.lang:type=Memory (HeapMemoryUsage, NonHeapMemoryUsage)
# Disk usage (per-partition log size)
kafka.log:type=Log,name=Size,topic=*,partition=*
# Network I/O
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
2. Consumer Metrics
Consumer Lag
# Maximum consumer lag (records)
kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max,client-id=*
# Records consumed per second
kafka.consumer:type=consumer-fetch-manager-metrics,name=records-consumed-rate,client-id=*
# Fetch latency
kafka.consumer:type=consumer-fetch-manager-metrics,name=fetch-latency-avg,client-id=*
Consumer Health
# Consumer group status
kafka.consumer:type=consumer-coordinator-metrics,name=assigned-partitions,client-id=*
# Rebalance events
kafka.consumer:type=consumer-coordinator-metrics,name=rebalance-rate-per-hour,client-id=*
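These client-side JMX metrics can also be cross-checked directly from the offsets stored in Kafka. As a rough sketch (assuming the kafka-python client is installed; the broker addresses and group/topic names are placeholders), per-partition lag is simply the log-end offset minus the committed offset:

```python
def compute_lag(end_offsets, committed):
    """Per-partition lag = log-end offset minus committed offset.

    A committed offset of None means the group has consumed nothing yet,
    so the whole log counts as lag."""
    return {tp: end - (committed.get(tp) or 0) for tp, end in end_offsets.items()}

def group_lag(bootstrap_servers, group_id, topic):
    # kafka-python is a third-party dependency: pip install kafka-python
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers,
                             group_id=group_id, enable_auto_commit=False)
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)          # log-end offsets
    committed = {tp: consumer.committed(tp) for tp in partitions}
    consumer.close()
    return compute_lag(end_offsets, committed)
```

This is what tools like Burrow compute continuously; a one-shot check like this is handy for spot verification during incidents.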
3. Producer Metrics
Producer Performance
# Records sent per second
kafka.producer:type=producer-metrics,name=record-send-rate,client-id=*
# Average batch size (bytes)
kafka.producer:type=producer-metrics,name=batch-size-avg,client-id=*
# Send latency
kafka.producer:type=producer-metrics,name=request-latency-avg,client-id=*
Setting Up Prometheus Monitoring
JMX Exporter Configuration
# jmx_exporter_config.yml
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  # BrokerTopicMetrics are meters: scrape the Count attribute and apply rate() in PromQL
  - pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>Count
    name: kafka_topic_messages_in_per_sec
    labels:
      topic: "$1"
  - pattern: kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec, topic=(.+)><>Count
    name: kafka_topic_bytes_in_per_sec
    labels:
      topic: "$1"
  - pattern: kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec, topic=(.+)><>Count
    name: kafka_topic_bytes_out_per_sec
    labels:
      topic: "$1"
  - pattern: kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>Mean
    name: kafka_request_latency_ms
    labels:
      request: "$1"
  - pattern: kafka.server<type=ReplicaManager, name=PartitionCount><>Value
    name: kafka_partition_count
  # JVM heap comes from the standard java.lang:type=Memory MBean
  - pattern: java.lang<type=Memory><HeapMemoryUsage>used
    name: kafka_jvm_heap_memory_used_bytes
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "kafka_alerts.yml"

scrape_configs:
  # Per-broker JMX exporter agents; 'up' on this job doubles as a broker
  # liveness signal (the Kafka client port 9092 itself is not scrapeable)
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker1:9999', 'broker2:9999', 'broker3:9999']
    metrics_path: /metrics
    scrape_interval: 10s

  # Kafka Exporter for topic/partition/offset metrics (default port 9308)
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
    scrape_interval: 10s

  # Burrow consumer-lag metrics
  - job_name: 'kafka-consumer-lag'
    static_configs:
      - targets: ['burrow:8080']
    metrics_path: /metrics
    scrape_interval: 30s
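Once scraping is in place, the same metrics can be pulled programmatically through Prometheus' HTTP API (`GET /api/v1/query`), which is useful for scripts and health checks. A minimal sketch using only the standard library; the Prometheus base URL is a placeholder:

```python
import json
import urllib.parse
import urllib.request

def parse_instant_result(payload):
    """Flatten a Prometheus instant-query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [(r["metric"], float(r["value"][1]))
            for r in payload["data"]["result"]]

def instant_query(base_url, promql):
    """Run a PromQL instant query via GET /api/v1/query."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_result(json.load(resp))
```

For example, `instant_query("http://prometheus:9090", "sum(rate(kafka_topic_messages_in_per_sec[5m]))")` returns the cluster-wide message rate as a single sample.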
Grafana Dashboard Configuration
Key Dashboard Panels
1. Cluster Overview
{
  "title": "Kafka Cluster Overview",
  "panels": [
    {
      "title": "Messages per Second",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(kafka_topic_messages_in_per_sec[5m]))",
          "legendFormat": "Total Messages/sec"
        }
      ]
    },
    {
      "title": "Bytes per Second",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(kafka_topic_bytes_in_per_sec[5m]))",
          "legendFormat": "Bytes In/sec"
        },
        {
          "expr": "sum(rate(kafka_topic_bytes_out_per_sec[5m]))",
          "legendFormat": "Bytes Out/sec"
        }
      ]
    }
  ]
}
2. Consumer Lag Monitoring
{
  "title": "Consumer Lag",
  "panels": [
    {
      "title": "Consumer Group Lag",
      "type": "graph",
      "targets": [
        {
          "expr": "kafka_consumer_lag_sum",
          "legendFormat": "{{consumer_group}} - {{topic}}"
        }
      ]
    },
    {
      "title": "Lag by Partition",
      "type": "table",
      "targets": [
        {
          "expr": "kafka_consumer_lag",
          "format": "table"
        }
      ]
    }
  ]
}
3. Performance Metrics
{
  "title": "Performance Metrics",
  "panels": [
    {
      "title": "Request Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "kafka_request_latency_ms{request=\"Produce\"}",
          "legendFormat": "Produce Latency"
        },
        {
          "expr": "kafka_request_latency_ms{request=\"Fetch\"}",
          "legendFormat": "Fetch Latency"
        }
      ]
    },
    {
      "title": "JVM Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "kafka_jvm_heap_memory_used_bytes",
          "legendFormat": "Heap Used"
        }
      ]
    }
  ]
}
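Rather than clicking dashboards together by hand, JSON models like the ones above can be provisioned through Grafana's HTTP API (`POST /api/dashboards/db`). A minimal sketch, assuming a Grafana service-account token; the URL and token are placeholders:

```python
import json
import urllib.request

def wrap_dashboard(dashboard, overwrite=True):
    """Build the POST body Grafana's /api/dashboards/db endpoint expects.

    'id' must be null when creating a new dashboard (rather than updating one)."""
    body = dict(dashboard)
    body.setdefault("id", None)
    return {"dashboard": body, "overwrite": overwrite}

def push_dashboard(grafana_url, api_token, dashboard):
    """Create or update a dashboard via the Grafana HTTP API."""
    payload = json.dumps(wrap_dashboard(dashboard)).encode()
    req = urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Keeping dashboards in version control and pushing them this way makes the monitoring stack itself reproducible.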
Custom Monitoring with Burrow
Burrow Configuration
# burrow.cfg (legacy gcfg-style config; Burrow 1.x+ uses a TOML file with [cluster.*]/[consumer.*] tables)
[general]
logdir = "/var/log/burrow"
pidfile = "/var/run/burrow.pid"
[zookeeper]
servers = ["zk1:2181", "zk2:2181", "zk3:2181"]
timeout = 6
root-path = "/burrow"
[kafka "primary"]
brokers = ["broker1:9092", "broker2:9092", "broker3:9092"]
zookeeper = ["zk1:2181", "zk2:2181", "zk3:2181"]
zookeeper-path = "/kafka"
zookeeper-offsets = true
zookeeper-offsets-path = "/consumers"
[consumer "primary"]
client-profile = "primary"
cluster = "primary"
servers = ["broker1:9092", "broker2:9092", "broker3:9092"]
group-blacklist = "console-consumer-.*"
group-whitelist = ""
[client-profile "primary"]
client-id = "burrow-lagchecker"
kafka-version = "2.8.0"
Burrow API Usage
# Get consumer group status
curl http://burrow:8080/v3/kafka/primary/consumer/analytics-group/status
# Get consumer lag
curl http://burrow:8080/v3/kafka/primary/consumer/analytics-group/lag
# Get topic details
curl http://burrow:8080/v3/kafka/primary/topic/user-events
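The status endpoint's JSON can feed automated health checks. A small sketch that reduces a response to a pass/fail signal; it assumes the Burrow v3 response layout (`{"status": {"status": "OK", "totallag": ...}}`), so verify the shape against your Burrow version:

```python
# Burrow's evaluator states, roughly ordered by severity (assumed set)
HEALTHY = {"OK"}
DEGRADED = {"WARN"}
BROKEN = {"ERR", "STOP", "STALL"}

def summarize_group_status(payload):
    """Reduce a Burrow v3 /status (or /lag) response to (state, total_lag, is_healthy)."""
    status = payload.get("status", {})
    state = status.get("status", "NOTFOUND")
    total_lag = status.get("totallag", 0)
    return state, total_lag, state in HEALTHY
```

Wiring this into a cron job or readiness probe gives a cheap, consumer-centric health signal that complements broker-side metrics.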
Alerting Rules
Critical Alerts
# kafka_alerts.yml
groups:
  - name: kafka-critical
    rules:
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Kafka broker {{ $labels.instance }} is down"

      - alert: HighConsumerLag
        expr: kafka_consumer_lag_sum > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag detected"
          description: "Consumer group {{ $labels.consumer_group }} has {{ $value }} messages lag"

      - alert: KafkaDiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/kafka-logs"} / node_filesystem_size_bytes{mountpoint="/kafka-logs"}) < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka disk space is low"
          description: "Disk space on {{ $labels.instance }} is below 10%"

      - alert: KafkaHighRequestLatency
        expr: kafka_request_latency_ms{request="Produce"} > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Kafka request latency"
          description: "Produce request latency is {{ $value }}ms on {{ $labels.instance }}"
Warning Alerts
  - name: kafka-warning
    rules:
      - alert: KafkaHighMemoryUsage
        expr: (kafka_jvm_heap_memory_used_bytes / kafka_jvm_heap_memory_max_bytes) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High JVM memory usage"
          description: "JVM memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      - alert: KafkaLowThroughput
        expr: rate(kafka_topic_messages_in_per_sec[5m]) < 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low Kafka throughput"
          description: "Message throughput is {{ $value }} messages/sec on topic {{ $labels.topic }}"
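Once these rules fire, the currently active alerts can be inspected through Alertmanager's v2 API (`GET /api/v2/alerts`), which is handy for chatops bots or status pages. A stdlib-only sketch; the Alertmanager URL is a placeholder:

```python
import json
import urllib.request

def filter_alerts(alerts, severity=None):
    """Keep alerts whose severity label matches (None keeps everything)."""
    return [a for a in alerts
            if severity is None or a.get("labels", {}).get("severity") == severity]

def active_alerts(alertmanager_url, severity=None):
    """Fetch the active alert list from Alertmanager's v2 API and filter it."""
    with urllib.request.urlopen(f"{alertmanager_url}/api/v2/alerts", timeout=10) as resp:
        return filter_alerts(json.load(resp), severity)
```

For example, `active_alerts("http://alertmanager:9093", severity="critical")` returns only the alerts that should page someone.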
Custom Metrics Collection
Python Script for Custom Metrics
#!/usr/bin/env python3
import time

import requests
from prometheus_client import start_http_server, Gauge

BURROW = 'http://burrow:8080/v3/kafka/primary'

# Prometheus metrics
consumer_lag = Gauge('kafka_consumer_lag_custom', 'Consumer lag by group and topic',
                     ['group', 'topic', 'partition'])
partition_count = Gauge('kafka_topic_partition_count', 'Number of partitions per topic',
                        ['topic'])


def collect_consumer_lag():
    """Collect consumer lag from the Burrow API"""
    try:
        # Burrow v3 returns {"consumers": ["group-a", "group-b", ...]}
        groups = requests.get(f'{BURROW}/consumer', timeout=10).json().get('consumers', [])
        for group in groups:
            lag_data = requests.get(f'{BURROW}/consumer/{group}/lag', timeout=10).json()
            # Per-partition detail lives under the "status" key
            for p in lag_data.get('status', {}).get('partitions', []):
                consumer_lag.labels(
                    group=group,
                    topic=p['topic'],
                    partition=str(p['partition'])
                ).set(p['current_lag'])
    except Exception as e:
        print(f"Error collecting consumer lag: {e}")


def collect_topic_metrics():
    """Collect topic-level metrics"""
    try:
        # Burrow v3 returns {"topics": ["topic-a", ...]}
        topics = requests.get(f'{BURROW}/topic', timeout=10).json().get('topics', [])
        for name in topics:
            # Per-topic detail returns {"offsets": [...]} - one log-end offset per
            # partition. Burrow does not expose on-disk size; use the JMX metric
            # kafka.log:type=Log,name=Size for byte-level usage.
            offsets = requests.get(f'{BURROW}/topic/{name}', timeout=10).json().get('offsets', [])
            partition_count.labels(topic=name).set(len(offsets))
    except Exception as e:
        print(f"Error collecting topic metrics: {e}")


if __name__ == '__main__':
    # Expose collected metrics for Prometheus on :8000/metrics
    start_http_server(8000)
    # Collect metrics every 30 seconds
    while True:
        collect_consumer_lag()
        collect_topic_metrics()
        time.sleep(30)
Log Monitoring
ELK Stack Configuration
# logstash.conf
input {
  file {
    path => "/var/log/kafka/server.log"
    type => "kafka-server"
  }
  file {
    path => "/var/log/kafka/controller.log"
    type => "kafka-controller"
  }
}

filter {
  if [type] == "kafka-server" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:log_message}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "kafka-logs-%{+YYYY.MM.dd}"
  }
}
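With logs flowing into the `kafka-logs-*` indices, error rates can be spot-checked directly against Elasticsearch's `_count` API without building a dashboard first. A stdlib-only sketch; the Elasticsearch URL is a placeholder, and the `level` field comes from the grok pattern above:

```python
import json
import urllib.request

def error_count_query(level="ERROR"):
    """Body for Elasticsearch's _count API: documents whose level field matches."""
    return {"query": {"match": {"level": level}}}

def count_log_entries(es_url, index_pattern="kafka-logs-*", level="ERROR"):
    """Count matching log documents via POST /<index>/_count."""
    req = urllib.request.Request(
        f"{es_url}/{index_pattern}/_count",
        data=json.dumps(error_count_query(level)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["count"]
```

For example, `count_log_entries("http://elasticsearch:9200")` returns the number of ERROR-level entries across all Kafka log indices.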
Log-based Alerts
# logstash-alerts.yml (assumes log-level counts are exported to Prometheus as a logstash_logs counter)
groups:
  - name: kafka-logs
    rules:
      - alert: KafkaErrorLogs
        expr: increase(logstash_logs{level="ERROR"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Kafka logs"
          description: "{{ $value }} errors in the last 5 minutes"

      - alert: KafkaReplicationErrors
        expr: increase(logstash_logs{message=~".*replication.*error.*"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka replication errors detected"
          description: "Replication errors found in logs"
Performance Optimization Based on Metrics
1. Optimize Producer Performance
// Based on metrics showing high produce latency
Properties props = new Properties();
props.put("batch.size", 32768);          // Larger batches amortize per-request overhead
props.put("linger.ms", 10);              // Wait up to 10 ms to fill batches
props.put("compression.type", "snappy"); // Compress batches on the wire
props.put("max.in.flight.requests.per.connection", 5); // More parallelism; pair with enable.idempotence=true to keep ordering on retries
2. Optimize Consumer Performance
// Based on metrics showing high consumer lag
Properties props = new Properties();
props.put("max.poll.records", 1000); // Increase poll size
props.put("fetch.min.bytes", 50000); // Increase fetch size
props.put("fetch.max.wait.ms", 500); // Increase wait time
3. Optimize Broker Configuration
# Based on metrics showing high disk I/O
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
log.flush.interval.messages=10000
log.flush.interval.ms=1000
Monitoring Best Practices
1. Set Up Proper Alerting
- Use multiple alert levels (critical, warning, info)
- Set appropriate thresholds based on your SLA
- Include runbook links in alert descriptions
- Test alerts regularly
2. Create Comprehensive Dashboards
- Separate dashboards for different audiences (ops, dev, management)
- Include both real-time and historical views
- Use consistent color coding and naming
- Include context and explanations
3. Monitor End-to-End
- Track messages from producer to consumer
- Monitor application-level metrics
- Include business metrics (orders processed, user events)
- Correlate Kafka metrics with application performance
4. Regular Health Checks
- Automated daily health reports
- Weekly capacity planning reviews
- Monthly performance trend analysis
- Quarterly architecture reviews
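The daily health report above is easy to automate: sample a handful of PromQL expressions and format the results. A sketch using only the standard library; the queries mirror metrics defined earlier in this guide, and the Prometheus URL is a placeholder:

```python
# PromQL expressions mirroring the checks described above
HEALTH_QUERIES = {
    "brokers_up": 'sum(up{job="kafka-brokers"})',
    "messages_per_sec": "sum(rate(kafka_topic_messages_in_per_sec[5m]))",
    "max_consumer_lag": "max(kafka_consumer_lag_sum)",
}

def format_report(values, date):
    """Render collected metric values as a plain-text daily report."""
    lines = [f"Kafka health report - {date}"]
    for name, value in sorted(values.items()):
        lines.append(f"  {name}: {value:.1f}")
    return "\n".join(lines)

def collect_report(prometheus_url, date):
    """Sample each health query from Prometheus and format the results."""
    import json
    import urllib.parse
    import urllib.request

    values = {}
    for name, promql in HEALTH_QUERIES.items():
        url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
        with urllib.request.urlopen(url, timeout=10) as resp:
            result = json.load(resp)["data"]["result"]
            values[name] = float(result[0]["value"][1]) if result else 0.0
    return format_report(values, date)
```

Running this from cron and mailing or posting the output covers the "automated daily health reports" practice with a few dozen lines.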
Troubleshooting Common Issues
High Consumer Lag
# Check consumer group status
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group analytics-group --describe
# Solutions:
# 1. Increase consumer instances
# 2. Increase partitions
# 3. Optimize consumer processing
# 4. Check for stuck consumers
High Request Latency
# Check request metrics (quote the URL so the shell doesn't interpret ? as a glob)
curl 'http://prometheus:9090/api/v1/query?query=kafka_request_latency_ms'
# Solutions:
# 1. Increase broker resources
# 2. Optimize producer/consumer settings
# 3. Check network issues
# 4. Review topic partitioning
Memory Issues
# Check JVM metrics
curl 'http://prometheus:9090/api/v1/query?query=kafka_jvm_heap_memory_used_bytes'
# Solutions:
# 1. Increase heap size
# 2. Optimize batch sizes
# 3. Check for memory leaks
# 4. Tune GC settings
Next Steps
This guide covered the essentials of Kafka monitoring, but there’s much more to explore:
- Advanced Metrics: Custom metrics, business KPIs
- Distributed Tracing: OpenTracing, Jaeger integration
- Machine Learning: Anomaly detection, predictive scaling
- Cloud Monitoring: AWS CloudWatch, Azure Monitor
- Cost Optimization: Resource utilization analysis
Ready to master Kafka monitoring and observability? Check out our comprehensive Apache Kafka Mastery Course that covers everything from fundamentals to production monitoring.
This article is part of our Observability series. Subscribe to get the latest monitoring and DevOps insights delivered to your inbox.