Kafka Production Deployment Guide: Best Practices for Scale
Deploying Apache Kafka in production requires careful planning, monitoring, and optimization. This comprehensive guide covers everything you need to know to run Kafka at scale in enterprise environments.
Production Architecture Overview
Recommended Cluster Setup
┌─────────────────────────────────────────────────────────────┐
│ Kafka Production Cluster │
├─────────────────────────────────────────────────────────────┤
│ Load Balancer (HAProxy/Nginx) │
├─────────────────────────────────────────────────────────────┤
│ Kafka Brokers (3-5 nodes) │
│ ├─ Broker 1 (Controller + Data) │
│ ├─ Broker 2 (Data) │
│ ├─ Broker 3 (Data) │
│ └─ Broker 4 (Data) │
├─────────────────────────────────────────────────────────────┤
│ Zookeeper Ensemble (3-5 nodes) │
│ ├─ ZK-1 (Leader) │
│ ├─ ZK-2 (Follower) │
│ └─ ZK-3 (Follower) │
├─────────────────────────────────────────────────────────────┤
│ Monitoring Stack │
│ ├─ Prometheus + Grafana │
│ ├─ ELK Stack (Logs) │
│ └─ Jaeger (Tracing) │
└─────────────────────────────────────────────────────────────┘
Hardware Requirements
Broker Specifications
Minimum Production Setup
- CPU: 8+ cores (Intel Xeon or AMD EPYC)
- RAM: 32GB+ (JVM heap: 8GB, OS cache: 24GB)
- Storage: 4TB+ NVMe SSD (RAID 1 for OS, RAID 10 for data)
- Network: 10Gbps+ network interface
Recommended Production Setup
- CPU: 16+ cores
- RAM: 64GB+ (JVM heap: 16GB, OS cache: 48GB)
- Storage: 8TB+ NVMe SSD
- Network: 25Gbps+ network interface
Zookeeper Specifications
- CPU: 4+ cores
- RAM: 16GB+
- Storage: 100GB+ SSD
- Network: 1Gbps+
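The 4TB+ minimum above can be sanity-checked from your expected write rate: per-broker storage is roughly ingest rate × retention window × replication factor, divided across brokers. A quick back-of-the-envelope check, with illustrative numbers (10 MB/s sustained ingest, 7-day retention, the guide's replication factor of 3 and 4 brokers):

```shell
# Rough per-broker storage estimate -- all input numbers are illustrative
ingest_mb_s=10; retention_days=7; rf=3; brokers=4
per_broker_gb=$(( ingest_mb_s * 86400 * retention_days * rf / brokers / 1024 ))
echo "~${per_broker_gb} GB per broker"   # ~4429 GB, i.e. about 4.3 TB
```

This lands right around the 4TB minimum; leave comfortable headroom on top for rebalances and retention spikes.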
Configuration Best Practices
Broker Configuration
# server.properties
# Broker ID (unique per broker)
broker.id=1
# Network settings
listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
advertised.listeners=PLAINTEXT://broker1.example.com:9092,SSL://broker1.example.com:9093
# Zookeeper connection
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# Log settings
log.dirs=/kafka-logs
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
# Performance tuning
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Log retention (time-based; the size cap is disabled so the 7-day window governs --
# a per-partition byte cap equal to one segment would keep almost no history)
log.retention.hours=168
log.retention.bytes=-1
log.segment.bytes=1073741824
log.cleanup.policy=delete
# Replication
replica.fetch.max.bytes=1048576
replica.socket.timeout.ms=30000
replica.lag.time.max.ms=10000
# Controller settings
controller.socket.timeout.ms=30000
JVM Configuration
# kafka-server-start.sh
export KAFKA_HEAP_OPTS="-Xmx16G -Xms16G"
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true"
OS Tuning
# /etc/sysctl.conf
# Network tuning
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 5000
# File descriptor limits
fs.file-max = 2097152
vm.swappiness = 1
# Apply settings
sysctl -p
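The fs.file-max setting above is the system-wide ceiling; the Kafka process also needs its per-user descriptor limit raised, since a broker holds descriptors for every log segment and client connection. A typical /etc/security/limits.conf entry (the user name and value are illustrative):

```
# /etc/security/limits.conf
kafka soft nofile 100000
kafka hard nofile 100000
```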
Security Configuration
SSL/TLS Encryption
# SSL Configuration
security.inter.broker.protocol=SSL
listeners=SSL://0.0.0.0:9092
ssl.keystore.location=/var/ssl/private/kafka.server.keystore.jks
# Placeholder passwords -- in production, inject real values from a secrets manager
# rather than committing them to server.properties
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/ssl/private/kafka.server.truststore.jks
ssl.truststore.password=test1234
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.2,TLSv1.3
ssl.keystore.type=JKS
ssl.truststore.type=JKS
SASL Authentication
# SASL Configuration
security.inter.broker.protocol=SASL_SSL
# Prefer SCRAM over PLAIN for inter-broker auth; PLAIN transmits credentials
# and relies entirely on the TLS layer for protection
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
listeners=SASL_SSL://0.0.0.0:9092
ACL Authorization
# Create admin user (on Kafka 2.2+, prefer --bootstrap-server over the
# deprecated --authorizer-properties zookeeper.connect)
kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
--add --allow-principal User:admin --operation All --topic '*' --group '*'
# Create producer user
kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
--add --allow-principal User:producer --operation Write --topic 'user-events'
# Create consumer user
kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
--add --allow-principal User:consumer --operation Read --topic 'user-events' --group 'analytics-group'
Monitoring and Observability
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  # Kafka does not serve Prometheus metrics on its client port (9092) or raw JMX
  # port; these targets assume the Prometheus JMX exporter javaagent runs on each
  # broker, listening on port 7071 (a common convention)
  - job_name: 'kafka-jmx'
    static_configs:
      - targets: ['broker1:7071', 'broker2:7071', 'broker3:7071']
    scrape_interval: 10s
Grafana Dashboards
{
  "dashboard": {
    "title": "Kafka Production Monitoring",
    "panels": [
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(kafka_server_brokertopicmetrics_messagesinpersec[5m])",
            "legendFormat": "Messages/sec"
          }
        ]
      },
      {
        "title": "Consumer Lag",
        "type": "graph",
        "targets": [
          {
            "expr": "kafka_consumer_lag_sum",
            "legendFormat": "Total Lag"
          }
        ]
      }
    ]
  }
}
Key Metrics to Monitor
Broker Metrics
# Message throughput
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
# Request latency
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
# Log flush latency (disk usage itself is best tracked with node-level metrics)
kafka.log:type=LogFlushStats,name=LogFlushTimeMs
# Replication health
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Consumer Metrics
# Consumer lag (maximum lag across assigned partitions)
kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max,client-id=*
# Consumer rate
kafka.consumer:type=consumer-fetch-manager-metrics,name=records-consumed-rate,client-id=*
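Consumer lag is simply log-end-offset minus committed offset per partition; summing it across partitions from kafka-consumer-groups.sh output is a one-liner. A sketch against sample output (columns trimmed for readability; real output includes group and client columns):

```shell
# Trimmed sample of kafka-consumer-groups.sh --describe output (illustrative data)
cat <<'EOF' > /tmp/group.txt
TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
user-events 0          1500            1620            120
user-events 1          980             1010            30
user-events 2          2210            2210            0
EOF
# Sum the LAG column, skipping the header
awk 'NR > 1 { total += $5 } END { print "total lag:", total }' /tmp/group.txt
# prints: total lag: 150
```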
Performance Tuning
Producer Optimization
// High-throughput producer configuration
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("acks", "1"); // Faster than "all"
props.put("retries", 3);
props.put("batch.size", 16384); // 16KB batches
props.put("linger.ms", 5); // Wait up to 5ms for batching
props.put("buffer.memory", 33554432); // 32MB buffer
props.put("compression.type", "snappy"); // Compress messages
props.put("max.in.flight.requests.per.connection", 5);
Consumer Optimization
// High-throughput consumer configuration
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("group.id", "analytics-group");
props.put("enable.auto.commit", false); // Manual commit for better control
props.put("auto.offset.reset", "earliest");
props.put("max.poll.records", 500); // Process more records per poll
props.put("fetch.min.bytes", 1);
props.put("fetch.max.wait.ms", 500);
props.put("max.partition.fetch.bytes", 1048576); // 1MB per partition
Topic Configuration
# Create optimized topic
kafka-topics.sh --create \
--topic user-events \
--bootstrap-server localhost:9092 \
--partitions 12 \
--replication-factor 3 \
--config min.insync.replicas=2 \
--config cleanup.policy=delete \
--config retention.ms=604800000 \
--config segment.ms=3600000 \
--config compression.type=snappy
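The retention.ms and segment.ms values in the command above are just 7 days and 1 hour expressed in milliseconds; computing them explicitly avoids off-by-a-zero mistakes:

```shell
echo $(( 7 * 24 * 3600 * 1000 ))   # 604800000 -> retention.ms (7 days)
echo $(( 3600 * 1000 ))            # 3600000   -> segment.ms (1 hour)
```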
Disaster Recovery
Backup Strategy
#!/bin/bash
# kafka-backup.sh
# Backup topic configurations
kafka-topics.sh --bootstrap-server localhost:9092 --list | \
xargs -I {} kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic {} > topic-configs.txt
# Backup consumer group offsets
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list | \
xargs -I {} kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {} > consumer-groups.txt
# Backup Zookeeper data
tar -czf zookeeper-backup-$(date +%Y%m%d).tar.gz /var/lib/zookeeper/
Cross-Datacenter Replication
# MirrorMaker 2 configuration (mm2.properties)
clusters=primary,secondary
primary.bootstrap.servers=broker1.primary.com:9092
secondary.bootstrap.servers=broker1.secondary.com:9092
# Enable the replication flow and choose which topics to mirror
primary->secondary.enabled=true
primary->secondary.topics=.*
# Replication settings
replication.factor=3
checkpoints.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
Deployment Automation
Docker Compose
# docker-compose.yml (single-broker setup, suitable for development/testing)
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zk-data:/var/lib/zookeeper/data
      - zk-logs:/var/lib/zookeeper/log
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      # A single broker cannot satisfy replication factor 3; use 1 here and
      # raise to 3 only when running three or more brokers
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
    volumes:
      - kafka-data:/var/lib/kafka/data
volumes:
  zk-data:
  zk-logs:
  kafka-data:
Kubernetes Deployment
# kafka-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.4.0
          ports:
            - containerPort: 9092
          # broker.id must be numeric; derive it from the pod ordinal
          # (pod names are kafka-0, kafka-1, ...) before starting the broker
          command:
            - sh
            - -c
            - export KAFKA_BROKER_ID="${HOSTNAME##*-}" && exec /etc/confluent/docker/run
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: KAFKA_ZOOKEEPER_CONNECT
              value: "zk-0.zk:2181,zk-1.zk:2181,zk-2.zk:2181"
            - name: KAFKA_ADVERTISED_LISTENERS
              value: "PLAINTEXT://$(POD_IP):9092"
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: kafka-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 100Gi
Troubleshooting Common Issues
High Consumer Lag
# Check consumer lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group analytics-group --describe
# Solutions:
# 1. Increase consumer instances
# 2. Increase partitions
# 3. Optimize consumer processing
# 4. Check for stuck consumers
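When increasing partitions, a rough starting point is target consume throughput divided by what one consumer instance can process. With illustrative numbers (120 MB/s target, ~10 MB/s measured per consumer), this lands on the 12 partitions used for the user-events topic earlier in this guide:

```shell
target_mb_s=120        # desired aggregate consume throughput (assumed)
per_consumer_mb_s=10   # measured single-consumer throughput (assumed)
# Round up so the last consumer is not a bottleneck
echo $(( (target_mb_s + per_consumer_mb_s - 1) / per_consumer_mb_s ))   # 12
```

Remember that partitions can be added but never removed, so size with headroom rather than revising downward later.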
Broker Out of Memory
# Monitor JVM heap
jstat -gc <kafka-pid> 1s
# Solutions:
# 1. Increase heap size
# 2. Optimize batch sizes
# 3. Check for memory leaks
# 4. Tune GC settings
Disk Space Issues
# Check disk usage
df -h /kafka-logs
# Inspect log directories for failed/offline disks
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --json | jq '.brokers[].logDirs[] | select(.error != null)'
# Solutions:
# 1. Reduce retention time or retention bytes
# 2. Add more disk space
# 3. Implement log compaction
# 4. Archive old data
Production Checklist
Pre-Deployment
- Hardware requirements met
- Network configuration tested
- Security settings configured
- Monitoring stack deployed
- Backup strategy implemented
- Disaster recovery plan ready
Post-Deployment
- All metrics collecting
- Alerts configured
- Performance benchmarks met
- Security audit completed
- Documentation updated
- Team training completed
Next Steps
This guide covered the essentials of Kafka production deployment, but there’s much more to explore:
- Advanced Security: OAuth2, mTLS, RBAC
- Multi-Region Deployment: Global data replication
- Stream Processing: Kafka Streams, KSQL
- Schema Management: Schema Registry, Avro
- Cloud Deployment: AWS MSK, Confluent Cloud
Ready to master Kafka production deployment? Check out our comprehensive Apache Kafka Mastery Course that covers everything from fundamentals to production operations.
This article is part of our Production Operations series. Subscribe to get the latest DevOps and infrastructure insights delivered to your inbox.