Monitoring, Metrics, and Observability
Learning Objectives
- Master Prometheus metrics for system monitoring
- Implement structured logging with zap
- Use performance profiling to identify bottlenecks
- Build health checks for orchestration
- Design comprehensive observability systems
- Monitor the four golden signals of system health
Lesson 11.1: Prometheus Metrics
What are Metrics?
Metrics are measurements that help you understand system behavior:
Counter - only increases
- requests_total
- errors_total
- bytes_written_total

Gauge - can go up or down
- current_connections
- memory_usage_bytes
- queue_length

Histogram - distribution of values
- latency_seconds with buckets
- request_size_bytes
- response_time

Summary - percentiles computed on the client side
- quantile="0.5" value="0.1"
- quantile="0.95" value="0.5"
- quantile="0.99" value="1.0"

Four Golden Signals
Every system needs these four metrics for comprehensive monitoring:
1. Latency - how long requests take
Metric: request_duration_seconds
Target: p99 < 100ms
2. Traffic - how much work is happening
Metric: requests_per_second
Target: 10,000+ ops/sec
3. Errors - how many requests fail
Metric: error_rate_percent
Target: < 1%
4. Saturation - how utilized the system is
Metric: cpu_percent, memory_percent
Target: < 70%
Implementing Prometheus Metrics
```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

type Metrics struct {
	// Counters - always increase
	RequestsTotal prometheus.Counter
	ErrorsTotal   prometheus.Counter

	// Gauges - can go up/down
	ConnectionCount prometheus.Gauge
	MemoryBytes     prometheus.Gauge

	// Histograms - distribution with buckets
	RequestLatency prometheus.Histogram
}

// NewMetrics creates and registers all metrics.
func NewMetrics() *Metrics {
	return &Metrics{
		RequestsTotal: promauto.NewCounter(prometheus.CounterOpts{
			Name: "kvdb_requests_total",
			Help: "Total number of requests",
		}),
		ErrorsTotal: promauto.NewCounter(prometheus.CounterOpts{
			Name: "kvdb_errors_total",
			Help: "Total number of errors",
		}),
		ConnectionCount: promauto.NewGauge(prometheus.GaugeOpts{
			Name: "kvdb_connections",
			Help: "Current number of connections",
		}),
		MemoryBytes: promauto.NewGauge(prometheus.GaugeOpts{
			Name: "kvdb_memory_bytes",
			Help: "Memory usage in bytes",
		}),
		RequestLatency: promauto.NewHistogram(prometheus.HistogramOpts{
			Name:    "kvdb_request_latency_seconds",
			Help:    "Request latency in seconds",
			Buckets: []float64{0.001, 0.01, 0.1, 1},
		}),
	}
}

// RecordRequest records one request's latency and error status.
func (m *Metrics) RecordRequest(duration float64, hasError bool) {
	m.RequestsTotal.Inc()
	m.RequestLatency.Observe(duration)
	if hasError {
		m.ErrorsTotal.Inc()
	}
}

// UpdateConnections updates the connection gauge.
func (m *Metrics) UpdateConnections(count int64) {
	m.ConnectionCount.Set(float64(count))
}

// UpdateMemory updates the memory gauge.
func (m *Metrics) UpdateMemory(bytes int64) {
	m.MemoryBytes.Set(float64(bytes))
}
```

Using Metrics in Server
```go
import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type Server struct {
	metrics *Metrics
}

// HandleRequest records metrics for each request.
func (s *Server) HandleRequest(operation string) error {
	start := time.Now()

	// Do the work.
	err := s.doWork()

	// Record metrics.
	duration := time.Since(start).Seconds()
	s.metrics.RecordRequest(duration, err != nil)
	return err
}

// ExposeMetrics serves the Prometheus endpoint in the background.
func (s *Server) ExposeMetrics(addr string) {
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(addr, nil)
}
```

Lesson 11.2: Structured Logging
Problem: Unstructured Logs
❌ BAD
2025-01-16 10:15:23 Request from 192.168.1.1 with key user:100 took 50ms
Problems:
- Hard to parse
- Inconsistent format
- Can't filter or search easily

Solution: Structured Logging
✅ GOOD
```json
{
  "timestamp": "2025-01-16T10:15:23Z",
  "level": "error",
  "message": "request failed",
  "client_ip": "192.168.1.1",
  "key": "user:100",
  "duration_ms": 50,
  "error": "timeout"
}
```
Benefits:
- Easy to parse
- Consistent format
- Can filter and search
- Can aggregate

Using zap Logger
```go
import (
	"time"

	"go.uber.org/zap"
)

// Create the logger.
logger, _ := zap.NewProduction()
defer logger.Sync()

// Log with fields.
logger.Info("operation completed",
	zap.String("operation", "GET"),
	zap.String("key", "user:100"),
	zap.Duration("latency", 50*time.Millisecond),
	zap.Int("status_code", 200),
)

// Log errors.
logger.Error("operation failed",
	zap.String("operation", "PUT"),
	zap.Error(err),
	zap.Duration("latency", 100*time.Millisecond),
)

// Log levels.
logger.Debug("debug info")
logger.Info("informational")
logger.Warn("warning")
logger.Error("error")
```

Lesson 11.3: Performance Profiling
CPU Profiling
```go
import (
	"os"
	"runtime/pprof"
)

// Start profiling. Note the defer order: defers run last-in,
// first-out, so the profile is stopped before the file is closed.
f, _ := os.Create("cpu.prof")
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

// Run the operations you want to profile.
for i := 0; i < 1000000; i++ {
	store.Get([]byte("key"))
}

// Analyze: go tool pprof cpu.prof
```

Memory Profiling
```go
import (
	"os"
	"runtime/pprof"
)

// Capture a heap snapshot.
f, _ := os.Create("heap.prof")
pprof.WriteHeapProfile(f)
f.Close()

// Analyze: go tool pprof heap.prof
```

HTTP Profiling Endpoint
```go
// The blank import registers debug handlers on http.DefaultServeMux;
// an HTTP server must be running for them to be reachable.
import _ "net/http/pprof"

// Available endpoints:
// /debug/pprof/profile?seconds=30 - CPU profile
// /debug/pprof/heap               - Memory profile
// /debug/pprof/goroutine          - Goroutine profile
```

Lesson 11.4: Health Checks
Liveness Probe (Is server alive?)
```go
func (s *Server) Liveness() bool {
	// Simple check - is the process in a usable state?
	return s.db != nil
}

// HTTP endpoint
func (s *Server) handleLiveness(w http.ResponseWriter, r *http.Request) {
	if s.Liveness() {
		w.WriteHeader(200)
		w.Write([]byte("alive"))
	} else {
		w.WriteHeader(500)
	}
}

// Usage: GET /healthz
```

Readiness Probe (Can serve requests?)
```go
func (s *Server) Readiness() (bool, string) {
	// Database check
	_, err := s.db.Get(context.Background(), []byte("health"))
	if err != nil {
		return false, "database down"
	}

	// Replication check
	lag := s.replication.GetLag()
	if lag > 5*time.Second {
		return false, "replication lag too high"
	}

	// Disk check
	if s.getDiskUsagePercent() > 90 {
		return false, "disk full"
	}

	return true, "ready"
}

// HTTP endpoint
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
	ready, reason := s.Readiness()
	if ready {
		w.WriteHeader(200)
		w.Write([]byte("ready"))
	} else {
		w.WriteHeader(503)
		w.Write([]byte(reason))
	}
}

// Usage: GET /readyz
```

Lab 11.1: Monitoring System
Objective
Build a comprehensive monitoring system with metrics, logging, profiling, and health checks.
Requirements
- Prometheus Metrics: counters, gauges, and histograms for all operations
- Structured Logging: JSON logs with zap for all events
- Performance Profiling: CPU and memory profiling endpoints
- Health Checks: liveness and readiness probes
- Monitoring Dashboard: Grafana dashboard with key metrics
- Alerting: alert rules for critical metrics
Starter Code
```go
type MonitoringSystem struct {
	metrics *Metrics
	logger  *zap.Logger
}

func NewMonitoringSystem() *MonitoringSystem {
	metrics := NewMetrics()
	logger, _ := zap.NewProduction()
	return &MonitoringSystem{metrics, logger}
}

func (ms *MonitoringSystem) RecordOperation(op string, duration time.Duration, err error) {
	ms.metrics.RecordRequest(duration.Seconds(), err != nil)

	if err != nil {
		ms.logger.Error("operation failed",
			zap.String("operation", op),
			zap.Duration("latency", duration),
			zap.Error(err),
		)
	} else {
		ms.logger.Info("operation succeeded",
			zap.String("operation", op),
			zap.Duration("latency", duration),
		)
	}
}

// TODO: Implement health checks
func (ms *MonitoringSystem) Liveness() bool {
	return true
}

func (ms *MonitoringSystem) Readiness() (bool, string) {
	return true, "ready"
}

// TODO: Implement profiling endpoints
func (ms *MonitoringSystem) SetupProfiling() {
	// Add /debug/pprof endpoints
}
```

Test Template
```go
func TestMetrics(t *testing.T) {
	metrics := NewMetrics()

	// Record some operations
	metrics.RecordRequest(0.1, false)
	metrics.RecordRequest(0.2, true)
	metrics.UpdateConnections(100)

	// Verify metrics were recorded
	// (In a real test, you'd check the Prometheus registry)
}

func TestHealthChecks(t *testing.T) {
	ms := NewMonitoringSystem()

	// Test liveness
	assert.True(t, ms.Liveness())

	// Test readiness
	ready, reason := ms.Readiness()
	assert.True(t, ready)
	assert.Equal(t, "ready", reason)
}

func TestLogging(t *testing.T) {
	logger, _ := zap.NewDevelopment()
	logger.Info("test message",
		zap.String("key", "value"),
		zap.Int("count", 42),
	)
	// In a real test, you'd capture the log output
}
```

Acceptance Criteria
- ✅ All operations record metrics
- ✅ Structured JSON logs for all events
- ✅ CPU and memory profiling working
- ✅ Liveness probe responds correctly
- ✅ Readiness probe checks all dependencies
- ✅ Grafana dashboard shows key metrics
- ✅ Alert rules trigger on critical conditions
- ✅ > 90% code coverage
- ✅ All tests pass
Summary: Week 11 Complete
By completing Week 11, you've learned and implemented:
1. Prometheus Metrics
- Counter, Gauge, Histogram, Summary
- Four golden signals monitoring
- Custom metrics for database operations
- Prometheus endpoint exposure
2. Structured Logging
- JSON structured logs with zap
- Consistent log format
- Easy filtering and searching
- Log levels and context
3. Performance Profiling
- CPU profiling for hot paths
- Memory profiling for leaks
- HTTP profiling endpoints
- Goroutine analysis
4. Health Checks
- Liveness probe (is it alive?)
- Readiness probe (can it serve?)
- Dependency health checks
- Orchestration integration
Key Skills Mastered:
- ✅ Monitor system with Prometheus metrics
- ✅ Debug with structured JSON logs
- ✅ Profile CPU and memory usage
- ✅ Health checks for orchestration
- ✅ Four golden signals monitoring
- ✅ Production-ready observability
Ready for Week 12?
Next week we'll focus on performance optimization, load testing, and capacity planning to maximize system performance.
Continue to Week 12: Performance Optimization →