SSTables and LSM Trees
Learning Objectives
- Understand LSM Tree architecture and its benefits
- Implement Skip Lists for sorted memtables
- Build Sorted String Tables (SSTables) for disk storage
- Master dual-memtable design for concurrent access
- Implement background flushing mechanisms
Lesson 5.1: LSM Tree Architecture
The Problem with Append-Only Logs
The system we have built so far has limitations that LSM Trees solve:
Current System Issues:
- Data is stored in memory (a hash map) and is not sorted
- All writes are logged to disk (the WAL), but the log is not indexed
- No efficient range queries
- Recovery must replay the entire log
In-Memory:
"user:1" → "alice"
"user:2" → "bob"
"user:3" → "charlie"
(unordered)
To find all keys between "user:1" and "user:3":
└─ Must scan all keys (no index)
└─ O(n) operation
To persist to disk efficiently:
└─ WAL is append-only, not indexed
└─ Recovery must replay entire log
LSM Tree Overview
LSM = Log-Structured Merge Tree
The key insight: Optimize for writes, not reads
┌────────────────────────────────────┐
│ In-Memory (Memtable) │
│ Sorted structure (SkipList) │
│ Size limit: 2MB │
└────────────────────────────────────┘
│
│ (When full)
▼
┌────────────────────────────────────┐
│ Level 0: SSTables (on disk) │
│ Multiple small files (0-4) │
└────────────────────────────────────┘
│
│ (Compaction)
▼
┌────────────────────────────────────┐
│ Level 1: SSTables (on disk) │
│ Fewer, larger files (10-20) │
└────────────────────────────────────┘
│
│ (Compaction)
▼
┌────────────────────────────────────┐
│ Level 2: SSTables (on disk) │
│ Even fewer, even larger files │
└────────────────────────────────────┘
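Before building each piece, it helps to see how they might hang together in code. The following is a minimal sketch; the type and field names are illustrative assumptions, not the final API we build in this lesson:

// LSMTree ties the layers of the diagram together: one mutable
// memtable, one immutable memtable being flushed, and sorted,
// immutable files on disk organized into levels.
type LSMTree struct {
	memtable  *SkipList    // active, in-memory, sorted, ~2MB limit
	immutable *SkipList    // full memtable waiting to be flushed
	levels    [][]*SSTable // levels[0] = small recent files, levels[1] = larger older files, ...
}

// SSTable describes one sorted, immutable file on disk.
type SSTable struct {
	path   string
	minKey []byte
	maxKey []byte
}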
Memtable and Immutable Memtable
Current problem: With a single memtable, reads and writes must wait whenever it is being flushed to disk.
Solution: Dual memtables
Writer goroutines:
└─ Write to Active memtable
Reader goroutines:
├─ Search Active memtable
└─ Search Immutable memtable
Flusher goroutine (background):
├─ When Active full
├─ Swap: Active → Immutable
├─ Create new Active
└─ Flush Immutable to disk
Timeline:
t0: Active = MT0 (empty), Immutable = none
t1: Writes to MT0
t2: MT0 full → Immutable = MT0, Active = MT1 (new)
t3: Writes to MT1
t4: Flusher flushes Immutable (MT0) to disk → Immutable = none
t5: MT1 full → Immutable = MT1, Active = MT2 (new)
Benefits:
- Writes don't block on disk I/O
- Reads can search both memtables
- Flushing happens in the background
Sorted String Tables (SSTables)
What is an SSTable? A file on disk containing sorted key-value pairs.
SSTable format:
┌──────────────────────────────┐
│ Header │
│ - Version: 1 │
│ - EntryCount: 1000 │
│ - MinKey: "user:100" │
│ - MaxKey: "user:999" │
│ - BloomFilter: (optional) │
├──────────────────────────────┤
│ Data Blocks │
│ Block 0: Entries 0-99 │
│ Block 1: Entries 100-199 │
│ ... (more blocks) │
│ Block N: Entries 900-999 │
├──────────────────────────────┤
│ Index Block │
│ - Block 0 key range │
│ - Block 1 key range │
│ - ... │
└──────────────────────────────┘
Advantages over WAL:
- Sorted keys enable range queries
- Index allows binary search
- Immutable (no overwrites)
- Easy to delete an entire file
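To make the on-disk format concrete, here is a sketch of how a single entry might be encoded with Go's encoding/binary package. The exact layout (length-prefixed key and value) is an illustrative assumption; the full writer, with header and index blocks, is part of the lab:

package storage

import (
	"encoding/binary"
	"os"
)

// writeEntry appends one key-value pair to an SSTable file as
// [4-byte key length][4-byte value length][key bytes][value bytes].
// Entries must be appended in sorted key order.
func writeEntry(f *os.File, key, value []byte) error {
	var lengths [8]byte
	binary.LittleEndian.PutUint32(lengths[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(lengths[4:8], uint32(len(value)))
	if _, err := f.Write(lengths[:]); err != nil {
		return err
	}
	if _, err := f.Write(key); err != nil {
		return err
	}
	_, err := f.Write(value)
	return err
}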
Multi-level Compaction
Compaction Strategy:
Level 0 (Size: 4MB, Files: 4)
├─ sstable-0.db (1MB)
├─ sstable-1.db (1MB)
├─ sstable-2.db (1MB)
└─ sstable-3.db (1MB)
When adding 5th file → Compact to Level 1:
└─ Merge all 4 files + new file
└─ Write to Level 1 (5MB)
Level 1 (Size: 10MB, Files: 2)
├─ sstable-0-1.db (5MB)
└─ sstable-1-1.db (5MB)
When Level 1 exceeds 100MB → Compact to Level 2:
└─ Merge overlapping files
└─ Write to Level 2 (100MB+)
Pattern:
├─ Level 0: Small, recent data
├─ Level 1: Larger, older data
├─ Level 2: Large, very old data
└─ Each level ~10x larger than previous
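A sketch of how this growth pattern can be expressed as a size budget per level. The constants mirror the numbers above; the function names are illustrative, and compaction itself is covered in the next lesson:

const (
	level0MaxFiles = 4                // compact Level 0 once a 5th file arrives
	level1MaxBytes = 10 * 1024 * 1024 // 10MB budget for Level 1
	levelSizeRatio = 10               // each level ~10x larger than the previous
)

// maxBytesForLevel returns the size budget for levels 1 and above.
func maxBytesForLevel(level int) int64 {
	size := int64(level1MaxBytes)
	for i := 1; i < level; i++ {
		size *= levelSizeRatio
	}
	return size
}

// needsCompaction reports whether a level has outgrown its budget.
func needsCompaction(level, fileCount int, totalBytes int64) bool {
	if level == 0 {
		return fileCount > level0MaxFiles
	}
	return totalBytes > maxBytesForLevel(level)
}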
Read and Write Amplification
Write Amplification
How many times data is written:
Without LSM (direct to disk):
Write amplification: 1x
(data written once)
With LSM (simplified):
Level 0: Write once
Compact to L1: Rewrite once
Compact to L2: Rewrite again
Total: ~3-5x
The exact cost depends on the compaction strategy:
└─ Leveled compaction: higher write amplification (often ~10x or more), but fewer files to check on reads
└─ Tiered compaction: lower write amplification (~2-5x), but more files to check on reads
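As a concrete example under the simplified model above: ingesting 1GB of new data is written once to the WAL, once when the memtable is flushed into a Level 0 SSTable, and roughly once more each time it is rewritten into Level 1 and Level 2. That is about 4GB of physical writes for 1GB of logical data, which is where the ~3-5x figure comes from.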
Read Amplification
How many files checked per read:
Without structure (WAL):
Read amplification: N (scan all)
With LSM:
Read key "user:500":
1. Check memtable (instant)
2. Check Level 0 (~4 files, sorted)
3. Use index/bloom filter
4. Check Level 1 (1 file via binary search)
Total: ~5-10 seeks (vs scanning N files)
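A sketch of that read path in Go. The memtable, level0, and levels fields and the SSTable helpers (MaybeContains, Get, FindSSTable) are assumptions used for illustration; we build the real versions over the coming lessons:

// Get follows the LSM read path: memtables first, then Level 0 files
// newest-first, then at most one file per higher level.
func (db *DB) Get(key []byte) ([]byte, bool) {
	// 1. Newest data lives in the active and immutable memtables.
	if v, ok := db.memtable.Get(key); ok {
		return v, true
	}
	// 2. Level 0 files can overlap, so check each one, newest first.
	for i := len(db.level0) - 1; i >= 0; i-- {
		sst := db.level0[i]
		if sst.MaybeContains(key) { // key-range / bloom filter check
			if v, ok := sst.Get(key); ok {
				return v, true
			}
		}
	}
	// 3. Levels 1+ hold non-overlapping key ranges, so binary search
	//    locates at most one candidate file per level.
	for _, level := range db.levels {
		if sst := level.FindSSTable(key); sst != nil {
			if v, ok := sst.Get(key); ok {
				return v, true
			}
		}
	}
	return nil, false
}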
Trade-off:
An LSM tree trades away some write efficiency (each entry is written multiple times) to gain read efficiency (only a few sorted, indexed files to check per lookup).
Lesson 5.2: Memtable Implementation
Skip List Data Structure
We need a sorted, concurrent data structure for the memtable.
Option 1: Red-Black Tree
        50
       /  \
     30    70
    /  \
  20    40
- Balanced: O(log n)
- Sorted: Yes
- Complex rebalancing
Option 2: Skip List
Level 3:             17
Level 2:       10 ── 17 ── 26
Level 1:  6 ── 10 ── 17 ── 26 ── 42
Level 0:  6 ── 10 ── 17 ── 26 ── 42 ── 73 ── 99
- Probabilistic: O(log n) average
- Sorted: Yes
- Lock-free variants possible
- Simple to implement
Skip List Implementation
package storage

import (
	"bytes"
	"math/rand"
	"sync"
)

const (
	MaxLevel = 16
	pFactor  = 0.5
)

type SkipNode struct {
	key   []byte
	value []byte
	next  []*SkipNode
}

type SkipList struct {
	head  *SkipNode
	level int
	mu    sync.RWMutex
	size  int64
}

// NewSkipList creates an empty skip list.
func NewSkipList() *SkipList {
	head := &SkipNode{
		key:  nil,
		next: make([]*SkipNode, MaxLevel),
	}
	return &SkipList{
		head:  head,
		level: 1,
		size:  0,
	}
}

// randomLevel generates a random level for a new node; each extra
// level is added with probability pFactor.
func randomLevel() int {
	level := 1
	for level < MaxLevel && rand.Float64() < pFactor {
		level++
	}
	return level
}

// Put inserts a key or updates its value if it already exists.
func (sl *SkipList) Put(key, value []byte) {
	sl.mu.Lock()
	defer sl.mu.Unlock()

	update := make([]*SkipNode, MaxLevel)
	current := sl.head

	// At each level, find the rightmost node whose key is < key.
	for i := sl.level - 1; i >= 0; i-- {
		for current.next[i] != nil &&
			bytes.Compare(current.next[i].key, key) < 0 {
			current = current.next[i]
		}
		update[i] = current
	}

	// If the key already exists, update its value in place.
	if current.next[0] != nil && bytes.Equal(current.next[0].key, key) {
		current.next[0].value = value
		return
	}

	// Create the new node with a randomly chosen height.
	level := randomLevel()
	newNode := &SkipNode{
		key:   key,
		value: value,
		next:  make([]*SkipNode, level),
	}

	// If the new node is taller than the list, the head is its
	// predecessor on the new levels.
	if level > sl.level {
		for i := sl.level; i < level; i++ {
			update[i] = sl.head
		}
		sl.level = level
	}

	// Splice the node in at each of its levels.
	for i := 0; i < level; i++ {
		newNode.next[i] = update[i].next[i]
		update[i].next[i] = newNode
	}

	sl.size++
}
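The memtable code below calls a Get method on the skip list that the listing above does not show. A minimal version uses the same top-down search as Put, under a read lock:

// Get returns the value stored for key, or false if it is absent.
func (sl *SkipList) Get(key []byte) ([]byte, bool) {
	sl.mu.RLock()
	defer sl.mu.RUnlock()

	current := sl.head
	for i := sl.level - 1; i >= 0; i-- {
		for current.next[i] != nil &&
			bytes.Compare(current.next[i].key, key) < 0 {
			current = current.next[i]
		}
	}
	candidate := current.next[0]
	if candidate != nil && bytes.Equal(candidate.key, key) {
		return candidate.value, true
	}
	return nil, false
}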
Flush Triggers and Thresholds
// Memtable pairs an active skip list with an immutable one that is
// waiting to be flushed to disk.
type Memtable struct {
	active      *SkipList
	immutable   *SkipList
	mu          sync.RWMutex
	maxSize     int64
	currentSize int64
	flushCh     chan *SkipList
}

// Put writes to the active memtable and triggers a flush once it
// grows past maxSize.
func (m *Memtable) Put(key, value []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	entrySize := int64(len(key) + len(value))

	// If this entry would push the active memtable over its limit,
	// swap it into the immutable slot and start a fresh active one.
	if m.currentSize+entrySize > m.maxSize {
		m.immutable = m.active
		m.active = NewSkipList()
		m.currentSize = 0

		// Hand the retired memtable to the flusher.
		full := m.immutable
		select {
		case m.flushCh <- full:
			// Handed off without blocking.
		default:
			// Flush queue is full: release the lock so readers are not
			// stalled, block until the flusher catches up (backpressure),
			// then re-acquire the lock before continuing.
			m.mu.Unlock()
			m.flushCh <- full
			m.mu.Lock()
		}
	}

	m.active.Put(key, value)
	m.currentSize += entrySize
	return nil
}

// Get checks the active memtable first (newest data), then the
// immutable one that has not been flushed yet.
func (m *Memtable) Get(key []byte) ([]byte, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()

	if v, ok := m.active.Get(key); ok {
		return v, ok
	}
	if m.immutable != nil {
		if v, ok := m.immutable.Get(key); ok {
			return v, ok
		}
	}
	return nil, false
}
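The listing above assumes the memtable has already been constructed. A minimal constructor might look like the sketch below; the channel capacity of 1 is an illustrative choice, not a requirement:

// NewMemtable creates an empty memtable that flushes once the active
// skip list exceeds maxSize bytes (e.g. NewMemtable(2 << 20) for 2MB).
func NewMemtable(maxSize int64) *Memtable {
	return &Memtable{
		active:  NewSkipList(),
		maxSize: maxSize,
		flushCh: make(chan *SkipList, 1),
	}
}

A capacity-1 channel lets one flush queue up without blocking writers; if a second swap happens before the first flush finishes, the backpressure path in Put kicks in.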
Background Flushing Goroutines
// StartFlusher starts background flushing
func (db *DB) StartFlusher() {
go func() {
for sl := range db.memtable.flushCh {
if err := db.flushMemtable(sl); err != nil {
log.Printf("flush error: %v", err)
}
}
}()
}
// flushMemtable converts memtable to SSTable
func (db *DB) flushMemtable(mt *SkipList) error {
// Create SSTable writer
sstPath := filepath.Join(db.dataDir,
fmt.Sprintf("sstable-%d.sst", time.Now().UnixNano()))
writer, err := NewSSTWriter(sstPath)
if err != nil {
return err
}
defer writer.Close()
// Iterate and write all entries
iter := mt.NewIterator()
for iter.Next() {
writer.Add(iter.Key(), iter.Value())
}
return writer.Finish()
}
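flushMemtable relies on a sorted iterator over the skip list (NewIterator, Next, Key, Value) and on an SSTable writer (NewSSTWriter, Add, Finish). The writer is one of the lab tasks below; the iterator can simply walk level 0 of the skip list, which is already in key order. A minimal sketch:

// Iterator walks level 0 of a skip list in ascending key order.
// No locking is needed here because the flusher only iterates the
// immutable memtable, which no longer receives writes.
type Iterator struct {
	node *SkipNode
}

// NewIterator positions the iterator just before the first entry.
func (sl *SkipList) NewIterator() *Iterator {
	return &Iterator{node: sl.head}
}

// Next advances to the next entry and reports whether one exists.
func (it *Iterator) Next() bool {
	it.node = it.node.next[0]
	return it.node != nil
}

func (it *Iterator) Key() []byte   { return it.node.key }
func (it *Iterator) Value() []byte { return it.node.value }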
Hands-on Lab: Build a memtable with automatic flushing
Implement a complete memtable system with Skip Lists and background flushing.
Lab Tasks:
1. Implement the Skip List data structure
2. Build the dual memtable system (active + immutable)
3. Add automatic flush triggers
4. Implement the background flushing goroutine
5. Create an SSTable writer for persistence
6. Write comprehensive tests
7. Benchmark concurrent performance
Summary & Learning Outcomes
What You've Learned:
- LSM Tree architecture and multi-level design
- Skip List implementation for sorted memtables
- SSTable format and advantages over WAL
- Dual memtable design for concurrent access
- Background flushing mechanisms
- Read/write amplification trade-offs
Key Takeaways
- LSM Trees optimize for writes while maintaining read efficiency
- Skip Lists provide O(log n) sorted operations with a simple implementation
- Dual memtables let writes continue while older data is being flushed
- Background flushing keeps writes non-blocking
- SSTables enable efficient range queries and indexing