SSTables and LSM Trees

Learning Objectives

  • Understand LSM Tree architecture and its benefits
  • Implement Skip Lists for sorted memtables
  • Build Sorted String Tables (SSTables) for disk storage
  • Master dual memtable design for concurrent access
  • Implement background flushing mechanisms

Lesson 5.1: LSM Tree Architecture

The Problem with Append-Only Logs

The current system has limitations that LSM Trees solve:

Current System Issues:

  • Data stored in memory (HashMap) - not sorted
  • All writes logged to disk (WAL) - not indexed
  • No efficient range queries
  • Recovery must replay entire log

In-Memory:
  "user:1" → "alice"
  "user:2" → "bob"
  "user:3" → "charlie"
  (randomly ordered)

To find all keys between "user:1" and "user:3":
  └─ Must scan all keys (no index)
  └─ O(n) operation

To persist to disk efficiently:
  └─ WAL is append-only, not indexed
  └─ Recovery must replay entire log
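
To see the range-query cost concretely, here is a small Go sketch (the rangeScan helper and map-based store are illustrative, not part of the system): with an unsorted map, every key must be visited and the matches sorted after the fact.

package main

import "sort"

// rangeScan finds all keys in [lo, hi] in an unsorted map.
// Go maps have no iteration order, so this is O(n) plus a sort.
func rangeScan(store map[string]string, lo, hi string) []string {
    var matches []string
    for k := range store { // must touch every key
        if lo <= k && k <= hi {
            matches = append(matches, k)
        }
    }
    sort.Strings(matches) // order must be imposed afterwards
    return matches
}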

LSM Tree Overview

LSM = Log-Structured Merge Tree
The key insight: optimize for writes first, and keep reads acceptable through sorting, indexes, and background merging.

┌────────────────────────────────────┐
│         In-Memory (Memtable)       │
│    Sorted structure (SkipList)     │
│  Size limit: 2MB                   │
└────────────────────────────────────┘
           │
           │ (When full)
           ▼
┌────────────────────────────────────┐
│    Level 0: SSTables (on disk)     │
│  Multiple small files (0-4)        │
└────────────────────────────────────┘
           │
           │ (Compaction)
           ▼
┌────────────────────────────────────┐
│    Level 1: SSTables (on disk)     │
│  Fewer, larger files (10-20MB)     │
└────────────────────────────────────┘
           │
           │ (Compaction)
           ▼
┌────────────────────────────────────┐
│    Level 2: SSTables (on disk)     │
│  Even fewer, even larger files     │
└────────────────────────────────────┘
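
As a rough sketch of the shapes involved (these Go type and field names are assumptions for illustration, not a prescribed API):

// LSMTree: one mutable, sorted structure in memory, plus
// immutable sorted files on disk, grouped into levels.
type LSMTree struct {
    memtable *SkipList    // in-memory writes land here first
    levels   [][]*SSTable // levels[0] = newest, deeper = older/larger
}

// SSTable describes one immutable on-disk file by its key range.
type SSTable struct {
    path   string
    minKey []byte
    maxKey []byte
}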

Memtable and Immutable Memtable

Current problem: with a single memtable, writes stall whenever the memtable is being flushed to disk, and readers contend with writers.

Solution: Dual memtables

Writer goroutines:
  └─ Write to Active memtable

Reader goroutines:
  ├─ Search Active memtable
  └─ Search Immutable memtable

Flusher goroutine (background):
  ├─ When Active full
  ├─ Swap: Active → Immutable
  ├─ Create new Active
  └─ Flush Immutable to disk

Timeline:
t0: Active = MT0 (empty), Immutable = none
t1: Writes to MT0
t2: MT0 full → Immutable = MT0, Active = MT1 (new)
t3: Writes to MT1
t4: Flusher flushes MT0 to disk → Immutable = none
t5: MT1 full → Immutable = MT1, Active = MT2 (new)

Benefits:

  • Writes don't block on disk I/O
  • Reads can search both memtables
  • Flushing happens in background

Sorted String Tables (SSTables)

What is an SSTable? A file on disk containing sorted key-value pairs.

SSTable format:
┌──────────────────────────────┐
│ Header                       │
│ - Version: 1                 │
│ - EntryCount: 1000           │
│ - MinKey: "user:100"         │
│ - MaxKey: "user:999"         │
│ - BloomFilter: (optional)    │
├──────────────────────────────┤
│ Data Blocks                  │
│ Block 0: Entries 0-99        │
│ Block 1: Entries 100-199     │
│ ... (more blocks)            │
│ Block N: Entries 900-999     │
├──────────────────────────────┤
│ Index Block                  │
│ - Block 0 key range          │
│ - Block 1 key range          │
│ - ...                        │
└──────────────────────────────┘
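
As a sketch of how the header portion might be serialized (the SSTHeader type and encoding below are assumptions; a full format would also record block offsets and the optional bloom filter):

import (
    "encoding/binary"
    "io"
)

// SSTHeader mirrors the diagram above: fixed-width counters
// followed by length-prefixed min/max keys.
type SSTHeader struct {
    Version    uint32
    EntryCount uint32
    MinKey     []byte
    MaxKey     []byte
}

func writeHeader(w io.Writer, h SSTHeader) error {
    fixed := []uint32{h.Version, h.EntryCount,
        uint32(len(h.MinKey)), uint32(len(h.MaxKey))}
    for _, v := range fixed {
        if err := binary.Write(w, binary.LittleEndian, v); err != nil {
            return err
        }
    }
    if _, err := w.Write(h.MinKey); err != nil {
        return err
    }
    _, err := w.Write(h.MaxKey)
    return err
}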

Advantages over WAL:

  • Sorted keys enable range queries
  • Index allows binary search
  • Immutable (no overwrites)
  • Easy to delete entire file

Multi-level Compaction

Compaction Strategy:

Level 0 (Size: 4MB, Files: 4)
├─ sstable-0.db (1MB)
├─ sstable-1.db (1MB)
├─ sstable-2.db (1MB)
└─ sstable-3.db (1MB)

When adding 5th file → Compact to Level 1:
  └─ Merge all 4 files + new file
  └─ Write to Level 1 (5MB)

Level 1 (Size: 10MB, Files: 2)
├─ sstable-0-1.db (5MB)
└─ sstable-1-1.db (5MB)

When Level 1 exceeds its 10MB limit → Compact to Level 2:
  └─ Merge overlapping files
  └─ Write to Level 2 (~100MB capacity)

Pattern:
├─ Level 0: Small, recent data
├─ Level 1: Larger, older data
├─ Level 2: Large, very old data
└─ Each level ~10x larger than previous
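
A possible trigger check for this pattern; the thresholds match the diagram above but are assumptions, not fixed constants from any real system:

const (
    l0FileLimit     = 4                // Level 0 compacts by file count
    l1ByteLimit     = 10 * 1024 * 1024 // deeper levels compact by size
    levelSizeFactor = 10               // each level ~10x the previous
)

// maxBytesForLevel returns the size budget for a level >= 1.
func maxBytesForLevel(level int) int64 {
    size := int64(l1ByteLimit)
    for i := 1; i < level; i++ {
        size *= levelSizeFactor
    }
    return size
}

// needsCompaction decides whether a level should be merged downward.
func needsCompaction(level, fileCount int, totalBytes int64) bool {
    if level == 0 {
        return fileCount > l0FileLimit // too many small files
    }
    return totalBytes > maxBytesForLevel(level)
}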

Read and Write Amplification

Write Amplification

How many times data is written:

Without LSM (direct to disk):
  Write amplification: 1x
  (data written once)

With LSM (simplified):
  Level 0: Write once
  Compact to L1: Rewrite once
  Compact to L2: Rewrite again
  Total: ~3-5x

The exact factor depends on the compaction strategy:
  └─ Leveled compaction: ~10x (more rewriting, fewer files to read)
  └─ Tiered compaction: ~2-5x (less rewriting, more files to read)

Read Amplification

How many files checked per read:

Without structure (WAL):
  Read amplification: N (scan all)
  
With LSM:
  Read key "user:500":
  1. Check memtable (in-memory, O(log n))
  2. Check Level 0 (~4 files, sorted)
  3. Use index/bloom filter
  4. Check Level 1 (1 file via binary search)
  
  Total: ~5-10 seeks (vs scanning N files)
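
Putting the steps together, a sketch of the read path, assuming the LSMTree and SSTable types sketched in Lesson 5.1 plus a per-file get helper (not shown) that consults the bloom filter and index:

// get checks newest data first; deeper levels hold older data.
func (t *LSMTree) get(key []byte) ([]byte, bool) {
    // Step 1: the memtable (in-memory, O(log n))
    if v, ok := t.memtable.Get(key); ok {
        return v, true
    }
    // Steps 2-4: only open files whose key range can contain key;
    // in levels >= 1 ranges don't overlap, so at most one file matches
    for _, level := range t.levels {
        for _, sst := range level {
            if bytes.Compare(key, sst.minKey) < 0 ||
                bytes.Compare(key, sst.maxKey) > 0 {
                continue // key outside this file's range
            }
            if v, ok := sst.get(key); ok { // assumed per-file lookup
                return v, true
            }
        }
    }
    return nil, false
}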

Trade-off:

LSM accepts write amplification (each entry is rewritten during compaction) in exchange for bounded read amplification (a handful of sorted, indexed files to check instead of a full scan).

Lesson 5.2: Memtable Implementation

Skip List Data Structure

We need a sorted, concurrent data structure for the memtable.

Option 1: Red-Black Tree

     50
    /  \
   30   70
  /  \
 20   40
  • Balanced: O(log n)
  • Sorted: yes
  • Downside: complex rebalancing logic

Option 2: Skip List

Level 3:              17 ──────────────────────────────→
Level 2:        10 ── 17 ── 26 ────────────────────────→
Level 1:   6 ── 10 ── 17 ── 26 ── 42 ──────────────────→
Level 0:   6 ── 10 ── 17 ── 26 ── 42 ── 73 ── 99

  • Probabilistic balancing: O(log n) average
  • Sorted: yes
  • Lock-free variants possible
  • Simple to implement

Skip List Implementation

package storage

import (
    "bytes"
    "math/rand"
    "sync"
)

const (
    MaxLevel = 16
    pFactor  = 0.5
)

// bytesLess reports whether a sorts strictly before b.
func bytesLess(a, b []byte) bool {
    return bytes.Compare(a, b) < 0
}

// bytesEqual reports whether a and b hold the same bytes.
func bytesEqual(a, b []byte) bool {
    return bytes.Equal(a, b)
}

type SkipNode struct {
    key   []byte
    value []byte
    next  []*SkipNode
}

type SkipList struct {
    head     *SkipNode
    level    int
    mu       sync.RWMutex
    size     int64
}

// NewSkipList creates empty skip list
func NewSkipList() *SkipList {
    head := &SkipNode{
        key:  nil,
        next: make([]*SkipNode, MaxLevel),
    }
    return &SkipList{
        head:  head,
        level: 1,
        size:  0,
    }
}

// randomLevel generates random level for new node
func randomLevel() int {
    level := 1
    for level < MaxLevel && rand.Float64() < pFactor {
        level++
    }
    return level
}

// Put inserts or updates a key
func (sl *SkipList) Put(key, value []byte) {
    sl.mu.Lock()
    defer sl.mu.Unlock()
    
    level := randomLevel()
    update := make([]*SkipNode, MaxLevel)
    current := sl.head
    
    // Find update nodes at each level
    for i := sl.level - 1; i >= 0; i-- {
        for current.next[i] != nil &&
            bytesLess(current.next[i].key, key) {
            current = current.next[i]
        }
        update[i] = current
    }
    
    // Check if key already exists; if so, update in place
    if next := current.next[0]; next != nil && bytesEqual(next.key, key) {
        next.value = value
        return
    }
    
    // If the new node is taller than the list, the head is its
    // predecessor at the new levels; this must happen before the
    // splice below, or update[i] would be nil there
    if level > sl.level {
        for i := sl.level; i < level; i++ {
            update[i] = sl.head
        }
        sl.level = level
    }
    
    // Create new node
    newNode := &SkipNode{
        key:   key,
        value: value,
        next:  make([]*SkipNode, level),
    }
    
    // Splice into each level
    for i := 0; i < level; i++ {
        newNode.next[i] = update[i].next[i]
        update[i].next[i] = newNode
    }
    
    sl.size++
}
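
The memtable and flush code later in this lesson call sl.Get and mt.NewIterator, which the listing above leaves out. Minimal versions consistent with the structure above:

// Get returns the value for key by descending from the top level.
func (sl *SkipList) Get(key []byte) ([]byte, bool) {
    sl.mu.RLock()
    defer sl.mu.RUnlock()
    
    current := sl.head
    for i := sl.level - 1; i >= 0; i-- {
        for current.next[i] != nil && bytesLess(current.next[i].key, key) {
            current = current.next[i]
        }
    }
    if c := current.next[0]; c != nil && bytesEqual(c.key, key) {
        return c.value, true
    }
    return nil, false
}

// Iterator walks level 0 (the fully linked list) in sorted order.
type Iterator struct {
    node *SkipNode
}

func (sl *SkipList) NewIterator() *Iterator {
    return &Iterator{node: sl.head} // head is a sentinel; Next() skips it
}

func (it *Iterator) Next() bool {
    it.node = it.node.next[0]
    return it.node != nil
}

func (it *Iterator) Key() []byte   { return it.node.key }
func (it *Iterator) Value() []byte { return it.node.value }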

Flush Triggers and Thresholds

// Memtable with auto-flush
type Memtable struct {
    active      *SkipList
    immutable   *SkipList
    mu          sync.RWMutex
    
    maxSize     int64
    currentSize int64
    
    flushCh     chan *SkipList
}

// Put with auto-flush
func (m *Memtable) Put(key, value []byte) error {
    m.mu.Lock()
    
    entrySize := int64(len(key) + len(value))
    
    // Swap memtables if this entry would push us past the limit
    if m.currentSize+entrySize > m.maxSize {
        m.immutable = m.active
        m.active = NewSkipList()
        m.currentSize = 0
        
        // Signal flush
        select {
        case m.flushCh <- m.immutable:
        default:
            // Flush queue full: drop the lock and block until the
            // flusher catches up (backpressure). Simplified: another
            // writer could swap again while we wait.
            m.mu.Unlock()
            m.flushCh <- m.immutable
            m.mu.Lock()
        }
    }
    
    m.active.Put(key, value)
    m.currentSize += entrySize
    m.mu.Unlock()
    
    return nil
}

// Get checks both memtables
func (m *Memtable) Get(key []byte) ([]byte, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    
    // Check active first (newest data wins)
    if v, ok := m.active.Get(key); ok {
        return v, ok
    }
    
    // Check immutable (nil until the first swap)
    if m.immutable != nil {
        if v, ok := m.immutable.Get(key); ok {
            return v, ok
        }
    }
    
    return nil, false
}
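
The listing above never constructs a Memtable; note that flushCh should be buffered so the non-blocking send in Put usually succeeds. A plausible constructor (the signature is an assumption):

// NewMemtable creates an empty dual-memtable. A flush queue of
// depth 1 lets one flush be pending without blocking writers.
func NewMemtable(maxSize int64) *Memtable {
    return &Memtable{
        active:  NewSkipList(),
        maxSize: maxSize,
        flushCh: make(chan *SkipList, 1),
    }
}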

Background Flushing Goroutines

// StartFlusher starts background flushing
func (db *DB) StartFlusher() {
    go func() {
        for sl := range db.memtable.flushCh {
            if err := db.flushMemtable(sl); err != nil {
                log.Printf("flush error: %v", err)
            }
        }
    }()
}

// flushMemtable converts memtable to SSTable
func (db *DB) flushMemtable(mt *SkipList) error {
    // Create SSTable writer
    sstPath := filepath.Join(db.dataDir, 
        fmt.Sprintf("sstable-%d.sst", time.Now().UnixNano()))
    
    writer, err := NewSSTWriter(sstPath)
    if err != nil {
        return err
    }
    defer writer.Close()
    
    // Iterate and write all entries (the iterator yields them in
    // sorted order, which the SSTable format requires)
    iter := mt.NewIterator()
    for iter.Next() {
        if err := writer.Add(iter.Key(), iter.Value()); err != nil {
            return err
        }
    }
    
    return writer.Finish()
}
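
NewSSTWriter is used above but not defined in this lesson. As a stand-in, here is a minimal length-prefixed writer; the names and format are assumptions, much simpler than the block/index layout described in Lesson 5.1:

import (
    "bufio"
    "encoding/binary"
    "os"
)

// SSTWriter appends length-prefixed key/value pairs to a file.
// Callers must Add entries in sorted order (the skip list
// iterator already yields them sorted).
type SSTWriter struct {
    f *os.File
    w *bufio.Writer
}

func NewSSTWriter(path string) (*SSTWriter, error) {
    f, err := os.Create(path)
    if err != nil {
        return nil, err
    }
    return &SSTWriter{f: f, w: bufio.NewWriter(f)}, nil
}

// Add writes one entry as: keyLen | key | valueLen | value.
func (s *SSTWriter) Add(key, value []byte) error {
    for _, b := range [][]byte{key, value} {
        if err := binary.Write(s.w, binary.LittleEndian, uint32(len(b))); err != nil {
            return err
        }
        if _, err := s.w.Write(b); err != nil {
            return err
        }
    }
    return nil
}

// Finish flushes buffered data and fsyncs so the file survives a crash.
func (s *SSTWriter) Finish() error {
    if err := s.w.Flush(); err != nil {
        return err
    }
    return s.f.Sync()
}

func (s *SSTWriter) Close() error { return s.f.Close() }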

Hands-on Lab: Build memtable with automatic flushing

Implement a complete memtable system with Skip Lists and background flushing.

Lab Tasks:

  1. Implement Skip List data structure
  2. Build dual memtable system (active + immutable)
  3. Add automatic flush triggers
  4. Implement background flushing goroutine
  5. Create SSTable writer for persistence
  6. Write comprehensive tests
  7. Benchmark concurrent performance

Summary & Learning Outcomes

What You've Learned:

  • LSM Tree architecture and multi-level design
  • Skip List implementation for sorted memtables
  • SSTable format and advantages over WAL
  • Dual memtable design for concurrent access
  • Background flushing mechanisms
  • Read/write amplification trade-offs

Key Takeaways

  • LSM Trees optimize for writes while maintaining read efficiency
  • Skip Lists provide O(log n) sorted operations with simple implementation
  • Dual memtables eliminate read/write contention
  • Background flushing keeps writes non-blocking
  • SSTables enable efficient range queries and indexing