← all lessons
System Design · Storage & Indexing ·

Skip Lists: Probabilistic Indexing Without Tree Rebalancing

- Core concept — A skip list is a layered linked list where each element can appear on multiple "express lanes," letting you skip over many nodes at once during a search. The bottom layer holds every element in sorted order; each layer above is a random subset of the one below, roughly halving the number of nodes per layer. Searching starts at the top (sparse, fast) layer and drops down when it overshoots, reaching any element in O(log n) time on average — the same as a balanced binary tree (a tree where every subtree is roughly equal in size). The magic is that you never need to rebalance: when you insert a node, you flip a coin repeatedly to decide how many layers it should join, keeping the structure probabilistically balanced without any rotations or complex bookkeeping.

- Diagram

flowchart TD
    A["Layer 2 (sparse): 3 → 21"] -->|drop down| B
    B["Layer 1 (medium): 3 → 9 → 21"] -->|drop down| C
    C["Layer 0 (full): 3 → 6 → 9 → 14 → 21 → 25"]
    D["Search: 14"] -->|start top, skip to 21, too far| A
    A -->|drop to Layer 1, skip to 9| B
    B -->|drop to Layer 0, walk to 14| C

- Concrete real-world example — Redis (an in-memory key-value store) uses a skip list as the backbone of its sorted set data type (a collection of items each with a numeric score). When you call ZRANGE to fetch the top 10 leaderboard entries, Redis walks the skip list from the highest score downward, collecting results in O(log n + k) time where k is the number of results returned. Redis chose a skip list here over a balanced BST (binary search tree, a sorted tree structure) precisely because range scans — reading a contiguous slice of sorted data — are trivially a pointer walk along the bottom layer, and concurrent access is easier to implement correctly without complex rebalancing locks.

- One trade-off / gotcha — A skip list's O(log n) performance is probabilistic, not guaranteed. If the coin flips go badly during many consecutive inserts, a node might appear in too many or too few layers, degrading performance toward O(n) in the worst case. In practice this is vanishingly rare (the probability of O(n) behavior drops exponentially), but it means you can't give a hard real-time latency bound the way a deterministic B-tree (a tree that always stays balanced by splitting nodes) can.

- An interview-style question to ponder — A system stores 50 million user events sorted by timestamp. Engineers debate using a skip list versus a B-tree index for this column. The workload is 80% range scans ("give me all events between time A and time B") and 20% point lookups. Which structure would you lean toward and why?

Stuck? Show a hint

Think about where each structure's real cost lives: it's not just lookup speed, but what happens physically when you read a range of values, and how often the structure has to "fix itself" after an insert.

Show answer

For this workload, lean toward a B-tree index.

  • B-trees store data in contiguous pages on disk (blocks of data read together in one I/O operation), so a range scan reads physically sequential pages — one disk read can pull in hundreds of sorted records at once. Skip lists store nodes as individually allocated heap objects scattered in memory or on disk, so a range scan is a pointer-chase through potentially non-contiguous locations, multiplying I/O.
  • With 80% of queries being range scans over 50 million rows, that sequential access advantage compounds heavily — benchmarks often show 5–10× fewer disk reads for range queries on a B-tree versus a skip list backed to disk.
  • B-trees also give deterministic O(log n) worst-case guarantees, which matters for predictable p99 latency (the slowest 1-in-100 request) under a heavy write burst that might skew coin flips in a skip list.
  • But why not just use a skip list since it avoids rebalancing overhead? Rebalancing in a B-tree only splits or merges pages at the leaf level and rarely propagates upward; in practice the write amplification is modest and the range-scan locality benefit far outweighs it for disk-backed workloads. Skip lists shine when your data lives entirely in memory and you need simple concurrent writes — exactly Redis's situation.
  • Watch out: if this workload moves to an LSM-tree-backed store (a storage engine that batches writes into sorted files), a skip list is actually used internally as the in-memory write buffer before data is flushed — so the two aren't always in competition.