Bitmap Indexes: Crushing Analytical Queries With Bitwise Math
Core concept
A bitmap index represents each distinct value in a column as a bit-array (a sequence of 0s and 1s, one per row), where a 1 means "this row has this value." To answer a filter like WHERE gender = 'F' AND country = 'DE', the database ANDs the two bit-arrays together in a single CPU instruction sweep — no row-by-row comparison needed. This makes bitmap indexes blindingly fast for low-cardinality columns (columns with few distinct values, like status, country, or boolean flags) because the bit-arrays stay compact and the bitwise operations are CPU-native. They are the secret weapon behind most analytical (OLAP — systems optimized for complex read queries over large datasets) databases.
Diagram
flowchart TD
A[Column: Status] --> B["Bitmap: 'Active' → 1 0 1 1 0"]
A --> C["Bitmap: 'Inactive' → 0 1 0 0 1"]
D[Column: Country] --> E["Bitmap: 'DE' → 1 1 0 1 0"]
B --> F["AND → 1 0 0 1 0"]
E --> F
F --> G[Matching Rows: 1 and 4]
Concrete real-world example
Imagine a retail data warehouse (a database tuned for analytics, not live transactions) with 500 million order rows. A marketing analyst queries: "Show me all active customers in Germany who bought in Q1." The columns status, country, and quarter each have very few distinct values. With bitmap indexes, the database fetches three compact bit-arrays and ANDs them together — processing 500 million rows in milliseconds using SIMD (CPU instructions that operate on many bits at once) instructions. Apache Druid (a real-time analytics database) and Oracle (a widely-used enterprise database) both use this technique in their analytical engines.
One trade-off / gotcha
Bitmap indexes are catastrophic for high-cardinality columns (many distinct values, like user_id or email). A table with 10 million unique user IDs would need 10 million separate bit-arrays, each nearly as long as the table itself — the index would consume more memory than the data and slow down every write, because inserting or updating one row means flipping bits across potentially millions of arrays.
An interview-style question to ponder
Your team is designing an analytics service for a ride-sharing app. You have a rides table with 2 billion rows. Columns include city (300 distinct values), vehicle_type (4 values), status (3 values), and driver_id (5 million distinct values). A product manager wants sub-second query latency for dashboards filtering by city, vehicle type, and status. Where would you apply bitmap indexes, and what would you do about driver_id queries?
Stuck? Show a hint
Start by separating columns by how many distinct values they have — the cardinality tells you which indexing strategy fits each one, and there's a well-known complementary index type that handles the high end.
Show answer
Apply bitmap indexes to city, vehicle_type, and status; use a B-tree index (a sorted tree structure for efficient range and point lookups) on driver_id.
- Bitmap indexes shine here:
vehicle_type(4 values) andstatus(3 values) produce tiny bit-arrays; evencitywith 300 values is manageable. ANDing all three for a dashboard filter touches maybe a few kilobytes of bit data across 2 billion rows — that's your sub-second path. driver_idwith 5 million distinct values would produce 5 million bit-arrays each 2 billion bits long (~250 MB each) — the index alone would exceed 1 petabyte, which is obviously unworkable. A B-tree handlesdriver_idlookups in O(log n) time with a tiny index footprint.- But why not just use B-trees for everything? A B-tree filter on
citystill has to traverse a tree and fetch row pointers one at a time; combining two B-tree filters requires intersecting large sorted lists of row IDs, which is far slower than a single bitwise AND that a modern CPU executes at ~32 billion bits per second. - Watch out: bitmap indexes carry heavy write amplification — every INSERT or UPDATE must rewrite multiple bit-arrays, so they belong in batch-loaded warehouses, not high-throughput transactional tables.