SQL Indexing Fundamentals: When (and What) to Index
Indexes are the single biggest performance lever in SQL. They're also the most over- and under-applied. Here's a practical framework for deciding what to index and what to leave alone.
Indexes turn linear table scans into logarithmic lookups. The cost is extra storage, slower writes, and one more thing to maintain. Knowing exactly when that trade-off is worth it is what separates someone who writes SQL from someone who runs a database.
What an index actually is
An index is a separate data structure (almost always a B-tree) that maps indexed values to the row locations in the underlying table. When you query WHERE email = '...' on an indexed email column, the database walks the B-tree in O(log n) time instead of scanning every row in O(n).
For a table with 10 million rows: a B-tree index lookup takes roughly 24 comparisons; a full table scan reads all 10M rows. Even on fast SSDs, that's the difference between 1 millisecond and 5 seconds.
The four cases you almost always want to index
- Foreign keys. Most databases don't auto-index foreign keys. Without an index, JOIN operations and parent-side cascades go quadratic.
- Columns in WHERE clauses on tables with >10K rows where the column has reasonable selectivity (returns <5% of rows).
- ORDER BY columns when you don't want to sort the entire result set in memory.
- UNIQUE constraints. Most databases auto-create an index for these — don't rely on it, confirm.
Selectivity: the core concept
Selectivity = (distinct values) / (total rows). High selectivity means a query returns few rows; low selectivity means many.
- Highly selective (good for indexing): email, user_id, order_number. Indexing pays off.
- Mildly selective: category_id (10–100 distinct values). Useful in composite indexes.
- Low selectivity (rarely worth indexing alone): boolean is_active, status enum with 3 values. Index lookup may be slower than table scan because the database has to load both the index and many rows from the table.
Rule of thumb: if a query returns more than 5–10% of the table, the optimizer often skips the index even if you have one — full scan with sequential I/O wins.
Read EXPLAIN, then read it again
EXPLAIN ANALYZE (PostgreSQL) or EXPLAIN (MySQL) on every important query. Look for "Seq Scan" / "Full Table Scan" — that's where indexes are missing or being ignored. "Index Scan" / "ref" means the index is in use.Composite indexes: order matters
A composite index on (column_a, column_b) can answer queries on column_a alone OR (column_a AND column_b), but NOT column_b alone. Think of it like a phone book sorted by last_name, then first_name — efficient for looking up "Smith" or "Smith, John" but useless for "find all the Johns."
Pick column order by selectivity (most selective first) and by query patterns (most-used columns first).
Covering indexes
A covering index includes all the columns a query needs, so the database can answer the query from the index alone without touching the underlying table. Massive speedup when the table rows are large.
Costs that bite
Every index has costs. They are usually invisible until they're not.
- Storage: each index roughly equals the size of the indexed columns × number of rows. A big table with five indexes can use 5–10x its base storage.
- Write amplification: every INSERT/UPDATE/DELETE has to update every applicable index. A table with 10 indexes has 11 writes per logical write.
- Bloat: in PostgreSQL, deleted rows leave dead tuples in indexes that are reclaimed only by VACUUM. Heavy update workloads need maintenance attention.
- Plan complexity: with many indexes, the query planner can pick suboptimal ones. Sometimes dropping a "useful-looking" index makes a critical query faster.
What NOT to index
- Tables under ~1000 rows. Full scans are essentially free at that size.
- Columns with very low selectivity (booleans, 3-value enums) on their own.
- Columns you write to constantly but rarely query by. Pure overhead.
- Wide string columns like long URLs or descriptions. Indexing eats storage. If you must index, hash the column and index the hash, or use partial indexes.
Special index types worth knowing
- Partial index: indexes only the rows matching a WHERE clause. Useful when most queries target a subset (e.g., "WHERE deleted_at IS NULL").
- Expression index: indexes the result of a function call. Lets
WHERE LOWER(email) = ?use an index. - GIN / GiST (Postgres): for full-text search, JSONB columns, arrays.
- Hash index: O(1) lookup but only supports equality. Most use cases prefer B-trees.
- BRIN (Postgres): tiny indexes for huge tables where data is naturally ordered (time-series). Trade some lookup speed for massive storage savings.
The audit process
Every six months, audit your indexes:
- Pull stats on which indexes are being scanned (PostgreSQL:
pg_stat_user_indexes). - Drop indexes with zero scans over the last 30 days. They're paying write costs for no read benefit.
- Look for foreign keys without indexes — query
information_schemato find them. - Look for redundant indexes: an index on (a) is redundant if you have an index on (a, b).
- Run EXPLAIN on slow-query log entries; add or restructure indexes accordingly.
Add one at a time, measure each
Key Takeaways
- Always index foreign keys, columns in WHERE clauses on big tables, and ORDER BY columns.
- Selectivity matters: low-cardinality columns rarely benefit from a single-column index.
- Composite indexes follow leftmost-prefix rules — column order is everything.
- Every index taxes writes. Audit unused indexes every 6 months and drop the dead weight.
- Use EXPLAIN ANALYZE to confirm the index is actually being used — never assume.