SQL Indexing Fundamentals: When (and What) to Index

Indexes turn linear table scans into logarithmic lookups. The cost is extra storage, slower writes, and one more thing to maintain. Knowing exactly when that trade-off is worth it is what separates someone who writes SQL from someone who runs a database.

What an index actually is

An index is a separate data structure (almost always a B-tree) that maps indexed values to the row locations in the underlying table. When you query WHERE email = '...' on an indexed email column, the database walks the B-tree in O(log n) time instead of scanning every row in O(n).

For a table with 10 million rows: a B-tree index lookup takes roughly 24 comparisons; a full table scan reads all 10M rows. Even on fast SSDs, that's the difference between 1 millisecond and 5 seconds.

The four cases you almost always want to index

Foreign keys. Most databases don't auto-index foreign keys. Without an index, JOIN operations and parent-side cascades go quadratic.
Columns in WHERE clauses on tables with >10K rows where the column has reasonable selectivity (returns <5% of rows).
ORDER BY columns when you don't want to sort the entire result set in memory.
UNIQUE constraints. Most databases auto-create an index for these — don't rely on it, confirm.

Selectivity: the core concept

Selectivity = (distinct values) / (total rows). High selectivity means a query returns few rows; low selectivity means many.

Highly selective (good for indexing): email, user_id, order_number. Indexing pays off.
Mildly selective: category_id (10–100 distinct values). Useful in composite indexes.
Low selectivity (rarely worth indexing alone): boolean is_active, status enum with 3 values. Index lookup may be slower than table scan because the database has to load both the index and many rows from the table.

Rule of thumb: if a query returns more than 5–10% of the table, the optimizer often skips the index even if you have one — full scan with sequential I/O wins.

Read EXPLAIN, then read it again

Run EXPLAIN ANALYZE (PostgreSQL) or EXPLAIN (MySQL) on every important query. Look for "Seq Scan" / "Full Table Scan" — that's where indexes are missing or being ignored. "Index Scan" / "ref" means the index is in use.

Composite indexes: order matters

A composite index on (column_a, column_b) can answer queries on column_a alone OR (column_a AND column_b), but NOT column_b alone. Think of it like a phone book sorted by last_name, then first_name — efficient for looking up "Smith" or "Smith, John" but useless for "find all the Johns."

CREATE INDEX idx_orders_user_status ON orders (user_id, status); -- Uses the index efficiently: SELECT * FROM orders WHERE user_id = 42; SELECT * FROM orders WHERE user_id = 42 AND status = 'pending'; -- Does NOT use the index (status is the second column): SELECT * FROM orders WHERE status = 'pending';

Pick column order by selectivity (most selective first) and by query patterns (most-used columns first).

Covering indexes

A covering index includes all the columns a query needs, so the database can answer the query from the index alone without touching the underlying table. Massive speedup when the table rows are large.

-- Postgres: include columns in the index CREATE INDEX idx_orders_user_covered ON orders (user_id) INCLUDE (status, total); -- This query is now answered entirely from the index: SELECT user_id, status, total FROM orders WHERE user_id = 42;

Costs that bite

Every index has costs. They are usually invisible until they're not.

Storage: each index roughly equals the size of the indexed columns × number of rows. A big table with five indexes can use 5–10x its base storage.
Write amplification: every INSERT/UPDATE/DELETE has to update every applicable index. A table with 10 indexes has 11 writes per logical write.
Bloat: in PostgreSQL, deleted rows leave dead tuples in indexes that are reclaimed only by VACUUM. Heavy update workloads need maintenance attention.
Plan complexity: with many indexes, the query planner can pick suboptimal ones. Sometimes dropping a "useful-looking" index makes a critical query faster.

What NOT to index

Tables under ~1000 rows. Full scans are essentially free at that size.
Columns with very low selectivity (booleans, 3-value enums) on their own.
Columns you write to constantly but rarely query by. Pure overhead.
Wide string columns like long URLs or descriptions. Indexing eats storage. If you must index, hash the column and index the hash, or use partial indexes.

Special index types worth knowing

Partial index: indexes only the rows matching a WHERE clause. Useful when most queries target a subset (e.g., "WHERE deleted_at IS NULL").
Expression index: indexes the result of a function call. Lets WHERE LOWER(email) = ? use an index.
GIN / GiST (Postgres): for full-text search, JSONB columns, arrays.
Hash index: O(1) lookup but only supports equality. Most use cases prefer B-trees.
BRIN (Postgres): tiny indexes for huge tables where data is naturally ordered (time-series). Trade some lookup speed for massive storage savings.

The audit process

Every six months, audit your indexes:

Pull stats on which indexes are being scanned (PostgreSQL: pg_stat_user_indexes).
Drop indexes with zero scans over the last 30 days. They're paying write costs for no read benefit.
Look for foreign keys without indexes — query information_schema to find them.
Look for redundant indexes: an index on (a) is redundant if you have an index on (a, b).
Run EXPLAIN on slow-query log entries; add or restructure indexes accordingly.

Add one at a time, measure each

It's tempting to "index everything" while you're tuning. Don't. Add one index, run your slow queries, measure the difference, then decide whether the next one is worth it. Half the indexes you think you need turn out to be unused or counterproductive when you measure.

Key Takeaways

Always index foreign keys, columns in WHERE clauses on big tables, and ORDER BY columns.
Selectivity matters: low-cardinality columns rarely benefit from a single-column index.
Composite indexes follow leftmost-prefix rules — column order is everything.
Every index taxes writes. Audit unused indexes every 6 months and drop the dead weight.
Use EXPLAIN ANALYZE to confirm the index is actually being used — never assume.