VACUUM, autovacuum and table bloat

PostgreSQL never updates a row in place. Every UPDATE writes a new row version and marks the old one dead; every DELETE just marks. The space is reclaimed later, asynchronously, by VACUUM. When that loop works, you never think about it. When it falls behind, tables and indexes silently grow — that's bloat — and every query pays for it.

Why dead rows exist at all: MVCC

Under MVCC (multi-version concurrency control), readers never block writers and vice versa, because each transaction sees a consistent snapshot: old row versions are kept around as long as some transaction might still need them. The moment no active transaction can see a dead version, it becomes garbage — but PostgreSQL doesn't reclaim it inline. That's VACUUM's job.

Two immediate consequences:

What VACUUM actually does (and doesn't)

How autovacuum decides to run

The autovacuum launcher periodically checks each table against a threshold:

vacuum when:  n_dead_tup  >  autovacuum_vacuum_threshold
                             + autovacuum_vacuum_scale_factor × reltuples
-- defaults:  50 + 0.20 × reltuples  (20% of the table)

The default 20% scale factor is fine for small tables and terrible for big ones: a 100-million-row table accumulates 20 million dead rows before autovacuum even starts. For large, hot tables, set a per-table override:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 10000
);

Autovacuum is also deliberately throttled by cost-based delays (autovacuum_vacuum_cost_delay / autovacuum_vacuum_cost_limit) so it doesn't saturate I/O. On modern hardware the defaults are conservative; if autovacuum runs constantly but never catches up, raising autovacuum_vacuum_cost_limit is usually the first lever.

Detecting bloat before it hurts

SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  20;

HOT updates: the free lunch worth engineering for

If an UPDATE changes only columns that are not indexed, and the new version fits on the same page, PostgreSQL performs a HOT (heap-only tuple) update: no index entries are written at all, and the dead version can be cleaned up cheaply within the page. This is why adding "just one more index" on a frequently updated column can wreck write performance — it turns every HOT update into a full one. A lower fillfactor (e.g. 80–90 for update-heavy tables) leaves room on each page and raises the HOT rate.

Rules of thumb

Suspect bloat is slowing a query? Run it with EXPLAIN (ANALYZE, BUFFERS) and drop the output into the EXPLAIN Visualizer to see which node is doing the excess reading.