VACUUM, autovacuum and table bloat
PostgreSQL never updates a row in place. Every UPDATE writes a new row version and marks the old one dead; every DELETE just marks. The space is reclaimed later, asynchronously, by VACUUM. When that loop works, you never think about it. When it falls behind, tables and indexes silently grow — that's bloat — and every query pays for it.
Why dead rows exist at all: MVCC
Under MVCC (multi-version concurrency control), readers never block writers and vice versa, because each transaction sees a consistent snapshot: old row versions are kept around as long as some transaction might still need them. The moment no active transaction can see a dead version, it becomes garbage — but PostgreSQL doesn't reclaim it inline. That's VACUUM's job.
Two immediate consequences:
- An
UPDATE-heavy workload writes roughly as much as anINSERT-heavy one, plus index churn. - A single long-running transaction holds back cleanup for the whole database — VACUUM cannot remove any row version that transaction's snapshot might still see, regardless of which tables it touched. A forgotten
idle in transactionsession on Friday means a bloated database on Monday.
What VACUUM actually does (and doesn't)
- Plain
VACUUMscans for dead tuples, makes their space reusable for future writes in the same table, updates the free space map and the visibility map, and freezes old tuples to prevent transaction-ID wraparound. It runs alongside normal traffic. - It does not shrink the file on disk (except for the special case of completely empty pages at the very end of the table). A table that once bloated to 50 GB stays 50 GB on disk even if it's 90% empty inside.
VACUUM FULLrewrites the table into a new compact file and returns space to the OS — but takes anACCESS EXCLUSIVElock for the whole rewrite. On a big, busy table that's an outage; look at pg_repack for online rebuilds.ANALYZE(often run together) refreshes planner statistics — a different job that also matters, see reading EXPLAIN ANALYZE.
How autovacuum decides to run
The autovacuum launcher periodically checks each table against a threshold:
vacuum when: n_dead_tup > autovacuum_vacuum_threshold
+ autovacuum_vacuum_scale_factor × reltuples
-- defaults: 50 + 0.20 × reltuples (20% of the table)
The default 20% scale factor is fine for small tables and terrible for big ones: a 100-million-row table accumulates 20 million dead rows before autovacuum even starts. For large, hot tables, set a per-table override:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 10000
);
Autovacuum is also deliberately throttled by cost-based delays (autovacuum_vacuum_cost_delay / autovacuum_vacuum_cost_limit) so it doesn't saturate I/O. On modern hardware the defaults are conservative; if autovacuum runs constantly but never catches up, raising autovacuum_vacuum_cost_limit is usually the first lever.
Detecting bloat before it hurts
SELECT relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
n_dead_tuplarge and growing, withlast_autovacuumold or NULL → autovacuum isn't keeping up (or a long transaction is pinning the horizon: checkpg_stat_activityfor oldxact_start).- Query-side symptom: buffer counts out of proportion to rows returned in
EXPLAIN (ANALYZE, BUFFERS)— reading 40,000 pages for 500 rows means scanning mostly dead space. - Size-side check: compare
pg_total_relation_size()over time, or use thepgstattupleextension for an exact dead-space percentage on a suspect table.
HOT updates: the free lunch worth engineering for
If an UPDATE changes only columns that are not indexed, and the new version fits on the same page, PostgreSQL performs a HOT (heap-only tuple) update: no index entries are written at all, and the dead version can be cleaned up cheaply within the page. This is why adding "just one more index" on a frequently updated column can wreck write performance — it turns every HOT update into a full one. A lower fillfactor (e.g. 80–90 for update-heavy tables) leaves room on each page and raises the HOT rate.
Rules of thumb
- Never disable autovacuum. If it hurts, it's under-tuned, not unnecessary — and it also protects you from transaction-ID wraparound, which is a database-stopping event.
- Kill or timeout idle transactions:
idle_in_transaction_session_timeoutis cheap insurance. - Tune per-table, not globally: a handful of hot tables usually cause most of the pain.
- Bloat you already have won't go away on its own: plain VACUUM stops the growth; reclaiming disk needs
VACUUM FULLor an online rebuild.
EXPLAIN (ANALYZE, BUFFERS) and drop the output into the EXPLAIN Visualizer to see which node is doing the excess reading.
