Skip to content

PostgreSQL 例行清理(Routine Vacuuming)学习笔记

📋 概述

PostgreSQL 的例行清理是数据库维护的核心组成部分。在高并发、高频率更新的业务场景中,如电商平台的订单系统、金融系统的交易记录等,理解和掌握清理机制对于保证数据库性能和稳定性至关重要。

本笔记将通过真实业务场景和完整示例,深入解析 PostgreSQL 的清理机制,帮助您在实际工作中有效管理数据库维护。

🎯 核心价值

PostgreSQL 的 VACUUM 命令解决了四个关键问题:

  • 磁盘空间管理:回收删除和更新操作产生的死行空间
  • 查询性能优化:维护统计信息,确保查询计划器做出最优决策
  • 索引效率提升:更新可见性映射,支持仅索引扫描
  • 数据安全保障:防止事务 ID 回绕导致的数据丢失

🔧 1. 清理基础原理

1.1 MVCC 与死行产生机制

PostgreSQL 使用多版本并发控制(MVCC)来处理并发事务。这意味着 UPDATE 和 DELETE 操作不会立即物理删除数据,而是创建新版本或标记为删除。

Syntax error in textmermaid version 11.8.0

1.2 业务场景示例:电商订单系统

假设我们有一个电商平台的订单表,每天处理大量订单状态更新:

sql
-- 创建订单表
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_status VARCHAR(20) DEFAULT 'pending',
    order_amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 插入测试数据
INSERT INTO orders (customer_id, order_amount)
SELECT
    (random() * 10000)::INTEGER,
    (random() * 1000 + 10)::DECIMAL(10,2)
FROM generate_series(1, 100000);

模拟高频更新场景

sql
-- 模拟订单状态变更(产生大量死行)
UPDATE orders
SET order_status = 'processing',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'pending'
AND order_id % 3 = 0;

UPDATE orders
SET order_status = 'shipped',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'processing'
AND order_id % 5 = 0;

UPDATE orders
SET order_status = 'completed',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'shipped'
AND order_id % 2 = 0;

查看表膨胀情况

sql
-- 查看表的页面使用情况
SELECT
    schemaname,
    tablename,
    attname,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'orders';

-- 查看表大小和死行统计
SELECT
    relname,
    n_tup_ins,      -- 插入的行数
    n_tup_upd,      -- 更新的行数
    n_tup_del,      -- 删除的行数
    n_live_tup,     -- 活跃行数
    n_dead_tup,     -- 死行数
    round((n_dead_tup::float / (n_live_tup + n_dead_tup) * 100), 2) as dead_ratio
FROM pg_stat_user_tables
WHERE relname = 'orders';

分析输出示例

 relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | dead_ratio
---------+-----------+-----------+-----------+------------+------------+------------
 orders  |    100000 |     45000 |         0 |     100000 |      45000 |      31.03

当死行比例超过 20% 时,会显著影响查询性能和存储效率,此时应考虑执行 VACUUM 操作。 :::

1.3 VACUUM 类型对比

类型执行速度空间回收锁级别适用场景
VACUUM快速标记可重用较低锁日常维护
VACUUM FULL缓慢完全回收排他锁严重膨胀

🗄️ 2. 磁盘空间回收策略

2.1 标准 VACUUM 的工作原理

Syntax error in textmermaid version 11.8.0

2.2 实践示例:清理订单表

执行标准清理

sql
-- 执行详细清理,查看处理过程
VACUUM (VERBOSE, ANALYZE) orders;

清理输出解析

INFO:  vacuuming "public.orders"
INFO:  "orders": removed 45000 row versions in 500 pages
INFO:  "orders": found 45000 removable, 100000 nonremovable row versions in 1000 pages
INFO:  "orders": truncated 1000 to 800 pages
DETAIL:  CPU: user: 0.15 s, system: 0.05 s, elapsed: 0.25 s

清理效果验证

sql
-- 清理前后对比查询
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as table_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as data_size,
    n_dead_tup,
    round((n_dead_tup::float / (n_live_tup + n_dead_tup) * 100), 2) as dead_ratio
FROM pg_stat_user_tables t
JOIN pg_tables pt ON t.relname = pt.tablename
WHERE t.relname = 'orders';

2.3 业务场景决策矩阵

场景死行比例表大小推荐策略执行频率
订单系统10-20%< 10GB标准 VACUUM每日
日志表> 50%> 50GBVACUUM FULL周末维护窗口
配置表< 5%< 1GB自动清理系统自动
历史归档> 80%任意TRUNCATE + 重建按需

2.4 空间回收策略实现

监控脚本

sql
-- 创建监控视图,识别需要清理的表
CREATE OR REPLACE VIEW table_bloat_monitor AS
SELECT
    schemaname,
    tablename,
    n_live_tup,
    n_dead_tup,
    round((n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1) * 100), 2) as dead_ratio,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as table_size,
    last_vacuum,
    last_autovacuum,
    CASE
        WHEN n_dead_tup > 10000 AND (n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1)) > 0.2
        THEN 'URGENT'
        WHEN n_dead_tup > 5000 AND (n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1)) > 0.1
        THEN 'MEDIUM'
        ELSE 'LOW'
    END as priority
FROM pg_stat_user_tables
ORDER BY dead_ratio DESC;

-- 查看需要清理的表
SELECT * FROM table_bloat_monitor WHERE priority IN ('URGENT', 'MEDIUM');

自动化清理脚本

sql
-- 批量清理高优先级表的存储过程
CREATE OR REPLACE FUNCTION auto_vacuum_tables()
RETURNS TEXT AS $$
DECLARE
    table_record RECORD;
    result_text TEXT := '';
BEGIN
    -- 遍历需要清理的表
    FOR table_record IN
        SELECT schemaname, tablename, dead_ratio
        FROM table_bloat_monitor
        WHERE priority = 'URGENT'
    LOOP
        -- 执行清理
        EXECUTE format('VACUUM (VERBOSE) %I.%I', table_record.schemaname, table_record.tablename);
        result_text := result_text || format('Vacuumed %s.%s (dead_ratio: %s%%) ',
                                           table_record.schemaname,
                                           table_record.tablename,
                                           table_record.dead_ratio) || E'\n';
    END LOOP;

    RETURN result_text;
END;
$$ LANGUAGE plpgsql;

-- 执行批量清理
SELECT auto_vacuum_tables();

生产环境最佳实践

  1. 监控优先:建立表膨胀监控机制,在问题恶化前主动处理
  2. 分时执行:在业务低峰期执行大表清理,避免影响正常业务
  3. 渐进策略:优先使用标准 VACUUM,仅在必要时使用 VACUUM FULL
  4. 空间预留:为 VACUUM FULL 预留足够的磁盘空间(至少表大小的 2 倍) :::

2.5 特殊场景处理

批量删除场景

sql
-- 场景:清理90天前的订单数据
-- 不推荐的做法(产生大量死行)
-- DELETE FROM orders WHERE created_at < CURRENT_DATE - INTERVAL '90 days';

-- 推荐的做法:使用 TRUNCATE 配合分区
-- 1. 如果有分区表,直接删除分区
-- 2. 如果没有分区,考虑分批删除 + 及时清理

-- 分批删除示例
DO $$
DECLARE
    deleted_count INTEGER;
BEGIN
    LOOP
        -- 每次删除1000条记录
        DELETE FROM orders
        WHERE order_id IN (
            SELECT order_id
            FROM orders
            WHERE created_at < CURRENT_DATE - INTERVAL '90 days'
            LIMIT 1000
        );

        GET DIAGNOSTICS deleted_count = ROW_COUNT;

        -- 如果删除数量为0,退出循环
        IF deleted_count = 0 THEN
            EXIT;
        END IF;

        -- 每删除1000条执行一次清理
        VACUUM orders;

        -- 稍作暂停,避免长时间占用资源
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

这样的分批处理策略可以:

  • 避免长时间锁表
  • 及时回收空间
  • 减少对业务的影响
  • 保持系统响应性

📊 3. 查询计划器统计信息更新

3.1 统计信息的重要性

PostgreSQL 查询计划器依赖统计信息来生成最优的执行计划。过时的统计信息会导致错误的成本估算,从而选择低效的查询路径。

Syntax error in textmermaid version 11.8.0

3.2 业务场景:电商商品搜索优化

场景描述:电商平台的商品表经常有价格更新、库存变动等操作,需要及时更新统计信息以保证搜索性能。

sql
-- 创建商品表
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    category_id INTEGER,
    product_name VARCHAR(200),
    price DECIMAL(10,2),
    stock_quantity INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 创建索引
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_price ON products(price);
CREATE INDEX idx_products_stock ON products(stock_quantity);

-- 插入测试数据
INSERT INTO products (category_id, product_name, price, stock_quantity)
SELECT
    (random() * 100)::INTEGER,
    'Product ' || generate_series,
    (random() * 1000 + 10)::DECIMAL(10,2),
    (random() * 1000)::INTEGER
FROM generate_series(1, 500000);

3.3 统计信息分析实践

查看当前统计信息

sql
-- 查看表的基础统计信息
SELECT
    schemaname,
    tablename,
    attname,                    -- 列名
    n_distinct,                 -- 不同值的数量估计
    most_common_vals[1:5],      -- 最常见的值
    most_common_freqs[1:5],     -- 最常见值的频率
    correlation                 -- 物理存储与逻辑顺序的相关性
FROM pg_stats
WHERE tablename = 'products'
AND attname IN ('category_id', 'price', 'stock_quantity')
ORDER BY attname;

统计信息质量评估

sql
-- 检查统计信息的新鲜度
SELECT
    schemaname,
    tablename,
    n_tup_ins,                  -- 自上次分析以来插入的行数
    n_tup_upd,                  -- 自上次分析以来更新的行数
    n_tup_del,                  -- 自上次分析以来删除的行数
    last_analyze,               -- 最后一次手动分析时间
    last_autoanalyze,           -- 最后一次自动分析时间
    -- 计算变化比例
    round((n_tup_ins + n_tup_upd + n_tup_del)::numeric /
          GREATEST(n_live_tup, 1) * 100, 2) as change_ratio
FROM pg_stat_user_tables
WHERE relname = 'products';

3.4 执行计划优化示例

模拟价格区间查询场景

sql
-- 模拟大量价格更新(影响统计信息)
UPDATE products
SET price = price * (0.8 + random() * 0.4),
    updated_at = CURRENT_TIMESTAMP
WHERE product_id % 10 = 0;

-- 查看执行计划(统计信息过时)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE price BETWEEN 100 AND 200
AND stock_quantity > 50;

更新统计信息后的效果对比

sql
-- 更新统计信息
ANALYZE products;

-- 再次查看执行计划
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE price BETWEEN 100 AND 200
AND stock_quantity > 50;

执行计划对比分析

-- 更新前(过时统计信息):
Seq Scan on products (cost=0.00..15000.00 rows=1000 width=50) (actual time=0.500..120.000 rows=15000 loops=1)
  Filter: ((price >= 100) AND (price <= 200) AND (stock_quantity > 50))
  Rows Removed by Filter: 485000

-- 更新后(准确统计信息):
Index Scan using idx_products_price on products (cost=0.42..800.00 rows=15000 width=50) (actual time=0.100..25.000 rows=15000 loops=1)
  Index Cond: ((price >= 100) AND (price <= 200))
  Filter: (stock_quantity > 50)
  Rows Removed by Filter: 2000

性能提升说明更新统计信息后,查询计划器正确估算了结果集大小,选择了索引扫描而非全表扫描,执行时间从 120ms 降低到 25ms,提升了约 80% 的性能。 :::

3.5 分列统计信息调优

针对高选择性列调整统计信息详细程度

sql
-- 为价格列设置更详细的统计信息
ALTER TABLE products ALTER COLUMN price SET STATISTICS 1000;

-- 为低选择性的类别列设置较少的统计信息
ALTER TABLE products ALTER COLUMN category_id SET STATISTICS 50;

-- 重新收集统计信息
ANALYZE products;

-- 查看调整后的统计信息分布
SELECT
    attname,
    n_distinct,
    array_length(most_common_vals, 1) as mcv_count,
    array_length(histogram_bounds, 1) as histogram_buckets
FROM pg_stats
WHERE tablename = 'products'
AND attname IN ('category_id', 'price');

👁️ 4. 可见性映射与仅索引扫描

4.1 可见性映射机制

可见性映射是 PostgreSQL 的一个重要优化特性,它跟踪哪些数据页包含的所有元组对所有事务都是可见的。

Syntax error in textmermaid version 11.8.0

4.2 仅索引扫描优化实例

创建适合仅索引扫描的场景

sql
-- 创建用户活动表(适合仅索引扫描的场景)
CREATE TABLE user_activities (
    user_id INTEGER,
    activity_date DATE,
    activity_count INTEGER,
    last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 创建覆盖索引
CREATE INDEX idx_user_activities_covering
ON user_activities (user_id, activity_date)
INCLUDE (activity_count);

-- 插入大量历史数据(稳定数据,适合可见性映射优化)
INSERT INTO user_activities (user_id, activity_date, activity_count)
SELECT
    (random() * 10000)::INTEGER,
    CURRENT_DATE - (random() * 365)::INTEGER,
    (random() * 100)::INTEGER
FROM generate_series(1, 1000000);

执行 VACUUM 以建立可见性映射

sql
-- 执行清理以更新可见性映射
VACUUM (VERBOSE) user_activities;

验证仅索引扫描效果

sql
-- 查询能够使用仅索引扫描的SQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, activity_date, activity_count
FROM user_activities
WHERE user_id = 1001
AND activity_date >= CURRENT_DATE - INTERVAL '30 days';

执行计划分析

Index Only Scan using idx_user_activities_covering on user_activities
  (cost=0.42..15.00 rows=5 width=12) (actual time=0.050..0.250 rows=5 loops=1)
  Index Cond: ((user_id = 1001) AND (activity_date >= (CURRENT_DATE - '30 days'::interval)))
  Heap Fetches: 0    -- 关键指标:没有访问堆页面
  Buffers: shared hit=3

仅索引扫描优化要点

  1. Heap Fetches: 0 表示完全避免了堆页面访问,这是最佳状态
  2. 如果 Heap Fetches > 0,说明部分页面的可见性映射尚未更新,需要执行 VACUUM
  3. 仅索引扫描要求索引包含查询所需的所有列(使用 INCLUDE 子句) :::

4.3 可见性映射监控

创建可见性映射监控视图

sql
-- 查看表的可见性映射状态
SELECT
    schemaname,
    tablename,
    n_tup_ins,
    n_tup_upd,
    n_tup_del,
    last_vacuum,
    last_autovacuum,
    -- 计算可能需要更新可见性映射的比例
    CASE
        WHEN last_vacuum IS NULL AND last_autovacuum IS NULL THEN 'NEVER_VACUUMED'
        WHEN (n_tup_upd + n_tup_del) > n_live_tup * 0.1 THEN 'HIGH_ACTIVITY'
        WHEN (n_tup_upd + n_tup_del) > n_live_tup * 0.05 THEN 'MEDIUM_ACTIVITY'
        ELSE 'LOW_ACTIVITY'
    END as vacuum_priority
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY (n_tup_upd + n_tup_del) DESC;

⚠️ 5. 防止事务 ID 回绕失败

5.1 事务 ID 回绕问题原理

PostgreSQL 使用 32 位事务 ID(XID),这意味着最多可以有约 42 亿个事务。当事务 ID 用完后会回绕到 0,这会导致数据可见性问题。

Syntax error in textmermaid version 11.8.0

5.2 事务 ID 年龄监控

创建事务 ID 监控脚本

sql
-- 查看数据库级别的事务ID年龄
SELECT
    datname,
    age(datfrozenxid) as db_age,
    datfrozenxid,
    -- 计算距离强制清理的剩余事务数
    (2000000000 - age(datfrozenxid)) as transactions_until_wraparound
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY age(datfrozenxid) DESC;

-- 查看表级别的事务ID年龄
SELECT
    schemaname,
    tablename,
    greatest(age(relfrozenxid), age(relminmxid)) as max_age,
    age(relfrozenxid) as xid_age,
    age(relminmxid) as mxid_age,
    -- 评估紧急程度
    CASE
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 1500000000 THEN 'CRITICAL'
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 1000000000 THEN 'WARNING'
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 500000000 THEN 'ATTENTION'
        ELSE 'NORMAL'
    END as status
FROM pg_stat_user_tables t
JOIN pg_class c ON t.relid = c.oid
ORDER BY max_age DESC;

5.3 预防性清理策略

配置参数优化

sql
-- 查看当前防回绕相关参数
SELECT
    name,
    setting,
    unit,
    short_desc
FROM pg_settings
WHERE name IN (
    'autovacuum_freeze_max_age',
    'vacuum_freeze_min_age',
    'vacuum_freeze_table_age',
    'autovacuum_multixact_freeze_max_age'
)
ORDER BY name;

紧急回绕处理程序

sql
-- 创建紧急处理函数
CREATE OR REPLACE FUNCTION emergency_xid_cleanup()
RETURNS TABLE(
    table_name TEXT,
    xid_age BIGINT,
    action_taken TEXT
) AS $$
DECLARE
    table_record RECORD;
    emergency_threshold BIGINT := 1800000000; -- 18亿事务的紧急阈值
BEGIN
    -- 查找需要紧急处理的表
    FOR table_record IN
        SELECT
            t.schemaname,
            t.tablename,
            greatest(age(c.relfrozenxid), age(c.relminmxid)) as max_age
        FROM pg_stat_user_tables t
        JOIN pg_class c ON t.relid = c.oid
        WHERE greatest(age(c.relfrozenxid), age(c.relminmxid)) > emergency_threshold
        ORDER BY max_age DESC
    LOOP
        -- 执行强制清理
        EXECUTE format('VACUUM (FREEZE, VERBOSE) %I.%I',
                      table_record.schemaname,
                      table_record.tablename);

        -- 返回处理结果
        table_name := table_record.schemaname || '.' || table_record.tablename;
        xid_age := table_record.max_age;
        action_taken := 'VACUUM FREEZE executed';
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- 执行紧急清理(当收到回绕警告时)
SELECT * FROM emergency_xid_cleanup();

5.4 回绕预警系统

创建监控告警

sql
-- 创建回绕预警视图
CREATE OR REPLACE VIEW xid_wraparound_alert AS
SELECT
    'DATABASE' as object_type,
    datname as object_name,
    age(datfrozenxid) as current_age,
    (2000000000 - age(datfrozenxid)) as remaining_transactions,
    CASE
        WHEN age(datfrozenxid) > 1800000000 THEN 'CRITICAL - Immediate action required'
        WHEN age(datfrozenxid) > 1500000000 THEN 'HIGH - Schedule emergency maintenance'
        WHEN age(datfrozenxid) > 1000000000 THEN 'MEDIUM - Plan maintenance window'
        ELSE 'LOW - Normal monitoring'
    END as alert_level
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')

UNION ALL

SELECT
    'TABLE' as object_type,
    schemaname || '.' || tablename as object_name,
    greatest(age(c.relfrozenxid), age(c.relminmxid)) as current_age,
    (2000000000 - greatest(age(c.relfrozenxid), age(c.relminmxid))) as remaining_transactions,
    CASE
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1800000000 THEN 'CRITICAL - Immediate action required'
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1500000000 THEN 'HIGH - Schedule emergency maintenance'
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1000000000 THEN 'MEDIUM - Plan maintenance window'
        ELSE 'LOW - Normal monitoring'
    END as alert_level
FROM pg_stat_user_tables t
JOIN pg_class c ON t.relid = c.oid

ORDER BY current_age DESC;

-- 查看当前告警状态
SELECT * FROM xid_wraparound_alert WHERE alert_level != 'LOW - Normal monitoring';

🤖 6. 自动清理守护进程(Autovacuum)

6.1 Autovacuum 工作机制

Syntax error in textmermaid version 11.8.0

6.2 Autovacuum 配置优化

查看当前配置

sql
-- 查看 autovacuum 相关配置
SELECT
    name,
    setting,
    unit,
    short_desc
FROM pg_settings
WHERE name LIKE '%autovacuum%'
OR name LIKE '%vacuum%threshold%'
OR name LIKE '%vacuum%scale%'
ORDER BY name;

业务场景配置实例

sql
-- 针对不同业务表的个性化配置

-- 1. 高频交易表:降低阈值,增加清理频率
ALTER TABLE orders SET (
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_threshold = 500,
    autovacuum_analyze_scale_factor = 0.02
);

-- 2. 日志表:延长清理间隔,降低系统影响
ALTER TABLE user_logs SET (
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0.2,
    autovacuum_vacuum_cost_delay = 50
);

-- 3. 配置表:几乎不变化,减少自动分析
ALTER TABLE system_config SET (
    autovacuum_analyze_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.1
);

-- 4. 归档表:禁用自动清理(手动管理)
ALTER TABLE historical_data SET (
    autovacuum_enabled = false
);

6.3 Autovacuum 性能监控

创建监控仪表板

sql
-- Autovacuum 活动监控
CREATE OR REPLACE VIEW autovacuum_activity AS
SELECT
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    xact_start,
    query_start,
    state_change,
    state,
    query,
    -- 计算运行时长
    CASE
        WHEN state = 'active' THEN
            extract(epoch from now() - query_start)::INTEGER
        ELSE NULL
    END as running_seconds
FROM pg_stat_activity
WHERE application_name = 'autovacuum worker'
OR query LIKE '%autovacuum%'
ORDER BY query_start;

-- 表级别的 autovacuum 统计
CREATE OR REPLACE VIEW table_autovacuum_stats AS
SELECT
    schemaname,
    tablename,
    n_tup_ins,
    n_tup_upd,
    n_tup_del,
    n_live_tup,
    n_dead_tup,
    round((n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1) * 100), 2) as dead_ratio,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count,
    -- 计算 autovacuum 效率
    CASE
        WHEN autovacuum_count > 0 THEN
            round(vacuum_count::float / autovacuum_count, 2)
        ELSE NULL
    END as manual_to_auto_ratio
FROM pg_stat_user_tables
ORDER BY dead_ratio DESC;

-- 查看需要关注的表
SELECT * FROM table_autovacuum_stats
WHERE dead_ratio > 20 OR last_autovacuum < CURRENT_DATE - INTERVAL '7 days';

6.4 问题诊断与调优

常见问题诊断

sql
-- 诊断 autovacuum 未按预期工作的原因
WITH autovacuum_config AS (
    SELECT
        schemaname,
        tablename,
        n_live_tup,
        n_dead_tup,
        -- 计算当前的清理阈值
        COALESCE(
            (SELECT setting::INTEGER FROM pg_settings WHERE name = 'autovacuum_vacuum_threshold'),
            50
        ) as vacuum_threshold,
        COALESCE(
            (SELECT setting::FLOAT FROM pg_settings WHERE name = 'autovacuum_vacuum_scale_factor'),
            0.2
        ) as vacuum_scale_factor
    FROM pg_stat_user_tables
)
SELECT
    schemaname,
    tablename,
    n_live_tup,
    n_dead_tup,
    vacuum_threshold,
    vacuum_scale_factor,
    -- 计算实际阈值
    (vacuum_threshold + vacuum_scale_factor * n_live_tup)::INTEGER as calculated_threshold,
    -- 判断是否应该触发清理
    CASE
        WHEN n_dead_tup >= (vacuum_threshold + vacuum_scale_factor * n_live_tup)
        THEN 'SHOULD_VACUUM'
        ELSE 'NOT_NEEDED'
    END as vacuum_status,
    -- 计算达到阈值还需要多少死行
    GREATEST(0, (vacuum_threshold + vacuum_scale_factor * n_live_tup)::INTEGER - n_dead_tup) as rows_until_vacuum
FROM autovacuum_config
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC;

性能调优建议

sql
-- 生成个性化的 autovacuum 调优建议
CREATE OR REPLACE FUNCTION autovacuum_tuning_advice()
RETURNS TABLE(
    table_name TEXT,
    current_dead_ratio NUMERIC,
    advice TEXT,
    suggested_config TEXT
) AS $$
DECLARE
    table_record RECORD;
BEGIN
    FOR table_record IN
        SELECT
            t.schemaname,
            t.tablename,
            t.n_live_tup,
            t.n_dead_tup,
            round((t.n_dead_tup::float / GREATEST(t.n_live_tup + t.n_dead_tup, 1) * 100), 2) as dead_ratio,
            pg_size_pretty(pg_total_relation_size(t.schemaname||'.'||t.tablename)) as table_size,
            t.autovacuum_count,
            extract(days from now() - COALESCE(t.last_autovacuum, t.last_vacuum)) as days_since_vacuum
        FROM pg_stat_user_tables t
        WHERE t.n_live_tup > 1000  -- 只关注有一定规模的表
    LOOP
        table_name := table_record.schemaname || '.' || table_record.tablename;
        current_dead_ratio := table_record.dead_ratio;

        -- 根据不同情况给出建议
        IF table_record.dead_ratio > 50 THEN
            advice := 'Critical: High dead tuple ratio, consider immediate manual VACUUM';
            suggested_config := format('autovacuum_vacuum_scale_factor = 0.1, autovacuum_vacuum_threshold = %s',
                                     LEAST(1000, table_record.n_live_tup * 0.05)::INTEGER);
        ELSIF table_record.dead_ratio > 30 THEN
            advice := 'Warning: Moderate dead tuple ratio, tune autovacuum more aggressively';
            suggested_config := format('autovacuum_vacuum_scale_factor = 0.15, autovacuum_vacuum_threshold = %s',
                                     LEAST(2000, table_record.n_live_tup * 0.1)::INTEGER);
        ELSIF table_record.days_since_vacuum > 7 AND table_record.dead_ratio > 10 THEN
            advice := 'Info: Table not vacuumed recently, check autovacuum activity';
            suggested_config := 'Check autovacuum_naptime and autovacuum_max_workers settings';
        ELSIF table_record.autovacuum_count = 0 AND table_record.dead_ratio > 5 THEN
            advice := 'Info: Autovacuum never ran, check if autovacuum is enabled';
            suggested_config := 'Verify autovacuum = on and check table-specific settings';
        ELSE
            advice := 'OK: Table appears to be managed well by autovacuum';
            suggested_config := 'Current configuration appears adequate';
        END IF;

        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- 获取调优建议
SELECT * FROM autovacuum_tuning_advice()
WHERE advice NOT LIKE 'OK:%'
ORDER BY current_dead_ratio DESC;

生产环境注意事项

  1. 资源控制:合理设置 autovacuum_max_workersautovacuum_vacuum_cost_delay
  2. 业务影响:在业务高峰期适当延长 autovacuum_naptime
  3. 监控告警:建立 autovacuum 长时间运行的告警机制
  4. 分区表处理:分区表需要单独处理每个分区,合理规划分区大小 :::

📝 总结与最佳实践

核心要点回顾

  1. 理解 MVCC 机制:UPDATE/DELETE 产生死行,需要 VACUUM 回收空间
  2. 平衡清理策略:标准 VACUUM 用于日常维护,VACUUM FULL 仅用于特殊情况
  3. 重视统计信息:及时的 ANALYZE 是查询性能优化的基础
  4. 监控事务 ID 年龄:防止事务 ID 回绕导致的数据丢失
  5. 合理配置 Autovacuum:根据业务特点个性化配置清理参数

生产环境检查清单

  • 建立表膨胀监控机制
  • 配置事务 ID 年龄告警
  • 根据业务特点调整 autovacuum 参数
  • 定期检查长时间运行的 VACUUM 操作
  • 为大表预留 VACUUM FULL 的维护窗口
  • 监控可见性映射更新状态,优化仅索引扫描效果

通过系统性地应用这些清理策略,可以确保 PostgreSQL 数据库在高负载环境下保持优秀的性能和稳定性。