Skip to content

PostgreSQL 例行清理(Routine Vacuuming)学习笔记

📋 概述

PostgreSQL 的例行清理是数据库维护的核心组成部分。在高并发、高频率更新的业务场景中,如电商平台的订单系统、金融系统的交易记录等,理解和掌握清理机制对于保证数据库性能和稳定性至关重要。

本笔记将通过真实业务场景和完整示例,深入解析 PostgreSQL 的清理机制,帮助您在实际工作中有效管理数据库维护。

🎯 核心价值

PostgreSQL 的 VACUUM 命令解决了四个关键问题:

  • 磁盘空间管理:回收删除和更新操作产生的死行空间
  • 查询性能优化:维护统计信息,确保查询计划器做出最优决策
  • 索引效率提升:更新可见性映射,支持仅索引扫描
  • 数据安全保障:防止事务 ID 回绕导致的数据丢失

🔧 1. 清理基础原理

1.1 MVCC 与死行产生机制

PostgreSQL 使用多版本并发控制(MVCC)来处理并发事务。这意味着 UPDATE 和 DELETE 操作不会立即物理删除数据,而是创建新版本或标记为删除。

1.2 业务场景示例:电商订单系统

假设我们有一个电商平台的订单表,每天处理大量订单状态更新:

sql
-- 创建订单表
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_status VARCHAR(20) DEFAULT 'pending',
    order_amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 插入测试数据
INSERT INTO orders (customer_id, order_amount)
SELECT
    (random() * 10000)::INTEGER,
    (random() * 1000 + 10)::DECIMAL(10,2)
FROM generate_series(1, 100000);

模拟高频更新场景

sql
-- 模拟订单状态变更(产生大量死行)
UPDATE orders
SET order_status = 'processing',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'pending'
AND order_id % 3 = 0;

UPDATE orders
SET order_status = 'shipped',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'processing'
AND order_id % 5 = 0;

UPDATE orders
SET order_status = 'completed',
    updated_at = CURRENT_TIMESTAMP
WHERE order_status = 'shipped'
AND order_id % 2 = 0;

查看表膨胀情况

sql
-- 查看表的页面使用情况
SELECT
    schemaname,
    tablename,
    attname,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'orders';

-- 查看表大小和死行统计
SELECT
    relname,
    n_tup_ins,      -- 插入的行数
    n_tup_upd,      -- 更新的行数
    n_tup_del,      -- 删除的行数
    n_live_tup,     -- 活跃行数
    n_dead_tup,     -- 死行数
    round((n_dead_tup::float / (n_live_tup + n_dead_tup) * 100), 2) as dead_ratio
FROM pg_stat_user_tables
WHERE relname = 'orders';

分析输出示例

 relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | dead_ratio
---------+-----------+-----------+-----------+------------+------------+------------
 orders  |    100000 |     45000 |         0 |     100000 |      45000 |      31.03

当死行比例超过 20% 时,会显著影响查询性能和存储效率,此时应考虑执行 VACUUM 操作。 :::

1.3 VACUUM 类型对比

类型执行速度空间回收锁级别适用场景
VACUUM快速标记可重用较低锁日常维护
VACUUM FULL缓慢完全回收排他锁严重膨胀

🗄️ 2. 磁盘空间回收策略

2.1 标准 VACUUM 的工作原理

2.2 实践示例:清理订单表

执行标准清理

sql
-- 执行详细清理,查看处理过程
VACUUM (VERBOSE, ANALYZE) orders;

清理输出解析

INFO:  vacuuming "public.orders"
INFO:  "orders": removed 45000 row versions in 500 pages
INFO:  "orders": found 45000 removable, 100000 nonremovable row versions in 1000 pages
INFO:  "orders": truncated 1000 to 800 pages
DETAIL:  CPU: user: 0.15 s, system: 0.05 s, elapsed: 0.25 s

清理效果验证

sql
-- 清理前后对比查询
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as table_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as data_size,
    n_dead_tup,
    round((n_dead_tup::float / (n_live_tup + n_dead_tup) * 100), 2) as dead_ratio
FROM pg_stat_user_tables t
JOIN pg_tables pt ON t.relname = pt.tablename
WHERE t.relname = 'orders';

2.3 业务场景决策矩阵

场景死行比例表大小推荐策略执行频率
订单系统10-20%< 10GB标准 VACUUM每日
日志表> 50%> 50GBVACUUM FULL周末维护窗口
配置表< 5%< 1GB自动清理系统自动
历史归档> 80%任意TRUNCATE + 重建按需

2.4 空间回收策略实现

监控脚本

sql
-- 创建监控视图,识别需要清理的表
CREATE OR REPLACE VIEW table_bloat_monitor AS
SELECT
    schemaname,
    tablename,
    n_live_tup,
    n_dead_tup,
    round((n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1) * 100), 2) as dead_ratio,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as table_size,
    last_vacuum,
    last_autovacuum,
    CASE
        WHEN n_dead_tup > 10000 AND (n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1)) > 0.2
        THEN 'URGENT'
        WHEN n_dead_tup > 5000 AND (n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1)) > 0.1
        THEN 'MEDIUM'
        ELSE 'LOW'
    END as priority
FROM pg_stat_user_tables
ORDER BY dead_ratio DESC;

-- 查看需要清理的表
SELECT * FROM table_bloat_monitor WHERE priority IN ('URGENT', 'MEDIUM');

自动化清理脚本

sql
-- 批量清理高优先级表的存储过程
CREATE OR REPLACE FUNCTION auto_vacuum_tables()
RETURNS TEXT AS $$
DECLARE
    table_record RECORD;
    result_text TEXT := '';
BEGIN
    -- 遍历需要清理的表
    FOR table_record IN
        SELECT schemaname, tablename, dead_ratio
        FROM table_bloat_monitor
        WHERE priority = 'URGENT'
    LOOP
        -- 执行清理
        EXECUTE format('VACUUM (VERBOSE) %I.%I', table_record.schemaname, table_record.tablename);
        result_text := result_text || format('Vacuumed %s.%s (dead_ratio: %s%%) ',
                                           table_record.schemaname,
                                           table_record.tablename,
                                           table_record.dead_ratio) || E'\n';
    END LOOP;

    RETURN result_text;
END;
$$ LANGUAGE plpgsql;

-- 执行批量清理
SELECT auto_vacuum_tables();

生产环境最佳实践

  1. 监控优先:建立表膨胀监控机制,在问题恶化前主动处理
  2. 分时执行:在业务低峰期执行大表清理,避免影响正常业务
  3. 渐进策略:优先使用标准 VACUUM,仅在必要时使用 VACUUM FULL
  4. 空间预留:为 VACUUM FULL 预留足够的磁盘空间(至少表大小的 2 倍) :::

2.5 特殊场景处理

批量删除场景

sql
-- 场景:清理90天前的订单数据
-- 不推荐的做法(产生大量死行)
-- DELETE FROM orders WHERE created_at < CURRENT_DATE - INTERVAL '90 days';

-- 推荐的做法:使用 TRUNCATE 配合分区
-- 1. 如果有分区表,直接删除分区
-- 2. 如果没有分区,考虑分批删除 + 及时清理

-- 分批删除示例
DO $$
DECLARE
    deleted_count INTEGER;
BEGIN
    LOOP
        -- 每次删除1000条记录
        DELETE FROM orders
        WHERE order_id IN (
            SELECT order_id
            FROM orders
            WHERE created_at < CURRENT_DATE - INTERVAL '90 days'
            LIMIT 1000
        );

        GET DIAGNOSTICS deleted_count = ROW_COUNT;

        -- 如果删除数量为0,退出循环
        IF deleted_count = 0 THEN
            EXIT;
        END IF;

        -- 每删除1000条执行一次清理
        VACUUM orders;

        -- 稍作暂停,避免长时间占用资源
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

这样的分批处理策略可以:

  • 避免长时间锁表
  • 及时回收空间
  • 减少对业务的影响
  • 保持系统响应性

📊 3. 查询计划器统计信息更新

3.1 统计信息的重要性

PostgreSQL 查询计划器依赖统计信息来生成最优的执行计划。过时的统计信息会导致错误的成本估算,从而选择低效的查询路径。

3.2 业务场景:电商商品搜索优化

场景描述:电商平台的商品表经常有价格更新、库存变动等操作,需要及时更新统计信息以保证搜索性能。

sql
-- 创建商品表
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    category_id INTEGER,
    product_name VARCHAR(200),
    price DECIMAL(10,2),
    stock_quantity INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 创建索引
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_price ON products(price);
CREATE INDEX idx_products_stock ON products(stock_quantity);

-- 插入测试数据
INSERT INTO products (category_id, product_name, price, stock_quantity)
SELECT
    (random() * 100)::INTEGER,
    'Product ' || generate_series,
    (random() * 1000 + 10)::DECIMAL(10,2),
    (random() * 1000)::INTEGER
FROM generate_series(1, 500000);

3.3 统计信息分析实践

查看当前统计信息

sql
-- 查看表的基础统计信息
SELECT
    schemaname,
    tablename,
    attname,                    -- 列名
    n_distinct,                 -- 不同值的数量估计
    most_common_vals[1:5],      -- 最常见的值
    most_common_freqs[1:5],     -- 最常见值的频率
    correlation                 -- 物理存储与逻辑顺序的相关性
FROM pg_stats
WHERE tablename = 'products'
AND attname IN ('category_id', 'price', 'stock_quantity')
ORDER BY attname;

统计信息质量评估

sql
-- 检查统计信息的新鲜度
SELECT
    schemaname,
    tablename,
    n_tup_ins,                  -- 自上次分析以来插入的行数
    n_tup_upd,                  -- 自上次分析以来更新的行数
    n_tup_del,                  -- 自上次分析以来删除的行数
    last_analyze,               -- 最后一次手动分析时间
    last_autoanalyze,           -- 最后一次自动分析时间
    -- 计算变化比例
    round((n_tup_ins + n_tup_upd + n_tup_del)::numeric /
          GREATEST(n_live_tup, 1) * 100, 2) as change_ratio
FROM pg_stat_user_tables
WHERE relname = 'products';

3.4 执行计划优化示例

模拟价格区间查询场景

sql
-- 模拟大量价格更新(影响统计信息)
UPDATE products
SET price = price * (0.8 + random() * 0.4),
    updated_at = CURRENT_TIMESTAMP
WHERE product_id % 10 = 0;

-- 查看执行计划(统计信息过时)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE price BETWEEN 100 AND 200
AND stock_quantity > 50;

更新统计信息后的效果对比

sql
-- 更新统计信息
ANALYZE products;

-- 再次查看执行计划
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products
WHERE price BETWEEN 100 AND 200
AND stock_quantity > 50;

执行计划对比分析

-- 更新前(过时统计信息):
Seq Scan on products (cost=0.00..15000.00 rows=1000 width=50) (actual time=0.500..120.000 rows=15000 loops=1)
  Filter: ((price >= 100) AND (price <= 200) AND (stock_quantity > 50))
  Rows Removed by Filter: 485000

-- 更新后(准确统计信息):
Index Scan using idx_products_price on products (cost=0.42..800.00 rows=15000 width=50) (actual time=0.100..25.000 rows=15000 loops=1)
  Index Cond: ((price >= 100) AND (price <= 200))
  Filter: (stock_quantity > 50)
  Rows Removed by Filter: 2000

性能提升说明更新统计信息后,查询计划器正确估算了结果集大小,选择了索引扫描而非全表扫描,执行时间从 120ms 降低到 25ms,提升了约 80% 的性能。 :::

3.5 分列统计信息调优

针对高选择性列调整统计信息详细程度

sql
-- 为价格列设置更详细的统计信息
ALTER TABLE products ALTER COLUMN price SET STATISTICS 1000;

-- 为低选择性的类别列设置较少的统计信息
ALTER TABLE products ALTER COLUMN category_id SET STATISTICS 50;

-- 重新收集统计信息
ANALYZE products;

-- 查看调整后的统计信息分布
SELECT
    attname,
    n_distinct,
    array_length(most_common_vals, 1) as mcv_count,
    array_length(histogram_bounds, 1) as histogram_buckets
FROM pg_stats
WHERE tablename = 'products'
AND attname IN ('category_id', 'price');

👁️ 4. 可见性映射与仅索引扫描

4.1 可见性映射机制

可见性映射是 PostgreSQL 的一个重要优化特性,它跟踪哪些数据页包含的所有元组对所有事务都是可见的。

4.2 仅索引扫描优化实例

创建适合仅索引扫描的场景

sql
-- 创建用户活动表(适合仅索引扫描的场景)
CREATE TABLE user_activities (
    user_id INTEGER,
    activity_date DATE,
    activity_count INTEGER,
    last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 创建覆盖索引
CREATE INDEX idx_user_activities_covering
ON user_activities (user_id, activity_date)
INCLUDE (activity_count);

-- 插入大量历史数据(稳定数据,适合可见性映射优化)
INSERT INTO user_activities (user_id, activity_date, activity_count)
SELECT
    (random() * 10000)::INTEGER,
    CURRENT_DATE - (random() * 365)::INTEGER,
    (random() * 100)::INTEGER
FROM generate_series(1, 1000000);

执行 VACUUM 以建立可见性映射

sql
-- 执行清理以更新可见性映射
VACUUM (VERBOSE) user_activities;

验证仅索引扫描效果

sql
-- 查询能够使用仅索引扫描的SQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, activity_date, activity_count
FROM user_activities
WHERE user_id = 1001
AND activity_date >= CURRENT_DATE - INTERVAL '30 days';

执行计划分析

Index Only Scan using idx_user_activities_covering on user_activities
  (cost=0.42..15.00 rows=5 width=12) (actual time=0.050..0.250 rows=5 loops=1)
  Index Cond: ((user_id = 1001) AND (activity_date >= (CURRENT_DATE - '30 days'::interval)))
  Heap Fetches: 0    -- 关键指标:没有访问堆页面
  Buffers: shared hit=3

仅索引扫描优化要点

  1. Heap Fetches: 0 表示完全避免了堆页面访问,这是最佳状态
  2. 如果 Heap Fetches > 0,说明部分页面的可见性映射尚未更新,需要执行 VACUUM
  3. 仅索引扫描要求索引包含查询所需的所有列(使用 INCLUDE 子句) :::

4.3 可见性映射监控

创建可见性映射监控视图

sql
-- 查看表的可见性映射状态
SELECT
    schemaname,
    tablename,
    n_tup_ins,
    n_tup_upd,
    n_tup_del,
    last_vacuum,
    last_autovacuum,
    -- 计算可能需要更新可见性映射的比例
    CASE
        WHEN last_vacuum IS NULL AND last_autovacuum IS NULL THEN 'NEVER_VACUUMED'
        WHEN (n_tup_upd + n_tup_del) > n_live_tup * 0.1 THEN 'HIGH_ACTIVITY'
        WHEN (n_tup_upd + n_tup_del) > n_live_tup * 0.05 THEN 'MEDIUM_ACTIVITY'
        ELSE 'LOW_ACTIVITY'
    END as vacuum_priority
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY (n_tup_upd + n_tup_del) DESC;

⚠️ 5. 防止事务 ID 回绕失败

5.1 事务 ID 回绕问题原理

PostgreSQL 使用 32 位事务 ID(XID),这意味着最多可以有约 42 亿个事务。当事务 ID 用完后会回绕到 0,这会导致数据可见性问题。

5.2 事务 ID 年龄监控

创建事务 ID 监控脚本

sql
-- 查看数据库级别的事务ID年龄
SELECT
    datname,
    age(datfrozenxid) as db_age,
    datfrozenxid,
    -- 计算距离强制清理的剩余事务数
    (2000000000 - age(datfrozenxid)) as transactions_until_wraparound
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY age(datfrozenxid) DESC;

-- 查看表级别的事务ID年龄
SELECT
    schemaname,
    tablename,
    greatest(age(relfrozenxid), age(relminmxid)) as max_age,
    age(relfrozenxid) as xid_age,
    age(relminmxid) as mxid_age,
    -- 评估紧急程度
    CASE
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 1500000000 THEN 'CRITICAL'
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 1000000000 THEN 'WARNING'
        WHEN greatest(age(relfrozenxid), age(relminmxid)) > 500000000 THEN 'ATTENTION'
        ELSE 'NORMAL'
    END as status
FROM pg_stat_user_tables t
JOIN pg_class c ON t.relid = c.oid
ORDER BY max_age DESC;

5.3 预防性清理策略

配置参数优化

sql
-- 查看当前防回绕相关参数
SELECT
    name,
    setting,
    unit,
    short_desc
FROM pg_settings
WHERE name IN (
    'autovacuum_freeze_max_age',
    'vacuum_freeze_min_age',
    'vacuum_freeze_table_age',
    'autovacuum_multixact_freeze_max_age'
)
ORDER BY name;

紧急回绕处理程序

sql
-- 创建紧急处理函数
CREATE OR REPLACE FUNCTION emergency_xid_cleanup()
RETURNS TABLE(
    table_name TEXT,
    xid_age BIGINT,
    action_taken TEXT
) AS $$
DECLARE
    table_record RECORD;
    emergency_threshold BIGINT := 1800000000; -- 18亿事务的紧急阈值
BEGIN
    -- 查找需要紧急处理的表
    FOR table_record IN
        SELECT
            t.schemaname,
            t.tablename,
            greatest(age(c.relfrozenxid), age(c.relminmxid)) as max_age
        FROM pg_stat_user_tables t
        JOIN pg_class c ON t.relid = c.oid
        WHERE greatest(age(c.relfrozenxid), age(c.relminmxid)) > emergency_threshold
        ORDER BY max_age DESC
    LOOP
        -- 执行强制清理
        EXECUTE format('VACUUM (FREEZE, VERBOSE) %I.%I',
                      table_record.schemaname,
                      table_record.tablename);

        -- 返回处理结果
        table_name := table_record.schemaname || '.' || table_record.tablename;
        xid_age := table_record.max_age;
        action_taken := 'VACUUM FREEZE executed';
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- 执行紧急清理(当收到回绕警告时)
SELECT * FROM emergency_xid_cleanup();

5.4 回绕预警系统

创建监控告警

sql
-- 创建回绕预警视图
CREATE OR REPLACE VIEW xid_wraparound_alert AS
SELECT
    'DATABASE' as object_type,
    datname as object_name,
    age(datfrozenxid) as current_age,
    (2000000000 - age(datfrozenxid)) as remaining_transactions,
    CASE
        WHEN age(datfrozenxid) > 1800000000 THEN 'CRITICAL - Immediate action required'
        WHEN age(datfrozenxid) > 1500000000 THEN 'HIGH - Schedule emergency maintenance'
        WHEN age(datfrozenxid) > 1000000000 THEN 'MEDIUM - Plan maintenance window'
        ELSE 'LOW - Normal monitoring'
    END as alert_level
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')

UNION ALL

SELECT
    'TABLE' as object_type,
    schemaname || '.' || tablename as object_name,
    greatest(age(c.relfrozenxid), age(c.relminmxid)) as current_age,
    (2000000000 - greatest(age(c.relfrozenxid), age(c.relminmxid))) as remaining_transactions,
    CASE
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1800000000 THEN 'CRITICAL - Immediate action required'
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1500000000 THEN 'HIGH - Schedule emergency maintenance'
        WHEN greatest(age(c.relfrozenxid), age(c.relminmxid)) > 1000000000 THEN 'MEDIUM - Plan maintenance window'
        ELSE 'LOW - Normal monitoring'
    END as alert_level
FROM pg_stat_user_tables t
JOIN pg_class c ON t.relid = c.oid

ORDER BY current_age DESC;

-- 查看当前告警状态
SELECT * FROM xid_wraparound_alert WHERE alert_level != 'LOW - Normal monitoring';

🤖 6. 自动清理守护进程(Autovacuum)

6.1 Autovacuum 工作机制

6.2 Autovacuum 配置优化

查看当前配置

sql
-- 查看 autovacuum 相关配置
SELECT
    name,
    setting,
    unit,
    short_desc
FROM pg_settings
WHERE name LIKE '%autovacuum%'
OR name LIKE '%vacuum%threshold%'
OR name LIKE '%vacuum%scale%'
ORDER BY name;

业务场景配置实例

sql
-- 针对不同业务表的个性化配置

-- 1. 高频交易表:降低阈值,增加清理频率
ALTER TABLE orders SET (
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_threshold = 500,
    autovacuum_analyze_scale_factor = 0.02
);

-- 2. 日志表:延长清理间隔,降低系统影响
ALTER TABLE user_logs SET (
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0.2,
    autovacuum_vacuum_cost_delay = 50
);

-- 3. 配置表:几乎不变化,减少自动分析
ALTER TABLE system_config SET (
    autovacuum_analyze_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.1
);

-- 4. 归档表:禁用自动清理(手动管理)
ALTER TABLE historical_data SET (
    autovacuum_enabled = false
);

6.3 Autovacuum 性能监控

创建监控仪表板

sql
-- Autovacuum 活动监控
CREATE OR REPLACE VIEW autovacuum_activity AS
SELECT
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    xact_start,
    query_start,
    state_change,
    state,
    query,
    -- 计算运行时长
    CASE
        WHEN state = 'active' THEN
            extract(epoch from now() - query_start)::INTEGER
        ELSE NULL
    END as running_seconds
FROM pg_stat_activity
WHERE application_name = 'autovacuum worker'
OR query LIKE '%autovacuum%'
ORDER BY query_start;

-- 表级别的 autovacuum 统计
CREATE OR REPLACE VIEW table_autovacuum_stats AS
SELECT
    schemaname,
    tablename,
    n_tup_ins,
    n_tup_upd,
    n_tup_del,
    n_live_tup,
    n_dead_tup,
    round((n_dead_tup::float / GREATEST(n_live_tup + n_dead_tup, 1) * 100), 2) as dead_ratio,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count,
    -- 计算 autovacuum 效率
    CASE
        WHEN autovacuum_count > 0 THEN
            round(vacuum_count::float / autovacuum_count, 2)
        ELSE NULL
    END as manual_to_auto_ratio
FROM pg_stat_user_tables
ORDER BY dead_ratio DESC;

-- 查看需要关注的表
SELECT * FROM table_autovacuum_stats
WHERE dead_ratio > 20 OR last_autovacuum < CURRENT_DATE - INTERVAL '7 days';

6.4 问题诊断与调优

常见问题诊断

sql
-- 诊断 autovacuum 未按预期工作的原因
WITH autovacuum_config AS (
    SELECT
        schemaname,
        tablename,
        n_live_tup,
        n_dead_tup,
        -- 计算当前的清理阈值
        COALESCE(
            (SELECT setting::INTEGER FROM pg_settings WHERE name = 'autovacuum_vacuum_threshold'),
            50
        ) as vacuum_threshold,
        COALESCE(
            (SELECT setting::FLOAT FROM pg_settings WHERE name = 'autovacuum_vacuum_scale_factor'),
            0.2
        ) as vacuum_scale_factor
    FROM pg_stat_user_tables
)
SELECT
    schemaname,
    tablename,
    n_live_tup,
    n_dead_tup,
    vacuum_threshold,
    vacuum_scale_factor,
    -- 计算实际阈值
    (vacuum_threshold + vacuum_scale_factor * n_live_tup)::INTEGER as calculated_threshold,
    -- 判断是否应该触发清理
    CASE
        WHEN n_dead_tup >= (vacuum_threshold + vacuum_scale_factor * n_live_tup)
        THEN 'SHOULD_VACUUM'
        ELSE 'NOT_NEEDED'
    END as vacuum_status,
    -- 计算达到阈值还需要多少死行
    GREATEST(0, (vacuum_threshold + vacuum_scale_factor * n_live_tup)::INTEGER - n_dead_tup) as rows_until_vacuum
FROM autovacuum_config
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC;

性能调优建议

sql
-- 生成个性化的 autovacuum 调优建议
CREATE OR REPLACE FUNCTION autovacuum_tuning_advice()
RETURNS TABLE(
    table_name TEXT,
    current_dead_ratio NUMERIC,
    advice TEXT,
    suggested_config TEXT
) AS $$
DECLARE
    table_record RECORD;
BEGIN
    FOR table_record IN
        SELECT
            t.schemaname,
            t.tablename,
            t.n_live_tup,
            t.n_dead_tup,
            round((t.n_dead_tup::float / GREATEST(t.n_live_tup + t.n_dead_tup, 1) * 100), 2) as dead_ratio,
            pg_size_pretty(pg_total_relation_size(t.schemaname||'.'||t.tablename)) as table_size,
            t.autovacuum_count,
            extract(days from now() - COALESCE(t.last_autovacuum, t.last_vacuum)) as days_since_vacuum
        FROM pg_stat_user_tables t
        WHERE t.n_live_tup > 1000  -- 只关注有一定规模的表
    LOOP
        table_name := table_record.schemaname || '.' || table_record.tablename;
        current_dead_ratio := table_record.dead_ratio;

        -- 根据不同情况给出建议
        IF table_record.dead_ratio > 50 THEN
            advice := 'Critical: High dead tuple ratio, consider immediate manual VACUUM';
            suggested_config := format('autovacuum_vacuum_scale_factor = 0.1, autovacuum_vacuum_threshold = %s',
                                     LEAST(1000, table_record.n_live_tup * 0.05)::INTEGER);
        ELSIF table_record.dead_ratio > 30 THEN
            advice := 'Warning: Moderate dead tuple ratio, tune autovacuum more aggressively';
            suggested_config := format('autovacuum_vacuum_scale_factor = 0.15, autovacuum_vacuum_threshold = %s',
                                     LEAST(2000, table_record.n_live_tup * 0.1)::INTEGER);
        ELSIF table_record.days_since_vacuum > 7 AND table_record.dead_ratio > 10 THEN
            advice := 'Info: Table not vacuumed recently, check autovacuum activity';
            suggested_config := 'Check autovacuum_naptime and autovacuum_max_workers settings';
        ELSIF table_record.autovacuum_count = 0 AND table_record.dead_ratio > 5 THEN
            advice := 'Info: Autovacuum never ran, check if autovacuum is enabled';
            suggested_config := 'Verify autovacuum = on and check table-specific settings';
        ELSE
            advice := 'OK: Table appears to be managed well by autovacuum';
            suggested_config := 'Current configuration appears adequate';
        END IF;

        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- 获取调优建议
SELECT * FROM autovacuum_tuning_advice()
WHERE advice NOT LIKE 'OK:%'
ORDER BY current_dead_ratio DESC;

生产环境注意事项

  1. 资源控制:合理设置 autovacuum_max_workersautovacuum_vacuum_cost_delay
  2. 业务影响:在业务高峰期适当延长 autovacuum_naptime
  3. 监控告警:建立 autovacuum 长时间运行的告警机制
  4. 分区表处理:分区表需要单独处理每个分区,合理规划分区大小 :::

📝 总结与最佳实践

核心要点回顾

  1. 理解 MVCC 机制:UPDATE/DELETE 产生死行,需要 VACUUM 回收空间
  2. 平衡清理策略:标准 VACUUM 用于日常维护,VACUUM FULL 仅用于特殊情况
  3. 重视统计信息:及时的 ANALYZE 是查询性能优化的基础
  4. 监控事务 ID 年龄:防止事务 ID 回绕导致的数据丢失
  5. 合理配置 Autovacuum:根据业务特点个性化配置清理参数

生产环境检查清单

  • 建立表膨胀监控机制
  • 配置事务 ID 年龄告警
  • 根据业务特点调整 autovacuum 参数
  • 定期检查长时间运行的 VACUUM 操作
  • 为大表预留 VACUUM FULL 的维护窗口
  • 监控可见性映射更新状态,优化仅索引扫描效果

通过系统性地应用这些清理策略,可以确保 PostgreSQL 数据库在高负载环境下保持优秀的性能和稳定性。