PostgreSQL 进度报告详解
概述
PostgreSQL 提供了强大的进度报告功能,让我们能够实时监控长时间运行的数据库操作。这对于生产环境中的运维管理至关重要,特别是在处理大型数据库时,能够帮助我们:
- 预估完成时间:了解操作还需要多长时间完成
- 资源监控:观察系统资源使用情况
- 问题诊断:识别性能瓶颈和异常情况
- 用户体验:为应用层提供进度反馈
INFO
PostgreSQL 当前支持进度报告的命令包括:ANALYZE、CLUSTER、CREATE INDEX、VACUUM、COPY 和 BASE_BACKUP(基本备份)。
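可以先确认当前版本提供了哪些进度视图,下面是一个直接查询系统目录的简单示例:
sql
-- 列出当前服务器上可用的进度报告视图
SELECT viewname
FROM pg_views
WHERE schemaname = 'pg_catalog'
AND viewname LIKE 'pg_stat_progress%'
ORDER BY viewname;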
ANALYZE 进度报告
业务场景
在数据仓库环境中,我们经常需要对大表执行 ANALYZE 来更新统计信息,这对查询优化器生成高效执行计划至关重要。
监控视图:pg_stat_progress_analyze
| 列名 | 类型 | 描述 | 业务意义 |
|---|---|---|---|
| pid | integer | 后端进程 ID | 用于识别和管理特定的分析任务 |
| datname | name | 数据库名称 | 确认正在分析的数据库 |
| relid | oid | 表的 OID | 识别正在分析的具体表 |
| phase | text | 当前阶段 | 了解分析进展到哪个步骤 |
| sample_blks_total | bigint | 待采样总块数 | 预估工作量 |
| sample_blks_scanned | bigint | 已扫描块数 | 当前进度 |
实际应用示例
问题陈述:需要对一个包含1000万行记录的订单表执行分析,并实时监控进度。
解决方案:
sql
-- 会话1:启动分析任务
ANALYZE orders;
-- 会话2:监控分析进度
SELECT
pid,
datname,
phase,
sample_blks_scanned,
sample_blks_total,
ROUND(
(sample_blks_scanned::numeric / NULLIF(sample_blks_total, 0)::numeric) * 100, 2
) AS progress_percentage,
ext_stats_computed,
ext_stats_total
FROM pg_stat_progress_analyze
WHERE relid = 'orders'::regclass;
分析过程:
- 初始化阶段:PostgreSQL 准备分析任务,时间很短
- 采样阶段:按照 default_statistics_target 的设置对数据块进行采样
- 统计计算:基于样本计算表的统计信息
- 更新系统表:将统计信息写入 pg_statistic
输入和输出示例:
sql
-- 输入:监控查询
SELECT pid, phase, sample_blks_scanned, sample_blks_total
FROM pg_stat_progress_analyze;
-- 输出:进度信息
pid | phase | sample_blks_scanned | sample_blks_total
------+--------------------------+--------------------+-------------------
1234 | acquiring sample rows | 1500 | 25000
优化建议
对于大表分析,可以调整 default_statistics_target 参数来平衡分析精度和执行时间,示例如下。
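下面是一个最小示例,演示如何在会话级别调整采样目标,以及如何只为个别列提高统计目标(其中 customer_id 为假设的列名,数值仅供参考):
sql
-- 仅在当前会话降低采样目标,加快一次性的大表分析
SET default_statistics_target = 50;
ANALYZE orders;
RESET default_statistics_target;
-- 也可以只为查询频繁的列提高统计目标(customer_id 为示例列)
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders (customer_id);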
CLUSTER 进度报告
业务场景
在数据整理和性能优化中,CLUSTER 操作能够根据索引顺序物理重排表数据,提高相关查询的性能。这在 OLAP 场景中特别有用。
监控视图:pg_stat_progress_cluster
sql
-- 监控 CLUSTER 操作进度
SELECT
pid,
command, -- CLUSTER 或 VACUUM FULL
phase,
cluster_index_relid::regclass AS cluster_index,
heap_tuples_scanned,
heap_tuples_written,
heap_blks_total,
heap_blks_scanned,
ROUND(
(heap_blks_scanned::numeric / NULLIF(heap_blks_total, 0)::numeric) * 100, 2
) AS scan_progress_percentage
FROM pg_stat_progress_cluster;
实际应用示例
问题陈述:对按时间分区的日志表执行 CLUSTER 操作,按时间戳索引重新排列数据以提高查询性能。
解决方案:
sql
-- 创建测试表和索引
CREATE TABLE access_logs (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ,
user_id INTEGER,
action TEXT
);
-- 创建时间戳索引
CREATE INDEX idx_access_logs_timestamp ON access_logs(timestamp);
-- 执行 CLUSTER 操作
CLUSTER access_logs USING idx_access_logs_timestamp;
监控脚本:
sql
-- 实时监控 CLUSTER 进度
WITH cluster_progress AS (
SELECT
pid,
phase,
heap_tuples_scanned,
heap_tuples_written,
heap_blks_total,
heap_blks_scanned,
CASE
WHEN heap_blks_total > 0 THEN
ROUND((heap_blks_scanned::numeric / heap_blks_total::numeric) * 100, 2)
ELSE 0
END AS progress_pct
FROM pg_stat_progress_cluster
)
SELECT
pid,
phase,
progress_pct || '%' AS progress,
heap_tuples_scanned AS tuples_processed,
heap_tuples_written AS tuples_written
FROM cluster_progress;
注意事项
CLUSTER 操作会对表加 ACCESS EXCLUSIVE 锁,期间读写都会被阻塞,在生产环境中建议在维护窗口期间执行
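如果需要确认 CLUSTER 执行期间是否有会话被阻塞,可以结合 pg_stat_activity 和 pg_blocking_pids() 观察,下面是一个简单的示意查询(列出的是所有被阻塞的会话,不限于被 CLUSTER 阻塞):
sql
-- 查看当前被其他会话阻塞的会话及其阻塞来源
SELECT a.pid, a.wait_event_type, a.query, pg_blocking_pids(a.pid) AS blocked_by
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;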
COPY 进度报告
业务场景
数据迁移和批量导入是数据库管理中的常见任务。COPY 命令的进度监控对于大文件导入非常重要。
监控视图:pg_stat_progress_copy
| 关键指标 | 说明 | 业务价值 |
|---|---|---|
| bytes_processed | 已处理字节数 | 了解数据传输进度 |
| bytes_total | 总文件大小 | 预估完成时间 |
| tuples_processed | 已处理元组数 | 了解记录导入数量 |
| tuples_excluded | 被 WHERE 子句排除的元组数 | 了解过滤效果 |
实际应用示例
问题陈述:从 CSV 文件导入 500 万条用户数据,需要实时监控导入进度和错误处理。
解决方案:
sql
-- 创建目标表
CREATE TABLE users_import (
id INTEGER,
name VARCHAR(100),
email VARCHAR(100),
created_at TIMESTAMP,
status VARCHAR(20)
);
-- 执行 COPY 操作(在一个会话中)
COPY users_import FROM '/data/users.csv'
WITH (
FORMAT csv,
HEADER true,
DELIMITER ',',
ON_ERROR stop -- 遇到错误时停止
);
监控脚本:
sql
-- 监控 COPY 进度(在另一个会话中)
SELECT
pid,
command,
type,
bytes_processed,
bytes_total,
CASE
WHEN bytes_total > 0 THEN
ROUND((bytes_processed::numeric / bytes_total::numeric) * 100, 2)
ELSE 0
END AS progress_percentage,
tuples_processed,
tuples_excluded,
tuples_skipped,
-- 计算处理速度 (MB/s)
ROUND(bytes_processed / 1024.0 / 1024.0, 2) AS mb_processed
FROM pg_stat_progress_copy;
输出示例:
pid | command | type | bytes_processed | bytes_total | progress_percentage | tuples_processed
------+-----------+------+----------------+-------------+--------------------+------------------
5678 | COPY FROM | FILE | 104857600 | 524288000 | 20.00 | 100000
高级监控查询:
sql
-- 包含速度和预估时间的详细监控
WITH copy_stats AS (
SELECT
c.pid,
c.command,
c.bytes_processed,
c.bytes_total,
c.tuples_processed,
-- 通过 pg_stat_activity 获取该后端的启动时间,用于计算耗时
extract(epoch from (now() - a.backend_start)) AS elapsed_seconds
FROM pg_stat_progress_copy c
JOIN pg_stat_activity a ON a.pid = c.pid
)
SELECT
pid,
command,
ROUND(bytes_processed / 1024.0 / 1024.0, 2) AS mb_processed,
ROUND(bytes_total / 1024.0 / 1024.0, 2) AS mb_total,
ROUND((bytes_processed::numeric / bytes_total::numeric) * 100, 2) AS progress_pct,
tuples_processed,
ROUND(bytes_processed / 1024.0 / 1024.0 / elapsed_seconds, 2) AS mb_per_second,
ROUND((bytes_total - bytes_processed) / 1024.0 / 1024.0 /
(bytes_processed / 1024.0 / 1024.0 / elapsed_seconds), 0) AS eta_seconds
FROM copy_stats
WHERE bytes_total > 0 AND bytes_processed > 0 AND elapsed_seconds > 0;
错误处理策略
使用 ON_ERROR 选项可以控制遇到错误时的行为:
- stop:遇到错误立即停止(默认)
- ignore:忽略错误行继续处理(示例见下)
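作为对比,下面给出一个使用 ON_ERROR ignore 的导入示例(沿用上文的 users_import 表和文件路径,该选项需要 PostgreSQL 17 及以上版本):
sql
-- 忽略格式错误的行并继续导入
COPY users_import FROM '/data/users.csv'
WITH (
FORMAT csv,
HEADER true,
ON_ERROR ignore
);
-- 被跳过的错误行数会反映在进度视图的 tuples_skipped 列中
SELECT pid, tuples_processed, tuples_skipped
FROM pg_stat_progress_copy;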
CREATE INDEX 进度报告
业务场景
在生产环境中创建索引,特别是在大表上创建索引,可能需要几个小时甚至更长时间。进度监控帮助我们:
- 评估索引创建对系统性能的影响
- 决定是否需要并发索引创建
- 监控不同阶段的资源使用
监控视图:pg_stat_progress_create_index
实际应用示例
问题陈述:在包含 2000 万行的交易表上创建复合索引,需要使用并发方式以减少对业务的影响。
解决方案:
sql
-- 创建并发索引以减少锁定时间
CREATE INDEX CONCURRENTLY idx_transactions_user_date
ON transactions(user_id, transaction_date)
WHERE status = 'completed';
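使用 CONCURRENTLY 时,如果构建中途失败或被取消,会留下一个标记为 INVALID 的索引。下面是一个示意查询,可在操作结束后检查是否存在这类索引:
sql
-- 查找当前数据库中无效(INVALID)的索引
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;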
详细监控脚本:
sql
-- 全面的索引创建进度监控
SELECT
pid,
datname,
put.schemaname,
put.relname AS tablename,
pui.indexrelname AS indexname,
command,
phase,
-- 锁等待信息
lockers_total,
lockers_done,
current_locker_pid,
-- 处理进度
blocks_total,
blocks_done,
CASE
WHEN blocks_total > 0 THEN
ROUND((blocks_done::numeric / blocks_total::numeric) * 100, 2)
ELSE 0
END AS blocks_progress_pct,
-- 元组处理进度
tuples_total,
tuples_done,
CASE
WHEN tuples_total > 0 THEN
ROUND((tuples_done::numeric / tuples_total::numeric) * 100, 2)
ELSE 0
END AS tuples_progress_pct,
-- 分区信息(如果适用)
partitions_total,
partitions_done
FROM pg_stat_progress_create_index pci
LEFT JOIN pg_stat_user_indexes pui ON pci.index_relid = pui.indexrelid
LEFT JOIN pg_stat_user_tables put ON pci.relid = put.relid;
阶段解释和监控要点:
sql
-- 查看当前是否在初始化阶段
SELECT pid, phase, command
FROM pg_stat_progress_create_index
WHERE phase = 'initializing';
-- 这个阶段通常很短,主要是准备工作
sql
-- 监控锁等待情况
SELECT
pid,
phase,
lockers_total,
lockers_done,
current_locker_pid,
ROUND((lockers_done::numeric / NULLIF(lockers_total, 0)::numeric) * 100, 2) AS lock_progress
FROM pg_stat_progress_create_index
WHERE phase LIKE '%waiting%';
sql
-- 监控索引构建进度
SELECT
pid,
phase,
blocks_done,
blocks_total,
tuples_done,
tuples_total,
ROUND((blocks_done::numeric / NULLIF(blocks_total, 0)::numeric) * 100, 2) AS progress
FROM pg_stat_progress_create_index
WHERE phase = 'building index';
性能优化建议:
索引创建优化
- 调整 maintenance_work_mem:索引构建时的排序使用该参数限定的内存,适当增大可以加快创建速度(见下方示例)
- 并发创建:使用 CREATE INDEX CONCURRENTLY 减少锁定时间
- 分批创建:对于多个索引,避免同时创建
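下面是一个最小示例,演示如何在会话级别临时增大 maintenance_work_mem 后再执行上文的并发建索引(数值仅为示意,需结合服务器可用内存):
sql
-- 仅影响当前会话,构建完成后恢复默认值
SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY idx_transactions_user_date
ON transactions(user_id, transaction_date)
WHERE status = 'completed';
RESET maintenance_work_mem;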
VACUUM 进度报告
业务场景
VACUUM 是 PostgreSQL 日常维护的核心操作,需要定期执行以清理死元组并回收空间(配合 ANALYZE 时还会更新统计信息)。在大表上执行 VACUUM 可能耗时很长。
监控视图:pg_stat_progress_vacuum
实际应用示例
问题陈述:对一个频繁更新的用户行为日志表执行 VACUUM 操作,需要监控空间回收效果。
解决方案:
sql
-- 执行 VACUUM 操作
VACUUM (VERBOSE, ANALYZE) user_behavior_logs;
综合监控脚本:
sql
-- VACUUM 进度的全面监控
WITH vacuum_progress AS (
SELECT
pid,
datname,
schemaname,
put.relname AS tablename,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed,
index_vacuum_count,
max_dead_tuple_bytes,
dead_tuple_bytes,
num_dead_item_ids,
indexes_total,
indexes_processed
FROM pg_stat_progress_vacuum pv
LEFT JOIN pg_stat_user_tables put ON pv.relid = put.relid
),
vacuum_metrics AS (
SELECT
*,
CASE
WHEN heap_blks_total > 0 THEN
ROUND((heap_blks_scanned::numeric / heap_blks_total::numeric) * 100, 2)
ELSE 0
END AS scan_progress_pct,
CASE
WHEN max_dead_tuple_bytes > 0 THEN
ROUND((dead_tuple_bytes::numeric / max_dead_tuple_bytes::numeric) * 100, 2)
ELSE 0
END AS dead_tuple_fill_pct,
CASE
WHEN indexes_total > 0 THEN
ROUND((indexes_processed::numeric / indexes_total::numeric) * 100, 2)
ELSE 0
END AS index_progress_pct
FROM vacuum_progress
)
SELECT
pid,
tablename,
phase,
scan_progress_pct || '%' AS scan_progress,
heap_blks_scanned || '/' || heap_blks_total AS blocks_scanned,
heap_blks_vacuumed AS blocks_vacuumed,
index_vacuum_count AS index_cycles,
dead_tuple_fill_pct || '%' AS dead_tuple_buffer,
ROUND(dead_tuple_bytes / 1024.0 / 1024.0, 2) AS dead_tuple_mb,
index_progress_pct || '%' AS index_progress
FROM vacuum_metrics;
阶段详细分析:
| 阶段 | 作用 | 监控重点 | 优化建议 |
|---|---|---|---|
| 扫描堆 | 标记已删除元组 | heap_blks_scanned / heap_blks_total | 增加 maintenance_work_mem |
| 清理索引 | 清理索引中的死元组 | index_vacuum_count | 监控索引数量和大小 |
| 清理堆 | 回收堆中的空间 | heap_blks_vacuumed | 观察实际回收的块数 |
| 截断堆 | 释放文件末尾的空页 | 无特定指标 | 需要能短暂获取 ACCESS EXCLUSIVE 锁(见下方示例) |
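如果截断阶段的排他锁会影响在线业务,可以在必要时通过 TRUNCATE 选项跳过该阶段,示例如下:
sql
-- 跳过截断阶段,避免 VACUUM 结束时短暂的 ACCESS EXCLUSIVE 锁
VACUUM (VERBOSE, TRUNCATE false) user_behavior_logs;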
死元组管理监控:
sql
-- 监控死元组累积情况
SELECT
pid,
put.relname AS tablename,
phase,
dead_tuple_bytes,
max_dead_tuple_bytes,
num_dead_item_ids,
ROUND(
(dead_tuple_bytes::numeric / NULLIF(max_dead_tuple_bytes, 0)::numeric) * 100, 2
) AS memory_usage_pct,
CASE
WHEN dead_tuple_bytes >= max_dead_tuple_bytes THEN
'FULL - 将触发索引清理'
WHEN dead_tuple_bytes > max_dead_tuple_bytes * 0.8 THEN
'HIGH - 接近触发阈值'
ELSE
'OK'
END AS memory_status
FROM pg_stat_progress_vacuum pv
LEFT JOIN pg_stat_user_tables put ON pv.relid = put.relid;
重要提醒
当 dead_tuple_bytes 达到 max_dead_tuple_bytes 时,VACUUM 将触发一轮索引清理循环,这可能显著增加执行时间
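死元组内存的上限由 maintenance_work_mem 决定(autovacuum 则由 autovacuum_work_mem 控制)。下面是一个示例,在手动 VACUUM 前临时增大该参数,以减少索引清理循环的次数(数值仅为示意):
sql
-- 仅影响当前会话
SET maintenance_work_mem = '1GB';
VACUUM (VERBOSE, ANALYZE) user_behavior_logs;
RESET maintenance_work_mem;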
基本备份进度报告
业务场景
基本备份是 PostgreSQL 灾难恢复的重要组成部分。对于大型数据库,备份可能需要几个小时,进度监控至关重要。
监控视图:pg_stat_progress_basebackup
实际应用示例
问题陈述:执行生产数据库的基本备份,需要监控备份进度和预估完成时间。
解决方案:
bash
# 启动 pg_basebackup(在终端中执行)
pg_basebackup -D /backup/postgres_backup_$(date +%Y%m%d) \
-Ft -z -P -v \
--wal-method=stream \
--write-recovery-conf
监控脚本:
sql
-- 监控基本备份进度
WITH backup_progress AS (
SELECT
b.pid,
b.phase,
b.backup_total,
b.backup_streamed,
b.tablespaces_total,
b.tablespaces_streamed,
a.backend_start, -- 从 pg_stat_activity 获取该后端的启动时间
CASE
WHEN b.backup_total > 0 THEN
ROUND((b.backup_streamed::numeric / b.backup_total::numeric) * 100, 2)
ELSE 0
END AS backup_progress_pct,
CASE
WHEN b.tablespaces_total > 0 THEN
ROUND((b.tablespaces_streamed::numeric / b.tablespaces_total::numeric) * 100, 2)
ELSE 0
END AS tablespace_progress_pct
FROM pg_stat_progress_basebackup b
JOIN pg_stat_activity a ON a.pid = b.pid
),
backup_metrics AS (
SELECT
*,
ROUND(backup_streamed / 1024.0 / 1024.0 / 1024.0, 2) AS gb_streamed,
ROUND(backup_total / 1024.0 / 1024.0 / 1024.0, 2) AS gb_total,
-- 估算剩余时间(假设当前速度持续)
CASE
WHEN backup_streamed > 0 AND backup_total > backup_streamed THEN
ROUND(
(backup_total - backup_streamed) /
(backup_streamed / EXTRACT(epoch FROM (now() - backend_start))) / 60,
1
)
ELSE 0
END AS eta_minutes
FROM backup_progress
)
SELECT
pid,
phase,
backup_progress_pct || '%' AS backup_progress,
gb_streamed || '/' || gb_total AS data_gb,
tablespace_progress_pct || '%' AS tablespace_progress,
tablespaces_streamed || '/' || tablespaces_total AS tablespaces,
eta_minutes || ' 分钟' AS estimated_remaining
FROM backup_metrics;
备份速度监控:
sql
-- 监控备份速度和性能指标
WITH backup_speed AS (
SELECT
b.pid,
b.phase,
b.backup_streamed,
EXTRACT(epoch FROM (now() - a.backend_start)) AS elapsed_seconds,
b.backup_total
FROM pg_stat_progress_basebackup b
JOIN pg_stat_activity a ON a.pid = b.pid
)
SELECT
pid,
phase,
ROUND(backup_streamed / 1024.0 / 1024.0, 2) AS mb_completed,
ROUND(elapsed_seconds / 60, 1) AS elapsed_minutes,
ROUND(
backup_streamed / 1024.0 / 1024.0 / elapsed_seconds, 2
) AS mb_per_second,
ROUND(
backup_streamed / 1024.0 / 1024.0 / elapsed_seconds * 60, 2
) AS mb_per_minute,
CASE
WHEN backup_total > 0 AND backup_streamed > 0 THEN
ROUND(
(backup_total - backup_streamed) /
(backup_streamed / elapsed_seconds) / 60, 1
)
ELSE 0
END AS eta_minutes
FROM backup_speed
WHERE elapsed_seconds > 0;
阶段解释:
各阶段详细说明
初始化阶段:
- WAL 发送进程准备开始备份
- 通常非常短暂,几秒钟内完成
等待检查点完成:
- 执行 pg_backup_start 准备备份
- 等待备份开始检查点完成
- 确保备份的一致性
估计备份大小:
- 计算需要传输的数据总量
- 为进度计算提供基准
流式传输数据库文件:
- 主要的备份阶段
- 传输所有数据库文件
- 占用大部分备份时间
等待 WAL 归档完成:
- 确保备份期间的 WAL 文件已归档
- 保证备份的完整性(归档状态可结合下方的 pg_stat_archiver 查询观察)
传输 WAL 文件:
- 仅在使用 --wal-method=fetch 时执行
- 传输备份期间生成的 WAL 文件
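当备份长时间停留在等待 WAL 归档完成阶段时,可以结合 pg_stat_archiver 视图确认归档是否正常推进,下面是一个简单的查询示例:
sql
-- 观察 WAL 归档进度与失败情况
SELECT archived_count, last_archived_wal, last_archived_time,
failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;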
进度监控最佳实践
通用监控脚本
创建一个通用的进度监控脚本,能够监控所有类型的长时间运行操作:
sql
-- 通用进度监控视图
CREATE OR REPLACE VIEW v_operation_progress AS
SELECT
'ANALYZE' as operation_type,
pid,
datname,
phase,
COALESCE(
ROUND((sample_blks_scanned::numeric / NULLIF(sample_blks_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
sample_blks_scanned::text || '/' || sample_blks_total::text as detail
FROM pg_stat_progress_analyze
UNION ALL
SELECT
'VACUUM' as operation_type,
pid,
datname,
phase,
COALESCE(
ROUND((heap_blks_scanned::numeric / NULLIF(heap_blks_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
heap_blks_scanned::text || '/' || heap_blks_total::text as detail
FROM pg_stat_progress_vacuum
UNION ALL
SELECT
'CREATE INDEX' as operation_type,
pid,
datname,
phase,
COALESCE(
ROUND((blocks_done::numeric / NULLIF(blocks_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
blocks_done::text || '/' || blocks_total::text as detail
FROM pg_stat_progress_create_index
UNION ALL
SELECT
'CLUSTER' as operation_type,
pid,
datname,
phase,
COALESCE(
ROUND((heap_blks_scanned::numeric / NULLIF(heap_blks_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
heap_blks_scanned::text || '/' || heap_blks_total::text as detail
FROM pg_stat_progress_cluster
UNION ALL
SELECT
'COPY' as operation_type,
pid,
datname,
'processing' as phase,
COALESCE(
ROUND((bytes_processed::numeric / NULLIF(bytes_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
ROUND(bytes_processed/1024.0/1024.0, 2)::text || '/' ||
ROUND(bytes_total/1024.0/1024.0, 2)::text || ' MB' as detail
FROM pg_stat_progress_copy
UNION ALL
SELECT
'BASE BACKUP' as operation_type,
pid,
'backup' as datname,
phase,
COALESCE(
ROUND((backup_streamed::numeric / NULLIF(backup_total, 0)::numeric) * 100, 2),
0
) as progress_percentage,
ROUND(backup_streamed/1024.0/1024.0/1024.0, 2)::text || '/' ||
ROUND(backup_total/1024.0/1024.0/1024.0, 2)::text || ' GB' as detail
FROM pg_stat_progress_basebackup;
-- 使用通用监控视图,并关联 pg_stat_activity 获取各操作的启动时间
SELECT
op.operation_type,
op.pid,
op.datname,
op.phase,
op.progress_percentage || '%' as progress,
op.detail,
a.backend_start as start_time,
now() - a.backend_start as elapsed_time
FROM v_operation_progress op
JOIN pg_stat_activity a ON a.pid = op.pid
ORDER BY op.operation_type, op.pid;
自动化监控脚本
bash
#!/bin/bash
# progress_monitor.sh - PostgreSQL 操作进度监控脚本
PGHOST=${PGHOST:-localhost}
PGPORT=${PGPORT:-5432}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-postgres}
echo "PostgreSQL 操作进度监控 - $(date)"
echo "========================================"
# 检查是否有正在运行的长时间操作
psql -h $PGHOST -p $PGPORT -U $PGUSER -d $PGDATABASE -c "
SELECT
op.operation_type,
op.pid,
op.datname,
op.phase,
op.progress_percentage || '%' as progress,
op.detail,
EXTRACT(epoch FROM (now() - a.backend_start))::int as elapsed_seconds
FROM v_operation_progress op
JOIN pg_stat_activity a ON a.pid = op.pid
ORDER BY op.operation_type, op.pid;
" 2>/dev/null
# 如果没有操作在运行,显示提示信息
if [ $? -eq 0 ]; then
COUNT=$(psql -h $PGHOST -p $PGPORT -U $PGUSER -d $PGDATABASE -t -c "
SELECT COUNT(*) FROM v_operation_progress;
" 2>/dev/null | tr -d ' ')
if [ "$COUNT" = "0" ]; then
echo "当前没有长时间运行的操作"
fi
fi
告警和通知
sql
-- 创建监控函数,当操作运行时间过长时发送告警
CREATE OR REPLACE FUNCTION check_long_running_operations(
max_minutes INTEGER DEFAULT 60
) RETURNS TABLE(
operation_type TEXT,
pid INTEGER,
database_name TEXT,
elapsed_minutes NUMERIC,
progress_pct NUMERIC,
phase TEXT
) AS $$
BEGIN
RETURN QUERY
SELECT
op.operation_type::TEXT,
op.pid::INTEGER,
op.datname::TEXT,
ROUND(EXTRACT(epoch FROM (now() - a.backend_start)) / 60, 1) as elapsed_minutes,
op.progress_percentage,
op.phase::TEXT
FROM v_operation_progress op
JOIN pg_stat_activity a ON a.pid = op.pid
WHERE EXTRACT(epoch FROM (now() - a.backend_start)) / 60 > max_minutes;
END;
$$ LANGUAGE plpgsql;
-- 检查运行超过2小时的操作
SELECT * FROM check_long_running_operations(120);
监控建议
- 定期检查:每 5-10 分钟检查一次进度(可参考下方的 \watch 示例)
- 设置告警:对运行时间异常长的操作设置告警
- 资源监控:结合系统资源使用情况(CPU、内存、I/O)
- 历史记录:保存历史进度数据用于性能分析
- 自动化:将监控集成到现有的监控系统中
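在交互式排查时,psql 的 \watch 元命令可以按固定间隔自动重复执行查询,下面是一个简单示例:
sql
-- 在 psql 中每 10 秒刷新一次进度(Ctrl+C 停止)
SELECT * FROM v_operation_progress ORDER BY operation_type, pid \watch 10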
总结
PostgreSQL 的进度报告功能为数据库管理员提供了强大的运维工具:
- 实时可见性:了解长时间运行操作的当前状态
- 性能优化:识别瓶颈并调整配置参数
- 容量规划:基于历史数据预估未来操作时间
- 用户体验:为应用层提供准确的进度信息
- 故障诊断:快速识别异常情况并采取行动
通过合理使用这些进度报告功能,可以显著提高数据库运维效率和系统可靠性。