Appearance
PostgreSQL 窗口函数详解
窗口函数是 PostgreSQL 中一个强大的特性,它提供了在与当前查询行相关的行集上执行计算的能力。与普通聚合函数不同,窗口函数不会将结果集折叠为单行,而是为每行返回一个值。
什么是窗口函数
窗口函数执行与当前行相关的行集(称为"窗口")上的计算。它们必须使用窗口函数语法调用,即需要 OVER
子句。
INFO
窗口函数与聚合函数的区别:
- 聚合函数:将多行数据压缩为单个结果
- 窗口函数:为每行返回一个基于相关行集的计算结果
窗口函数的基本语法
sql
function_name([arguments]) OVER (
[PARTITION BY expression]
[ORDER BY expression [ASC|DESC]]
[frame_specification]
)
语法组件说明
内置窗口函数
1. 行号和排名函数
row_number() - 行号
为分区内的每行分配唯一的序号,从 1 开始。
示例:员工工资排序
sql
-- 创建示例表
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(50),
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- 插入测试数据
INSERT INTO employees (name, department, salary) VALUES
('张三', 'IT', 8000),
('李四', 'IT', 9000),
('王五', 'IT', 7500),
('赵六', 'HR', 6000),
('孙七', 'HR', 6500),
('周八', 'HR', 5500);
-- 使用 row_number() 为每个部门的员工按工资排序
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
FROM employees;
输出结果:
name | department | salary | row_num |
---|---|---|---|
李四 | IT | 9000.00 | 1 |
张三 | IT | 8000.00 | 2 |
王五 | IT | 7500.00 | 3 |
孙七 | HR | 6500.00 | 1 |
赵六 | HR | 6000.00 | 2 |
周八 | HR | 5500.00 | 3 |
分析过程:
PARTITION BY department
:按部门分组ORDER BY salary DESC
:在每个分组内按工资降序排列ROW_NUMBER()
:为每个分区内的行分配唯一序号
rank() - 排名(有间隙)
返回当前行的排名,相同值会得到相同排名,但会留下间隙。
sql
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_num
FROM employees;
dense_rank() - 密集排名(无间隙)
返回当前行的排名,相同值得到相同排名,但不留间隙。
sql
-- 添加相同工资的员工来演示区别
INSERT INTO employees (name, department, salary) VALUES
('钱九', 'IT', 8000); -- 与张三工资相同
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_num,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank_num
FROM employees
WHERE department = 'IT';
输出结果:
name | department | salary | row_num | rank_num | dense_rank_num |
---|---|---|---|---|---|
李四 | IT | 9000.00 | 1 | 1 | 1 |
张三 | IT | 8000.00 | 2 | 2 | 2 |
钱九 | IT | 8000.00 | 3 | 2 | 2 |
王五 | IT | 7500.00 | 4 | 4 | 3 |
TIP
排名函数对比
- ROW_NUMBER():始终返回唯一值,即使有相同数据
- RANK():相同值获得相同排名,后续排名跳跃(1,2,2,4)
- DENSE_RANK():相同值获得相同排名,后续排名连续(1,2,2,3)
2. 百分比函数
percent_rank() - 相对排名
计算公式:(rank - 1) / (分区总行数 - 1)
sql
SELECT
name,
salary,
PERCENT_RANK() OVER (ORDER BY salary) as percent_rank_val
FROM employees
WHERE department = 'IT';
cume_dist() - 累积分布
计算公式:(当前行之前或与当前行对等的行数) / (分区总行数)
sql
SELECT
name,
salary,
CUME_DIST() OVER (ORDER BY salary) as cume_dist_val
FROM employees
WHERE department = 'IT';
3. 分桶函数
ntile() - N 分位数
将分区中的行尽可能均匀地分配到指定数量的桶中。
sql
-- 将员工分为3个薪资等级
SELECT
name,
department,
salary,
NTILE(3) OVER (PARTITION BY department ORDER BY salary DESC) as salary_tier
FROM employees;
业务应用场景:
- 将客户按消费能力分为高、中、低三个等级
- 将产品按销量分为不同档次
- 性能评估中的等级划分
4. 位移函数
lag() - 向前查看
获取分区中当前行之前指定偏移量处的值。
sql
-- 计算每月销售额及环比增长
CREATE TABLE monthly_sales (
month_name VARCHAR(20),
sales_amount DECIMAL(10,2)
);
INSERT INTO monthly_sales VALUES
('2024-01', 100000),
('2024-02', 110000),
('2024-03', 105000),
('2024-04', 120000),
('2024-05', 125000);
SELECT
month_name,
sales_amount,
LAG(sales_amount, 1) OVER (ORDER BY month_name) as prev_month_sales,
sales_amount - LAG(sales_amount, 1) OVER (ORDER BY month_name) as month_growth,
ROUND(
(sales_amount - LAG(sales_amount, 1) OVER (ORDER BY month_name)) * 100.0 /
LAG(sales_amount, 1) OVER (ORDER BY month_name), 2
) as growth_rate_percent
FROM monthly_sales;
输出结果:
month_name | sales_amount | prev_month_sales | month_growth | growth_rate_percent |
---|---|---|---|---|
2024-01 | 100000.00 | NULL | NULL | NULL |
2024-02 | 110000.00 | 100000.00 | 10000.00 | 10.00 |
2024-03 | 105000.00 | 110000.00 | -5000.00 | -4.55 |
2024-04 | 120000.00 | 105000.00 | 15000.00 | 14.29 |
2024-05 | 125000.00 | 120000.00 | 5000.00 | 4.17 |
lead() - 向后查看
获取分区中当前行之后指定偏移量处的值。
sql
-- 预测下个月销售目标
SELECT
month_name,
sales_amount,
LEAD(sales_amount, 1) OVER (ORDER BY month_name) as next_month_sales,
CASE
WHEN LEAD(sales_amount, 1) OVER (ORDER BY month_name) IS NOT NULL
THEN '已知'
ELSE '待预测'
END as status
FROM monthly_sales;
5. 帧内值函数
first_value() - 第一个值
返回窗口帧中第一行的值。
sql
-- 找出每个部门工资最高的员工姓名
SELECT
name,
department,
salary,
FIRST_VALUE(name) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as highest_paid_employee
FROM employees;
last_value() - 最后一个值
返回窗口帧中最后一行的值。
sql
-- 找出每个部门工资最低的员工姓名
SELECT
name,
department,
salary,
LAST_VALUE(name) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as lowest_paid_employee
FROM employees;
WARNING
窗口帧注意事项默认的窗口帧是从分区开始到当前行的最后对等行。对于 last_value()
函数,这可能不会返回期望的结果。建议使用 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
来包含整个分区。
nth_value() - 第 N 个值
返回窗口帧中第 N 行的值。
sql
-- 获取每个部门薪资第二高的员工
SELECT
name,
department,
salary,
NTH_VALUE(name, 2) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as second_highest_paid
FROM employees;
窗口帧规范
窗口帧定义了窗口函数计算时考虑的行范围。
帧类型
- ROWS:基于物理行数
- RANGE:基于值的范围
- GROUPS:基于对等组
帧边界
sql
-- 示例:计算移动平均值
SELECT
month_name,
sales_amount,
-- 当前行及前两行的平均值(3个月移动平均)
AVG(sales_amount) OVER (
ORDER BY month_name
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as moving_avg_3_months,
-- 当前行及前后各一行的平均值
AVG(sales_amount) OVER (
ORDER BY month_name
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) as centered_avg_3_months
FROM monthly_sales;
帧边界选项表
边界类型 | 说明 |
---|---|
UNBOUNDED PRECEDING | 分区的开始 |
n PRECEDING | 当前行之前的第 n 行 |
CURRENT ROW | 当前行 |
n FOLLOWING | 当前行之后的第 n 行 |
UNBOUNDED FOLLOWING | 分区的结束 |
聚合函数作为窗口函数
任何普通聚合函数都可以与 OVER
子句一起使用作为窗口函数。
sql
-- 累计销售额和移动总计
SELECT
month_name,
sales_amount,
-- 累计总计
SUM(sales_amount) OVER (ORDER BY month_name) as cumulative_sales,
-- 最近3个月总计
SUM(sales_amount) OVER (
ORDER BY month_name
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as rolling_3_month_total,
-- 总体平均值
AVG(sales_amount) OVER () as overall_avg,
-- 当前累计平均值
AVG(sales_amount) OVER (ORDER BY month_name) as cumulative_avg
FROM monthly_sales;
实际业务应用案例
案例 1:销售排行榜
sql
-- 创建销售数据表
CREATE TABLE sales_data (
salesperson VARCHAR(50),
region VARCHAR(50),
quarter VARCHAR(10),
sales_amount DECIMAL(12,2)
);
INSERT INTO sales_data VALUES
('张三', '华北', '2024Q1', 250000),
('李四', '华北', '2024Q1', 300000),
('王五', '华南', '2024Q1', 280000),
('赵六', '华南', '2024Q1', 320000),
('张三', '华北', '2024Q2', 280000),
('李四', '华北', '2024Q2', 290000),
('王五', '华南', '2024Q2', 310000),
('赵六', '华南', '2024Q2', 340000);
-- 生成销售排行榜
SELECT
salesperson,
region,
quarter,
sales_amount,
-- 区域内排名
RANK() OVER (PARTITION BY region, quarter ORDER BY sales_amount DESC) as region_rank,
-- 全国排名
RANK() OVER (PARTITION BY quarter ORDER BY sales_amount DESC) as national_rank,
-- 与区域平均的差异
sales_amount - AVG(sales_amount) OVER (PARTITION BY region, quarter) as diff_from_region_avg,
-- 销售额占区域比例
ROUND(
sales_amount * 100.0 / SUM(sales_amount) OVER (PARTITION BY region, quarter),
2
) as region_share_percent
FROM sales_data
ORDER BY quarter, region, sales_amount DESC;
案例 2:客户价值分析
sql
-- 客户购买数据
CREATE TABLE customer_orders (
customer_id INTEGER,
order_date DATE,
order_amount DECIMAL(10,2)
);
-- 客户价值分析
WITH customer_metrics AS (
SELECT
customer_id,
COUNT(*) as total_orders,
SUM(order_amount) as total_spent,
AVG(order_amount) as avg_order_value,
MAX(order_date) as last_order_date,
MIN(order_date) as first_order_date
FROM customer_orders
GROUP BY customer_id
)
SELECT
customer_id,
total_orders,
total_spent,
avg_order_value,
-- 客户价值分层(RFM模型中的M-金额)
NTILE(5) OVER (ORDER BY total_spent DESC) as value_tier,
-- 消费排名
PERCENT_RANK() OVER (ORDER BY total_spent) as spending_percentile,
-- 与平均消费的比较
total_spent - AVG(total_spent) OVER () as diff_from_avg,
-- 订单频次排名
RANK() OVER (ORDER BY total_orders DESC) as frequency_rank
FROM customer_metrics;
性能优化建议
TIP
性能优化要点
- 索引优化:为
PARTITION BY
和ORDER BY
子句中的列创建索引 - 分区裁剪:合理使用
PARTITION BY
减少计算范围 - 窗口帧大小:避免使用过大的窗口帧,特别是
UNBOUNDED FOLLOWING
- 函数选择:根据实际需求选择合适的窗口函数
索引建议
sql
-- 为窗口函数优化创建索引
CREATE INDEX idx_employees_dept_salary ON employees (department, salary DESC);
CREATE INDEX idx_sales_month ON monthly_sales (month_name);
常见错误和注意事项
WARNING
常见陷阱
- 窗口帧理解错误:默认帧不包含
UNBOUNDED FOLLOWING
- NULL 值处理:窗口函数对 NULL 值的处理需要特别注意
- 性能问题:大数据量时窗口函数可能影响性能
- 排序一致性:确保 ORDER BY 子句的排序具有确定性
错误示例和修正
sql
-- ❌ 错误:last_value可能不返回期望结果
SELECT
name,
salary,
LAST_VALUE(salary) OVER (ORDER BY salary) as max_salary_wrong
FROM employees;
-- ✅ 正确:明确指定窗口帧
SELECT
name,
salary,
LAST_VALUE(salary) OVER (
ORDER BY salary
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as max_salary_correct
FROM employees;
总结
窗口函数是 PostgreSQL 中处理分析型查询的强大工具,它们允许你:
- 🔍 灵活分析:在保持行级详细信息的同时进行聚合计算
- 📊 排名分析:实现各种排名和百分比计算
- 📈 趋势分析:计算移动平均、累计值等时间序列指标
- 🎯 分组比较:在不同分区间进行比较分析
通过掌握窗口函数,你可以编写更高效、更优雅的 SQL 查询,处理复杂的业务分析需求。
掌握这些概念和技巧,你就能够在实际工作中有效运用 PostgreSQL 的窗口函数解决复杂的数据分析问题。