Appearance
PostgreSQL 聚合函数详解
概述
聚合函数是 PostgreSQL 中从一组输入值计算单个结果的特殊函数。它们在数据分析、报表生成和统计计算中发挥着至关重要的作用。PostgreSQL 提供了丰富的内置聚合函数,包括通用聚合函数、统计聚合函数、有序集聚合函数等。
INFO
聚合函数的特点
- 输入多行,输出单行:从多个输入值计算出一个结果
- 支持部分模式:某些聚合函数支持并行聚合优化
- 处理 NULL 值:大多数聚合函数会忽略 NULL 值
- 可选排序:支持 ORDER BY 子句控制输入顺序
聚合函数工作原理
通用聚合函数
1. 计数函数 (COUNT)
基本用法
sql
-- 计算所有行数(包括 NULL 值)
COUNT(*)
-- 计算非空值行数
COUNT(column_name)
实际业务场景:电商订单统计
假设我们有一个订单表:
sql
-- 创建订单表
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER,
order_date DATE,
total_amount DECIMAL(10,2),
status VARCHAR(20)
);
-- 插入测试数据
INSERT INTO orders (customer_id, order_date, total_amount, status) VALUES
(1, '2024-01-15', 299.99, 'completed'),
(2, '2024-01-16', NULL, 'pending'),
(1, '2024-01-17', 159.50, 'completed'),
(3, '2024-01-18', 89.99, 'cancelled'),
(2, '2024-01-19', 449.99, 'completed');
示例分析:
sql
-- 统计总订单数
SELECT COUNT(*) as total_orders FROM orders;
-- 输出:5
-- 统计有金额的订单数
SELECT COUNT(total_amount) as orders_with_amount FROM orders;
-- 输出:4(排除了 NULL 值)
-- 按状态统计订单数
SELECT status, COUNT(*) as order_count
FROM orders
GROUP BY status;
预期输出: | status | order_count | |--------|-------------| | completed | 3 | | pending | 1 | | cancelled | 1 |
TIP
性能提示 COUNT(*)
比 COUNT(column_name)
性能更好,因为它不需要检查列值是否为 NULL。
2. 求和函数 (SUM)
语法格式
sql
SUM(numeric_expression)
业务场景:销售额统计
sql
-- 计算总销售额
SELECT SUM(total_amount) as total_sales FROM orders;
-- 输出:999.47
-- 按客户统计销售额
SELECT
customer_id,
SUM(total_amount) as customer_total,
COUNT(*) as order_count
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY customer_id
ORDER BY customer_total DESC;
预期输出: | customer_id | customer_total | order_count | |-------------|----------------|-------------| | 2 | 449.99 | 1 | | 1 | 459.49 | 2 | | 3 | 89.99 | 1 |
WARNING
注意事项当没有匹配行时,SUM 返回 NULL 而不是 0。如需返回 0,使用 COALESCE(SUM(column), 0)
。
3. 平均值函数 (AVG)
业务场景:客户平均订单金额
sql
-- 计算平均订单金额
SELECT
ROUND(AVG(total_amount), 2) as avg_order_amount
FROM orders
WHERE total_amount IS NOT NULL;
-- 输出:249.87
-- 按月统计平均订单金额
SELECT
EXTRACT(MONTH FROM order_date) as month,
ROUND(AVG(total_amount), 2) as avg_monthly_amount,
COUNT(*) as orders_count
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY EXTRACT(MONTH FROM order_date)
ORDER BY month;
4. 最大值和最小值函数 (MAX/MIN)
业务场景:价格区间分析
sql
-- 查找订单金额的最大值和最小值
SELECT
MAX(total_amount) as highest_order,
MIN(total_amount) as lowest_order,
MAX(total_amount) - MIN(total_amount) as price_range
FROM orders;
预期输出: | highest_order | lowest_order | price_range | |---------------|--------------|-------------| | 449.99 | 89.99 | 360.00 |
5. 数组聚合函数 (ARRAY_AGG)
业务场景:客户订单 ID 收集
sql
-- 收集每个客户的所有订单ID
SELECT
customer_id,
ARRAY_AGG(order_id ORDER BY order_date) as order_ids,
ARRAY_AGG(total_amount ORDER BY order_date) as order_amounts
FROM orders
GROUP BY customer_id;
预期输出示例: | customer_id | order_ids | order_amounts | |-------------|-----------|---------------| | 1 | {1,3} | {299.99,159.50} | | 2 | {2,5} | {NULL,449.99} | | 3 | {4} | {89.99} |
6. 字符串聚合函数 (STRING_AGG)
业务场景:状态摘要生成
sql
-- 为每个客户生成订单状态摘要
SELECT
customer_id,
STRING_AGG(DISTINCT status, ', ' ORDER BY status) as status_summary,
COUNT(*) as total_orders
FROM orders
GROUP BY customer_id;
预期输出: | customer_id | status_summary | total_orders | |-------------|----------------|--------------| | 1 | completed | 2 | | 2 | completed, pending | 2 | | 3 | cancelled | 1 |
7. JSON 聚合函数
JSON_AGG - 创建 JSON 数组
sql
-- 为每个客户创建订单详情的 JSON 数组
SELECT
customer_id,
JSON_AGG(
JSON_BUILD_OBJECT(
'order_id', order_id,
'date', order_date,
'amount', total_amount,
'status', status
) ORDER BY order_date
) as order_details
FROM orders
GROUP BY customer_id;
JSONB_OBJECT_AGG - 创建 JSON 对象
sql
-- 创建订单ID到金额的映射对象
SELECT
customer_id,
JSONB_OBJECT_AGG(order_id::text, total_amount) as order_amounts_map
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY customer_id;
8. 布尔聚合函数
业务场景:客户订单状态检查
sql
-- 检查客户是否有完成的订单和是否有取消的订单
SELECT
customer_id,
BOOL_OR(status = 'completed') as has_completed_orders,
BOOL_AND(status != 'cancelled') as no_cancelled_orders,
COUNT(*) as total_orders
FROM orders
GROUP BY customer_id;
预期输出: | customer_id | has_completed_orders | no_cancelled_orders | total_orders | |-------------|---------------------|---------------------|--------------| | 1 | true | true | 2 | | 2 | true | true | 2 | | 3 | false | false | 1 |
统计聚合函数
统计聚合函数主要用于数据分析和统计计算,在商业智能和数据科学应用中非常重要。
标准差函数 (STDDEV)
业务场景:订单金额波动性分析
sql
-- 分析订单金额的波动性
SELECT
ROUND(STDDEV(total_amount), 2) as amount_stddev,
ROUND(AVG(total_amount), 2) as avg_amount,
ROUND(STDDEV(total_amount) / AVG(total_amount) * 100, 2) as coefficient_variation
FROM orders
WHERE total_amount IS NOT NULL;
相关性分析函数 (CORR)
业务场景:分析客户 ID 与订单金额的相关性
sql
-- 假设我们有更多的客户订单数据
WITH customer_stats AS (
SELECT
customer_id,
AVG(total_amount) as avg_amount,
COUNT(*) as order_frequency
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY customer_id
)
SELECT
ROUND(CORR(customer_id, avg_amount), 4) as id_amount_correlation,
ROUND(CORR(customer_id, order_frequency), 4) as id_frequency_correlation
FROM customer_stats;
协方差函数 (COVAR_POP, COVAR_SAMP)
sql
-- 计算客户ID与订单金额的协方差
WITH customer_stats AS (
SELECT
customer_id,
total_amount
FROM orders
WHERE total_amount IS NOT NULL
)
SELECT
ROUND(COVAR_POP(customer_id, total_amount), 2) as population_covariance,
ROUND(COVAR_SAMP(customer_id, total_amount), 2) as sample_covariance
FROM customer_stats;
位运算聚合函数
位运算聚合函数在权限管理、标志位处理等场景中非常有用。
业务场景:权限系统
sql
-- 创建用户权限表
CREATE TABLE user_permissions (
user_id INTEGER,
permission_flags INTEGER -- 使用位标志存储权限
);
-- 插入权限数据 (1=读取, 2=写入, 4=删除, 8=管理)
INSERT INTO user_permissions VALUES
(1, 1), (1, 2), (1, 4), -- 用户1有读、写、删权限
(2, 1), (2, 2), -- 用户2有读、写权限
(3, 1), (3, 8); -- 用户3有读、管理权限
-- 计算每个用户的综合权限
SELECT
user_id,
BIT_OR(permission_flags) as combined_permissions,
-- 检查具体权限
(BIT_OR(permission_flags) & 1) > 0 as can_read,
(BIT_OR(permission_flags) & 2) > 0 as can_write,
(BIT_OR(permission_flags) & 4) > 0 as can_delete,
(BIT_OR(permission_flags) & 8) > 0 as can_manage
FROM user_permissions
GROUP BY user_id;
窗口函数与聚合函数结合
业务场景:订单趋势分析
sql
-- 分析订单金额的累计趋势和移动平均
SELECT
order_date,
total_amount,
-- 累计销售额
SUM(total_amount) OVER (ORDER BY order_date) as cumulative_sales,
-- 3日移动平均
ROUND(AVG(total_amount) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2) as moving_avg_3days,
-- 排名
RANK() OVER (ORDER BY total_amount DESC) as amount_rank
FROM orders
WHERE total_amount IS NOT NULL
ORDER BY order_date;
高级聚合应用
1. 条件聚合
业务场景:多状态统计
sql
-- 使用条件聚合统计不同状态的订单
SELECT
customer_id,
COUNT(*) as total_orders,
COUNT(*) FILTER (WHERE status = 'completed') as completed_orders,
COUNT(*) FILTER (WHERE status = 'pending') as pending_orders,
COUNT(*) FILTER (WHERE status = 'cancelled') as cancelled_orders,
-- 计算完成率
ROUND(
COUNT(*) FILTER (WHERE status = 'completed') * 100.0 / COUNT(*),
2
) as completion_rate
FROM orders
GROUP BY customer_id;
2. 嵌套聚合
业务场景:客户分层分析
sql
-- 分析客户消费水平分布
WITH customer_totals AS (
SELECT
customer_id,
SUM(total_amount) as total_spent,
COUNT(*) as order_count
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY customer_id
),
customer_tiers AS (
SELECT
customer_id,
total_spent,
order_count,
CASE
WHEN total_spent >= 400 THEN 'VIP'
WHEN total_spent >= 200 THEN 'Premium'
ELSE 'Regular'
END as tier
FROM customer_totals
)
SELECT
tier,
COUNT(*) as customer_count,
ROUND(AVG(total_spent), 2) as avg_spent_per_tier,
ROUND(AVG(order_count), 2) as avg_orders_per_tier
FROM customer_tiers
GROUP BY tier
ORDER BY avg_spent_per_tier DESC;
性能优化策略
1. 索引优化
sql
-- 为聚合查询创建合适的索引
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_orders_status_amount ON orders(status, total_amount)
WHERE total_amount IS NOT NULL;
2. 部分聚合利用
INFO
部分模式聚合支持部分模式的聚合函数可以利用并行处理来提高性能。以下函数支持部分模式:
COUNT
,SUM
,AVG
MIN
,MAX
ARRAY_AGG
,STRING_AGG
- 所有位运算聚合函数
- 所有布尔聚合函数
- 所有统计聚合函数
3. 聚合查询优化技巧
sql
-- 使用 HAVING 而不是 WHERE 过滤聚合结果
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total_amount) as total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1 AND SUM(total_amount) > 200;
-- 使用子查询优化复杂聚合
WITH monthly_stats AS (
SELECT
EXTRACT(MONTH FROM order_date) as month,
COUNT(*) as monthly_orders,
SUM(total_amount) as monthly_sales
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY EXTRACT(MONTH FROM order_date)
)
SELECT
AVG(monthly_orders) as avg_monthly_orders,
AVG(monthly_sales) as avg_monthly_sales
FROM monthly_stats;
常见陷阱和注意事项
1. NULL 值处理
WARNING
NULL 值陷阱
- 除了
COUNT(*)
外,大多数聚合函数忽略 NULL 值 - 当没有输入行时,聚合函数返回 NULL(COUNT 除外返回 0)
- 使用
COALESCE
处理空结果
sql
-- 安全的聚合查询
SELECT
customer_id,
COALESCE(SUM(total_amount), 0) as total_spent,
COALESCE(AVG(total_amount), 0) as avg_amount,
COUNT(CASE WHEN total_amount IS NOT NULL THEN 1 END) as valid_orders
FROM orders
GROUP BY customer_id;
2. 数据类型精度
sql
-- 注意数值精度问题
SELECT
-- 可能丢失精度
AVG(total_amount) as avg_simple,
-- 保持精度
SUM(total_amount) / COUNT(total_amount) as avg_precise,
-- 四舍五入到指定位数
ROUND(AVG(total_amount), 2) as avg_rounded
FROM orders;
3. 排序对结果的影响
sql
-- 对于顺序敏感的聚合函数,必须使用 ORDER BY
SELECT
customer_id,
-- 结果可能不确定
STRING_AGG(status, ', ') as status_unordered,
-- 结果确定
STRING_AGG(status, ', ' ORDER BY order_date) as status_ordered
FROM orders
GROUP BY customer_id;
实际项目应用模式
1. 报表生成模式
sql
-- 综合销售报表
WITH daily_stats AS (
SELECT
order_date,
COUNT(*) as daily_orders,
SUM(total_amount) as daily_sales,
AVG(total_amount) as daily_avg,
COUNT(DISTINCT customer_id) as daily_customers
FROM orders
WHERE total_amount IS NOT NULL
GROUP BY order_date
)
SELECT
order_date,
daily_orders,
daily_sales,
ROUND(daily_avg, 2) as daily_avg,
daily_customers,
-- 累计指标
SUM(daily_orders) OVER (ORDER BY order_date) as cumulative_orders,
SUM(daily_sales) OVER (ORDER BY order_date) as cumulative_sales,
-- 移动平均
ROUND(AVG(daily_sales) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2) as weekly_moving_avg
FROM daily_stats
ORDER BY order_date;
2. 数据质量检查模式
sql
-- 数据质量分析
SELECT
'orders' as table_name,
COUNT(*) as total_rows,
COUNT(customer_id) as non_null_customer_id,
COUNT(total_amount) as non_null_amount,
COUNT(*) - COUNT(total_amount) as null_amounts,
ROUND(
(COUNT(*) - COUNT(total_amount)) * 100.0 / COUNT(*),
2
) as null_percentage,
MIN(order_date) as earliest_order,
MAX(order_date) as latest_order,
COUNT(DISTINCT customer_id) as unique_customers
FROM orders;
3. 异常检测模式
sql
-- 异常订单检测
WITH stats AS (
SELECT
AVG(total_amount) as mean_amount,
STDDEV(total_amount) as stddev_amount
FROM orders
WHERE total_amount IS NOT NULL
),
outliers AS (
SELECT
o.*,
ABS(o.total_amount - s.mean_amount) / s.stddev_amount as z_score
FROM orders o
CROSS JOIN stats s
WHERE o.total_amount IS NOT NULL
)
SELECT
COUNT(*) FILTER (WHERE z_score > 2) as moderate_outliers,
COUNT(*) FILTER (WHERE z_score > 3) as extreme_outliers,
ARRAY_AGG(order_id ORDER BY z_score DESC) FILTER (WHERE z_score > 3) as extreme_outlier_ids
FROM outliers;
通过以上全面的学习和实践,您将能够熟练掌握 PostgreSQL 聚合函数的使用,在实际项目中灵活运用这些强大的数据处理工具。
在生产环境中使用聚合函数时,始终考虑性能影响,合理使用索引,并注意处理 NULL 值和数据类型精度问题。