PostgreSQL | DISTINCT ON解析
Jun 22, 2025
基本概念
DISTINCT ON 是PostgreSQL特有的去重功能,允许在保留每组中最早/最晚/特定记录时去重。
语法解析
SELECT DISTINCT ON (column1, column2, ...)
column1,
column2,
...
FROM table
ORDER BY column1, column2, ..., additional_sorting_column;
详细示例和解释
1. 基础示例
-- 原始数据
player_id | event_date | other_column
----------|--------------|--------------
1 | 2023-01-01 | A
1 | 2023-01-02 | B
1 | 2023-01-03 | C
2 | 2024-01-01 | X
2 | 2024-01-02 | Y
-- DISTINCT ON 查询
SELECT DISTINCT ON (player_id)
player_id,
event_date,
other_column
FROM activity
ORDER BY player_id, event_date;
-- 结果
player_id | event_date | other_column
----------|--------------|--------------
1 | 2023-01-01 | A
2 | 2024-01-01 | X
2. 执行流程详解
步骤分析
-
DISTINCT ON (player_id)
- 按player_id分组
- 每组只保留一行
-
ORDER BY player_id, event_date
- 先按player_id排序
- 再按event_date排序
- 决定每组保留的具体行
3. 复杂示例
-- 查找每个玩家最早的登录记录
SELECT DISTINCT ON (player_id)
player_id,
event_date AS first_login,
server_id,
login_ip
FROM activity
ORDER BY player_id, event_date ASC, server_id;
4. 变体用法
a. 多列DISTINCT ON
-- 按国家和城市去重
SELECT DISTINCT ON (country, city)
country,
city,
population
FROM cities
ORDER BY country, city, population DESC;
b. 获取每组最大值
-- 获取每个类别最高分数的记录
SELECT DISTINCT ON (category)
category,
product_name,
price
FROM products
ORDER BY category, price DESC;
性能和实现
内部执行机制
1. 排序 (ORDER BY)
2. 去重 (DISTINCT ON)
3. 选择每组第一行
性能对比
-- 方法1: DISTINCT ON (PostgreSQL)
SELECT DISTINCT ON (player_id)
player_id,
event_date
FROM activity
ORDER BY player_id, event_date;
-- 方法2: 子查询
SELECT player_id, MIN(event_date) AS event_date
FROM activity
GROUP BY player_id;
-- 方法3: 窗口函数
SELECT player_id, event_date
FROM (
SELECT
player_id,
event_date,
ROW_NUMBER OVER (
PARTITION BY player_id
ORDER BY event_date
) AS rn
FROM activity
) ranked
WHERE rn = 1;
性能优化
-- 创建复合索引
ON activity(player_id, event_date);
高级用法
1. 复杂条件选择
-- 选择每个玩家最早的高分登录记录
SELECT DISTINCT ON (player_id)
player_id,
event_date,
score
FROM activity
WHERE score > 100
ORDER BY player_id, event_date, score DESC;
2. 多维度去重
-- 按年份和月份去重
SELECT DISTINCT ON (
EXTRACT(YEAR FROM event_date),
EXTRACT(MONTH FROM event_date)
)
EXTRACT(YEAR FROM event_date) AS year,
EXTRACT(MONTH FROM event_date) AS month,
AVG(score) AS avg_score
FROM activity
ORDER BY
EXTRACT(YEAR FROM event_date),
EXTRACT(MONTH FROM event_date),
AVG(score) DESC;
常见使用场景
1. 数据去重
-- 保留最新的用户信息
SELECT DISTINCT ON (email)
email,
name,
last_login
FROM users
ORDER BY email, last_login DESC;
2. 时间序列分析
-- 每个产品最后一次销售
SELECT DISTINCT ON (product_id)
product_id,
sale_date,
total_amount
FROM sales
ORDER BY product_id, sale_date DESC;
3. 日志分析
-- 每个IP最后一次访问
SELECT DISTINCT ON (client_ip)
client_ip,
request_time,
request_path
FROM access_logs
ORDER BY client_ip, request_time DESC;
注意事项
优点
- 简洁明了
- 性能高效
- PostgreSQL特有功能
局限性
- 仅PostgreSQL支持
- 需要明确的排序规则
- 复杂场景可能需要子查询
最佳实践
- 明确排序列
- 考虑性能
- 创建合适索引
- 理解选择逻辑
推荐场景
- 去重
- 获取最新/最早记录
- 简单的分组选择
DISTINCT ON 是PostgreSQL中非常强大且灵活的去重工具,理解其工作原理可以显著简化查询逻辑。
- Claude Sonnet 4