PostgreSQL | DISTINCT ON解析

Jun 22, 2025

基本概念

DISTINCT ON 是PostgreSQL特有的去重功能,允许在保留每组中最早/最晚/特定记录时去重。

语法解析

SELECT DISTINCT ON (column1, column2, ...) 
    column1, 
    column2, 
    ...
FROM table
ORDER BY column1, column2, ..., additional_sorting_column;

详细示例和解释

1. 基础示例

-- 原始数据
player_id | event_date  | other_column
----------|--------------|--------------
1         | 2023-01-01  | A
1         | 2023-01-02  | B
1         | 2023-01-03  | C
2         | 2024-01-01  | X
2         | 2024-01-02  | Y

-- DISTINCT ON 查询
SELECT DISTINCT ON (player_id) 
    player_id, 
    event_date,
    other_column
FROM activity
ORDER BY player_id, event_date;

-- 结果
player_id | event_date  | other_column
----------|--------------|--------------
1         | 2023-01-01  | A  
2         | 2024-01-01  | X

2. 执行流程详解

步骤分析

  1. DISTINCT ON (player_id)

    • 按player_id分组
    • 每组只保留一行
  2. ORDER BY player_id, event_date

    • 先按player_id排序
    • 再按event_date排序
    • 决定每组保留的具体行

3. 复杂示例

-- 查找每个玩家最早的登录记录
SELECT DISTINCT ON (player_id) 
    player_id, 
    event_date AS first_login,
    server_id,
    login_ip
FROM activity
ORDER BY player_id, event_date ASC, server_id;

4. 变体用法

a. 多列DISTINCT ON

-- 按国家和城市去重
SELECT DISTINCT ON (country, city)
    country, 
    city, 
    population
FROM cities
ORDER BY country, city, population DESC;

b. 获取每组最大值

-- 获取每个类别最高分数的记录
SELECT DISTINCT ON (category)
    category,
    product_name,
    price
FROM products
ORDER BY category, price DESC;

性能和实现

内部执行机制

1. 排序 (ORDER BY)
2. 去重 (DISTINCT ON)
3. 选择每组第一行

性能对比

-- 方法1: DISTINCT ON (PostgreSQL)
SELECT DISTINCT ON (player_id) 
    player_id, 
    event_date
FROM activity
ORDER BY player_id, event_date;

-- 方法2: 子查询 
SELECT player_id, MIN(event_date) AS event_date
FROM activity
GROUP BY player_id;

-- 方法3: 窗口函数
SELECT player_id, event_date
FROM (
    SELECT 
        player_id, 
        event_date,
        ROW_NUMBER() OVER (
            PARTITION BY player_id 
            ORDER BY event_date
        ) AS rn
    FROM activity
) ranked
WHERE rn = 1;

性能优化

-- 创建复合索引
CREATE INDEX idx_player_date 
ON activity(player_id, event_date);

高级用法

1. 复杂条件选择

-- 选择每个玩家最早的高分登录记录
SELECT DISTINCT ON (player_id)
    player_id,
    event_date,
    score
FROM activity
WHERE score > 100
ORDER BY player_id, event_date, score DESC;

2. 多维度去重

-- 按年份和月份去重
SELECT DISTINCT ON (
    EXTRACT(YEAR FROM event_date), 
    EXTRACT(MONTH FROM event_date)
)
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(MONTH FROM event_date) AS month,
    AVG(score) AS avg_score
FROM activity
ORDER BY 
    EXTRACT(YEAR FROM event_date), 
    EXTRACT(MONTH FROM event_date),
    AVG(score) DESC;

常见使用场景

1. 数据去重

-- 保留最新的用户信息
SELECT DISTINCT ON (email)
    email,
    name,
    last_login
FROM users
ORDER BY email, last_login DESC;

2. 时间序列分析

-- 每个产品最后一次销售
SELECT DISTINCT ON (product_id)
    product_id,
    sale_date,
    total_amount
FROM sales
ORDER BY product_id, sale_date DESC;

3. 日志分析

-- 每个IP最后一次访问
SELECT DISTINCT ON (client_ip)
    client_ip,
    request_time,
    request_path
FROM access_logs
ORDER BY client_ip, request_time DESC;

注意事项

优点

  • 简洁明了
  • 性能高效
  • PostgreSQL特有功能

局限性

  • 仅PostgreSQL支持
  • 需要明确的排序规则
  • 复杂场景可能需要子查询

最佳实践

  1. 明确排序列
  2. 考虑性能
  3. 创建合适索引
  4. 理解选择逻辑

推荐场景

  • 去重
  • 获取最新/最早记录
  • 简单的分组选择

DISTINCT ON 是PostgreSQL中非常强大且灵活的去重工具,理解其工作原理可以显著简化查询逻辑。


  • Claude Sonnet 4
https://inasa.dev/posts/rss.xml