获得每组的前1行-Java 学习之路

403

我有一张 table ，我希望得到每组的最新条目 . 这是表格：

DocumentStatusLogs 表

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

该表将按 DocumentID 分组，并按 DateCreated 降序排序 . 对于每个 DocumentID ，我想获得最新状态 .

我的首选输出：

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

是否有任何聚合函数只能从每个组获得顶部？请参阅下面的伪代码 GetOnlyTheTop ：

SELECT
  DocumentID,
  GetOnlyTheTop(Status),
  GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC

如果这样的功能不存在，有什么办法可以实现我想要的输出吗？
或者首先，这可能是由非标准化数据库引起的吗？我在'm thinking, since what I'找的只是一行，那个 status 也应该位于父表中吗？

有关更多信息，请参阅父表：

目前 Documents 表

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

父表是否应该像这样，以便我可以轻松访问其状态？

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

UPDATE 我刚学会了如何使用"apply"，这样可以更容易地解决这些问题 .

16 回答

SELECT doc_id,status,date_created FROM (
SELECT a.*,Row_Number() OVER(PARTITION BY doc_id ORDER BY date_created DESC ) AS rnk FROM doc a)
WHERE rnk=1;

回复于 2024-05-05T23:17:28+08:00

19
我已经对这里的各种建议做了一些定时，结果实际上取决于所涉及的表的大小，但最一致的解决方案是使用CROSS APPLY这些测试是针对SQL Server 2008-R2运行的，使用的是6,500条记录，另一条（相同的架构），有1.37亿条记录 . 被查询的列是表上主键的一部分，表宽度非常小（约30个字节） . SQL Server根据实际执行计划报告时间 .
```
Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1
```
我认为真正令人惊奇的是CROSS APPLY的时间是多么一致，无论涉及的行数如何 .
回复于 2024-05-05T23:17:28+08:00
0
我知道这是一个旧线程，但是 TOP 1 WITH TIES 解决方案非常好，可能对某些阅读解决方案有所帮助 .
```
select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
```
有关TOP子句的更多信息，请参见here .
回复于 2024-05-05T23:17:28+08:00
5
在SQLite中检查您可以对GROUP BY使用以下简单查询
```
SELECT MAX(DateCreated), *
FROM DocumentStatusLogs
GROUP BY DocumentID
```
这里MAX有助于获得每组最大DateCreated FROM .

但似乎MYSQL没有将* -columns与max DateCreated的值相关联:(
回复于 2024-05-05T23:17:28+08:00

-2

我刚刚学会了如何使用 cross apply . 以下是在此方案中使用它的方法：

select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds

回复于 2024-05-05T23:17:28+08:00

592

试试这个：

SELECT [DocumentID], 
        [tmpRez].value('/x[2]','varchar(20)') as [Status],
 [tmpRez].value('/x[3]','datetime') as [DateCreated] 
FROM (
        SELECT [DocumentID],
    cast('<x>'+max(cast([ID] as varchar(10))+'</x><x>'+[Status]+'</x><x>'
    +cast([DateCreated] as varchar(20)))+'</x>' as XML) as [tmpRez]
        FROM DocumentStatusLogs
        GROUP by DocumentID) as [tmpQry]

回复于 2024-05-05T23:17:28+08:00

SELECT o.*
FROM `DocumentStatusLogs` o                   
  LEFT JOIN `DocumentStatusLogs` b                   
  ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
 WHERE b.DocumentID is NULL ;

如果您只想通过DateCreated返回最近的文档订单，它将仅返回DocumentID的前1个文档

回复于 2024-05-05T23:17:28+08:00

-1
这是一个相当古老的线索，但是我认为我会把我的两分钱扔掉，因为接受的答案对我来说并不是特别好 . 我在一个大型数据集上尝试了gbn的解决方案，发现它非常慢（在SQL Server 2012中500万条记录上超过45秒） . 看一下执行计划，很明显问题是它需要一个SORT操作，这会大大减慢速度 .

这是我从实体框架中解除的一种替代方案，它不需要SORT操作并进行非聚集索引搜索 . 这将上述记录集的执行时间减少到<2秒 .
```
SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]
```
现在我在原始问题中完全指定了'm assuming something that isn'，但如果你的表设计是你的ID列是一个自动增量ID，并且DateCreated设置为每个插入的当前日期，那么即使没有运行我的查询上面你实际上可以从 ordering on ID instead of ordering on DateCreated 获得相当大的性能提升gbn的解决方案（大约一半的执行时间），因为这将提供相同的排序顺序，并且它的排序更快 .
回复于 2024-05-05T23:17:28+08:00

从上面验证Clint的真棒和正确答案：

下面两个查询之间的表现很有趣 . 52％是最重要的一个 . 48％是第二个 . 使用DISTINCT而不是ORDER BY将性能提高4％ . 但是ORDER BY具有按多列排序的优势 .

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

选项1：

SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

选项2：

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

M $的Management Studio：突出显示并运行第一个块后，突出显示选项1和选项2，右键单击 - > [显示估计执行计划] . 然后运行整个过程以查看结果 .

选项1结果：

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

选项2结果：

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

注意：

当我希望连接为1对1（多个）时，我倾向于使用APPLY . 如果我希望连接是1对多或多对多，我使用JOIN . 我避免CTE与ROW_NUMBER（），除非我需要做一些先进的事情，并且对窗口性能的惩罚是好的 .

我还避免在WHERE或ON子句中使用EXISTS / IN子查询，因为我经历过这会导致一些糟糕的执行计划 . 但里程各不相同 . 在需要的地方和时间检查执行计划和配置文件性能！

回复于 2024-05-05T23:17:28+08:00

在您要避免使用row_count（）的情况下，您还可以使用左连接：

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
left join DocumentStatusLogs filter 
    ON ds.DocumentID = filter.DocumentID
    -- Match any row that has another row that was created after it.
    AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched 
where filter.DocumentID is null

对于示例模式，您还可以使用“not in subquery”，它通常编译为与左连接相同的输出：

select ds.DocumentID, ds.Status, ds.DateCreated 
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
    SELECT filter.ID 
    FROM DocumentStatusLogs filter
    WHERE ds.DocumentID = filter.DocumentID
        AND ds.DateCreated < filter.DateCreated)

注意，如果表没有至少一个单列唯一键/约束/索引，则子查询模式将不起作用，在这种情况下主键“Id” .

这两个查询往往比row_count（）查询（由查询分析器测量）更“昂贵” . 但是，您可能会遇到更快返回结果或启用其他优化的情况 .

回复于 2024-05-05T23:17:28+08:00

我的代码从每个组中选择前1名

select a.* from #DocumentStatusLogs a where 
 datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where 
a.documentid = b.documentid
order by datecreated desc
)

回复于 2024-05-05T23:17:28+08:00

3
```
SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)
```
什么数据库服务器此代码不适用于所有这些代码 .

关于你问题的后半部分，将状态列为专栏似乎是合理的 . 您可以将 DocumentStatusLogs 保留为日志，但仍会将最新信息存储在主表中 .

顺便说一句，如果你已经在Documents表中有 DateCreated 列，你可以使用它加入 DocumentStatusLogs （只要 DocumentStatusLogs 在 DocumentStatusLogs 中是唯一的） .

编辑：MsSQL不支持USING，因此将其更改为：
```
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
```
回复于 2024-05-05T23:17:28+08:00
25
```
;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
```
如果您预计每天有2个条目，那么这将随意选择一个 . 要获得一天的两个条目，请改用DENSE_RANK

至于规范化与否，取决于你是否要：
- 维持2个地方的状态
- 保存状态历史记录
- ......
就目前而言，您可以保留状态历史记录 . 如果你想要父表中的最新状态（这是非规范化），你需要一个触发器来维持父状态中的“状态” . 或删除此状态历史记录表 .
回复于 2024-05-05T23:17:28+08:00
0
如果您担心性能，也可以使用MAX（）执行此操作：
```
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)
```
ROW_NUMBER（）需要SELECT语句中的所有行，而MAX则不需要 . 应该大大加快您的查询速度 .
回复于 2024-05-05T23:17:28+08:00

123

这是我能想到的最普通的TSQL

SELECT * FROM DocumentStatusLogs D1 JOIN
    (
      SELECT
        DocumentID,MAX(DateCreated) AS MaxDate
      FROM
        DocumentStatusLogs
      GROUP BY
        DocumentID
    ) D2
    ON
      D2.DocumentID=D1.DocumentID
    AND
      D2.MaxDate=D1.DateCreated

回复于 2024-05-05T23:17:28+08:00

0
这是关于这个主题最容易找到的问题之一，所以我想给它一个现代的答案（供我参考和帮助其他人） . 通过使用over和first值，您可以简单地完成上述查询：
```
select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
```
这应该在sql server 2008及以上版本中运行 . 第一个值可以被认为是在使用over子句时完成选择top 1的一种方法 . Over允许在选择列表中进行分组，因此不是编写嵌套子查询（就像许多现有的答案一样），而是以更易读的方式进行分组 . 希望这可以帮助 .
回复于 2024-05-05T23:17:28+08:00

获得每组的前1行

16 回答

相关问题