我有一些包含日期的数据.我正在尝试按连续日期对数据进行分组,但是,日期并不是完全连续的.这是一个例子:
DateColumn | Value
------------------------+-------
2017-01-18 01:12:34.107 | 215426 <- batch no. 1
2017-01-18 01:12:34.113 | 215636
2017-01-18 01:12:34.623 | 123516
2017-01-18 01:12:34.633 | 289926
2017-01-18 04:58:42.660 | 259063 <- batch no. 2
2017-01-18 04:58:42.663 | 261830
2017-01-18 04:58:42.893 | 219835
2017-01-18 04:58:42.907 | 250165
2017-01-18 05:18:14.660 | 134253 <- batch no. 3
2017-01-18 05:18:14.663 | 134257
2017-01-18 05:18:14.667 | 134372
2017-01-18 05:18:15.040 | 181679
2017-01-18 05:18:15.043 | 226368
2017-01-18 05:18:15.043 | 227070
数据是批量生成的,批处理中的每一行都需要几毫秒才能生成.我正在尝试将结果分组如下:
Date1 | Date2 | Count
------------------------+-------------------------+------
2017-01-18 01:12:34.107 | 2017-01-18 01:12:34.633 | 4
2017-01-18 04:58:42.660 | 2017-01-18 04:58:42.907 | 4
2017-01-18 05:18:14.660 | 2017-01-18 05:18:15.043 | 6
可以安全地假设,如果两个连续的行间隔超过1分钟,那么它们属于不同的批次.
我尝试了涉及ROW_NUMBER
功能的解决方案,但它们使用连续日期(两行之间的日期差异是固定的).当差异模糊时,我怎样才能达到预期的效果?
请注意,批次可能会超过一分钟.例如,批次可能包含从2017-01-01 00:00:00开始到2017-01-01 00:05:00结束的行,包括~3000行,每行几十或几百毫秒.可以肯定的是,批次间隔至少1分钟.
试试这个:
select min(t.dateColumn) date1, max(t.dateColumn) date2, count(*) from ( select t.*, sum(val) over ( order by t.dateColumn ) grp from ( select t.*, case when datediff(ms, lag(t.dateColumn, 1, t.dateColumn) over ( order by t.dateColumn ), t.dateColumn) > 60000 then 1 else 0 end val from your_table t ) t ) t group by grp;
生产:
使用分析函数lag()
根据与上一批的差异标记下一批的开始,datecolumn
然后使用分析sum()
创建批次组,然后按其分组以查找所需的聚合.
由于四舍五入问题,可能会在小组中出现一些错误分类DATETIME
.来自MSDN
,
datetime值四舍五入为.000,.003或.007秒的增量,如下表所示.
这是使用CTE重写的相同查询:
WITH cte1(DateColumn, ValueColumn) AS ( -- Insert your query that returns a datetime column and any other column SELECT SomeDate, SomeValue FROM SomeTable WHERE SomeColumn IS NOT NULL ), cte2 AS ( -- This query adds a column called "val" that contains -- 1 when current row date - previous row date > 1 minute -- 0 otherwise SELECT cte1.*, CASE WHEN DATEDIFF(MS, LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn), DateColumn) > 60000 THEN 1 ELSE 0 END AS val FROM cte1 ), cte3 AS ( -- This query adds a column called "grp" that numbers -- the groups using running sum over the "val" column SELECT cte2.*, SUM(val) OVER (ORDER BY DateColumn) AS grp FROM cte2 ) SELECT MIN(DateColumn) Date1, MAX(DateColumn) Date2, COUNT(ValueColumn) [Count] FROM cte3 GROUP BY grp