At a previous job we had to compare item x with item x-1 for a lot of data (~ a billion rows). As this was done on SQL Server 2008 R2, we had to use a self-join. It was slow.
I thought I'd experiment with the LAG function; this would be immensely valuable if fast. I found it 2 to 3 times faster, but since it should be a trivial operation under the hood, and since its query plan/table access is much simpler/drastically reduced, I am very disappointed. Code to reproduce is below.
Create the database:
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'TestDBForLag')
    DROP DATABASE TestDBForLag
GO

create database TestDBForLag
ALTER DATABASE TestDBForLag SET RECOVERY SIMPLE
go

use TestDBForLag
go

set nocount on

create table big (g binary(16) not null)
go

begin transaction
declare @c int = 0
while @c < 100 begin
    insert into big(g) values(cast(newid() as binary(16)))
    set @c += 1
end
commit
go 10000 -- n repeats of last batch, "big" now has 1,000,000 rows

alter table big add constraint clustered_PK primary key clustered (g)
Queries:
set statistics time on
set statistics io on

-- new style
select g, lag(g, 1) over (order by g) as xx
from big
order by g

-- old style
select obig.g,
       (select max(g)
        from big as ibig
        where ibig.g < obig.g) as xx
from big as obig
order by g
You can check the actual/estimated query plans yourself, but here are the results of the statistics (the queries were run twice to discount compilation time):
(1000000 row(s) affected)
Table 'Worktable'. {edit: everything zero here}
**Table 'big'. Scan count 1, logical reads 3109**, {edit: everything else is zero here}

SQL Server Execution Times:
   CPU time = 1045 ms,  elapsed time = 3516 ms.

---

(1000000 row(s) affected)
**Table 'big'. Scan count 1000001, logical reads 3190609**, {edit: everything else is zero here}

SQL Server Execution Times:
   CPU time = 2683 ms,  elapsed time = 3439 ms.
So LAG takes 1 scan + 3,109 reads and needs ~1 second of CPU time, while the convoluted self-join that has to repeatedly walk the btree takes 1,000,001 scans + 3.2 million reads and needs ~2.7 seconds.

I can see no reason whatsoever for this dire performance. Any ideas?
This runs on a ThinkServer 140 with 8 GB RAM (so everything is memory-resident), dual core, no disk contention. I'm satisfied that the time to transfer the result sets to the client, which runs on the same machine, is negligible.
select @@version
returns:
Microsoft SQL Server 2014 - 12.0.4213.0 (X64) Developer Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1)
EDIT:
Following @vnov's comment: I did carefully discount client overhead before I posted. I look at the CPU times, not the overall times. Test:
select * from big

Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 125 ms, elapsed time = 2840 ms.

select count(*) from big

Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 109 ms, elapsed time = 129 ms.
LAG just should not add anything significant AFAICS, let alone an order of magnitude.
EDIT2:
@Frisbee asked why I consider LAG poor. Basically, the algorithm is: remember a previous value and deliver it n rows later; for n = 1 it is even more trivial. So I knocked up some code using a cursor, with and without a homemade lag, and measured. I also simply summed the results so that, per vnov's point, no huge result set is returned. The cursors and the selects both give the same results: sumg = 127539666, sumglag = 127539460. The code uses the same DB + table created above.
The select versions:
select sum(cast(g as tinyint)) as sumg
from (
    select g from big
) as xx

select sum(cast(g as tinyint)) as sumg,
       sum(cast(glag as tinyint)) as sumglag
from (
    select g, lag(g, 1) over (order by g) as glag
    from big
) as xx
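(As an aside: SUM ignores the NULL that LAG produces for the first row. If you want the set-based version to mirror the cursor code below, which initialises its lag variable to zero, LAG's optional third argument supplies an explicit default for the first row. A minimal sketch; the 16-byte zero literal is just my choice of a neutral default:)

-- same lagging sum, but the first row's missing predecessor is an
-- explicit binary zero instead of NULL; the totals are identical either way
select sum(cast(g as tinyint)) as sumg,
       sum(cast(glag as tinyint)) as sumglag
from (
    select g,
           lag(g, 1, 0x00000000000000000000000000000000) over (order by g) as glag
    from big
) as xx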
I didn't measure in bulk, but by observation the plain select versus the lagging select was fairly consistent at ~360-400 ms versus ~1700-1900 ms, so 4 or 5 times slower.
For the cursors, the top one emulates the plain select and the bottom one emulates the select with lag:
---------- nonlagging batch --------------
use TestDBForLag
set nocount on

DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
    select g from big order by g

DECLARE @g binary(16), @sumg int = 0

OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
    IF (@@fetch_status <> -2)
    BEGIN
        set @sumg += cast(@g as tinyint)
    END
    FETCH NEXT FROM crsr INTO @g
END

CLOSE crsr
DEALLOCATE crsr

select @sumg as sumg
go 300

---------- lagging batch --------------
use TestDBForLag
set nocount on

DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
    select g from big order by g

DECLARE @g binary(16), @sumg int = 0
DECLARE @glag binary(16) = 0, @sumglag int = 0

OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
    IF (@@fetch_status <> -2)
    BEGIN
        set @sumg += cast(@g as tinyint)
        set @sumglag += cast(@glag as tinyint)  -- the only ...
        set @glag = @g                          -- ... differences
    END
    FETCH NEXT FROM crsr INTO @g
END

CLOSE crsr
DEALLOCATE crsr

select @sumg as sumg, @sumglag as sumglag
go 300
I ran the above through SQL Profiler (removing the SQL:BatchStarting events), which took about 2.5 hours for me, saved the trace as a table named 'trace', then ran this to get the average durations:
-- trace saves duration as microseconds,
-- divide by 1000 to get back to milliseconds
select cast(textdata as varchar(8000)) as textdata,
       avg(duration/1000) as avg_duration_ms
from trace
group by cast(textdata as varchar(8000))
For me, the non-lagging cursor took 13.65 s on average and the cursor emulating lag took 16.04 s. Most of the extra time of the latter is the overhead of the interpreter dealing with the extra statements (I'd expect it to be far less if implemented in C), but in any case the extra work of computing the lag is less than 20%: (16.04 - 13.65) / 13.65 ≈ 17.5%.
So: does that explanation sound reasonable, and can anyone suggest why LAG performs so badly in a select statement?
Check the execution plans of the two variants and you'll see what's going on. I use the free version of SQL Sentry Plan Explorer for this.
I'm comparing these three queries (plus one more with OUTER APPLY):
select count(*)
from big;

-- new style
select g, lag(g) over (order by g) as xx
from big
order by g;

-- old style
select obig.g,
       (select max(g)
        from big as ibig
        where ibig.g < obig.g) as xx
from big as obig
order by g;
1) LAG is implemented with a Window Spool, which feeds twice the number of rows (1,999,999) through a temporary worktable (in this case it is in memory, but still). The Window Spool does not cache all 1,000,000 rows in the worktable; it caches only the window size.
The Window Spool operator expands each row into the set of rows that represents the window associated with it.
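If you don't have Plan Explorer to hand, a minimal sketch for surfacing those per-operator row counts yourself is the runtime plan statistics (this re-runs the full query, so expect the same multi-second timing):

-- the per-operator statistics appear as an extra result set;
-- the Window Spool line's Rows column should show roughly twice
-- the table's 1,000,000 rows
set statistics profile on;

select g, lag(g) over (order by g) as xx
from big
order by g;

set statistics profile off;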
There are quite a few other, less heavy operators in the plan as well. The point here is that LAG is not implemented the way it is in your cursor test.
2) The plan for the old-style query is pretty good. The optimizer is smart enough to scan the table once and, for each row, do an index seek with a TOP to calculate the MAX. Yes, that is a million seeks, but everything is in memory, so it is relatively fast.
3) Hover over the thick arrows between plan operators and you'll see the actual data size. It is twice as large for the Window Spool, so when everything is in memory and CPU-bound, that becomes significant.
4) Your old-style query can be rewritten as:
select obig.g, A.g
from big as obig
    OUTER APPLY (
        SELECT TOP(1) ibig.g
        FROM big as ibig
        WHERE ibig.g < obig.g
        ORDER BY ibig.g DESC
    ) AS A
order by obig.g;
which is a bit more efficient (see the CPU column in the screenshot).
So: LAG is very efficient in the number of pages read, but it uses the CPU quite heavily.
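If CPU is what hurts, one more variant that may be worth benchmarking against the above is a ROW_NUMBER self-join, which avoids the Window Spool entirely. This is only a sketch; I have not measured it against the queries here (and note that SQL Server evaluates the CTE once per reference, so it costs two scans):

-- number the rows, then join each row to its predecessor;
-- trades the Window Spool for a join on the computed row numbers
with numbered as (
    select g, row_number() over (order by g) as rn
    from big
)
select cur.g, prev.g as xx
from numbered as cur
left join numbered as prev
    on prev.rn = cur.rn - 1
order by cur.g;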