我今天正在对一些慢速SQL查询进行故障排除,并且不太了解下面的性能差异:
当尝试max(timestamp)
基于某些条件从数据表中提取时,使用MAX()
比ORDER BY timestamp LIMIT 1
匹配行存在时慢,但如果找不到匹配的行则相当快.
SELECT timestamp FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 4 ORDER BY timestamp DESC LIMIT 1; (0 rows) Time: 1314.544 ms SELECT timestamp FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 5 ORDER BY timestamp DESC LIMIT 1; (1 row) Time: 10.890 ms SELECT MAX(timestamp) FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 4; (0 rows) Time: 0.869 ms SELECT MAX(timestamp) FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 5; (1 row) Time: 84.087 ms
上有索引(timestamp)
和(sensor_id, timestamp)
,和我注意到,Postgres使用这两种情况下非常不同的查询计划和索引:
QUERY PLAN (ORDER BY) -------------------------------------------------------------------------------------------------------- Limit (cost=0.43..9.47 rows=1 width=8) -> Nested Loop (cost=0.43..396254.63 rows=43823 width=8) Join Filter: (data.sensor_id = sensors.id) -> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12) -> Materialize (cost=0.00..6.70 rows=2 width=4) -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4) Filter: (station_id = 4) (7 rows) QUERY PLAN (MAX) ---------------------------------------------------------------------------------------------------------- Aggregate (cost=3680.59..3680.60 rows=1 width=8) -> Nested Loop (cost=0.43..3571.03 rows=43823 width=8) -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4) Filter: (station_id = 4) -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) Index Cond: (sensor_id = sensors.id) (6 rows)
所以我的两个问题是:
这种性能差异来自哪里?我已经看到了MIN/MAX与ORDER BY和LIMIT接受的答案,但这似乎并不适用于此.任何好的资源将不胜感激.
是否有更好的方法来增加所有情况下的性能(匹配行与没有匹配的行)而不是添加EXISTS
检查?
编辑以解决以下评论中的问题.我保留了上面的初始查询计划以供将来参考:
表定义:
Table "public.sensors" Column | Type | Modifiers ----------------------+------------------------+----------------------------------------------------------------- id | integer | not null default nextval('sensors_id_seq'::regclass) station_id | integer | not null .... Indexes: "sensor_primary" PRIMARY KEY, btree (id) "ind_station_id" btree (station_id, id) "ind_station" btree (station_id) Table "public.data" Column | Type | Modifiers -----------+--------------------------+------------------------------------------------------------------ id | integer | not null default nextval('data_id_seq'::regclass) timestamp | timestamp with time zone | not null sensor_id | integer | not null avg | integer | Indexes: "timestamp_ind" btree ("timestamp" DESC) "sensor_ind" btree (sensor_id) "sensor_ind_timestamp" btree (sensor_id, "timestamp") "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
请注意,我说ind_station_id
的sensors
刚才下面@欧文的建议后.计时还没有真正改变厉害,还是>1200ms
在ORDER BY DESC + LIMIT 1
案件和~0.9ms
在MAX
情况.
查询计划:
QUERY PLAN (ORDER BY) ---------------------------------------------------------------------------------------------------------- Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1) Buffers: shared hit=3418066 read=47326 -> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1) Join Filter: (data.sensor_id = sensors.id) Buffers: shared hit=3418066 read=47326 -> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1) Buffers: shared hit=3418065 read=47326 -> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976) Buffers: shared hit=1 -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1) Index Cond: (station_id = 4) Heap Fetches: 0 Buffers: shared hit=1 Planning time: 0.478 ms Execution time: 2161.090 ms (15 rows) QUERY (MAX) ---------------------------------------------------------------------------------------------------------- Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1) Buffers: shared hit=1 -> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1) Buffers: shared hit=1 -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1) Index Cond: (station_id = 4) Heap Fetches: 0 Buffers: shared hit=1 -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed) Index Cond: (sensor_id = sensors.id) Heap Fetches: 0 Planning time: 0.435 ms Execution time: 0.048 ms (13 rows)
所以就像之前的解释一样,ORDER BY
做了一个Scan using timestamp_in on data
,在这种MAX
情况下没有做到.
Postgres版本:来自Ubuntu repos的Postgres: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
请注意,存在NOT NULL
约束,因此ORDER BY
不必对空行进行排序.
另请注意,我对差异的来源非常感兴趣.虽然不理想,但我可以使用EXISTS (<1ms)
然后相对快速地检索数据SELECT (~11ms)
.
似乎没有索引sensor.station_id
,这在这里最重要.
和之间存在实际差异.很多人似乎都错过了.NULL值排第一的降序排序.因此返回一行(如果存在),而aggregate函数忽略 NULL值并返回最新的非空时间戳.max()
ORDER BY DESC + LIMIT 1
ORDER BY timestamp DESC LIMIT 1
timestamp IS NULL
max()
对于您的情况,由于您的列d.timestamp
已定义NOT NULL
(如您的更新显示),因此没有有效的区别.用于查询的索引DESC NULLS LAST
和相同的子句应该仍然是最好的.我建议这些索引(我的查询建立在第二个上):ORDER BY
LIMIT
sensor(station_id, id) data(sensor_id, timestamp DESC NULLS LAST)
您可以删除其他索引变体,sensor_ind_timestamp
除非您有其他查询仍然需要它们(不太可能,但可能).sensor_ind_timestamp_desc
更重要的是,还有另一个困难:第一个表上的过滤器sensors
返回很少但仍然(可能)多行.Postgres 希望rows=2
在你添加的EXPLAIN
输出中找到2行().
完美的技术将是第二个表的松散索引扫描data
- 目前在Postgres 9.4(或Postgres 9.5)中没有实现.您可以通过各种方式重写查询以解决此限制.细节:
优化GROUP BY查询以检索每个用户的最新记录
最好的应该是:
SELECT d.timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT timestamp FROM data WHERE sensor_id = s.id ORDER BY timestamp DESC NULLS LAST LIMIT 1 ) d WHERE s.station_id = 4 ORDER BY d.timestamp DESC NULLS LAST LIMIT 1;
由于外部查询的样式大多不相关,您还可以:
SELECT max(d.timestamp) AS timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT timestamp FROM data WHERE sensor_id = s.id ORDER BY timestamp DESC NULLS LAST LIMIT 1 ) d WHERE s.station_id = 4;
并且max()
变体现在应该以尽可能快的速度执行:
SELECT max(d.timestamp) AS timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT max(timestamp) AS timestamp FROM data WHERE sensor_id = s.id ) d WHERE s.station_id = 4;
甚至,最短的:
SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp FROM sensors s WHERE station_id = 4;
请注意双括号!
额外的优势LIMIT
在LATERAL
加入的是,你可以检索选定行的任意列,而不仅仅是最新的时间戳(一列).
有关:
为什么在PostgreSQL查询中排序DESC时会出现NULL值?
LATERAL和PostgreSQL中的子查询有什么区别?
选择每个GROUP BY组中的第一行?
优化分组最大查询