max()与ORDER BY DESC + LIMIT 1的性能

作者：殉情放开那只小兔子 | 2023-09-10 13:41

如何解决《max()与ORDERBYDESC+LIMIT1的性能》经验，为你挑选了1个好方法。

我今天正在对一些慢速SQL查询进行故障排除,并且不太了解下面的性能差异:

当尝试max(timestamp)基于某些条件从数据表中提取时,使用MAX()比ORDER BY timestamp LIMIT 1匹配行存在时慢,但如果找不到匹配的行则相当快.

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms

上有索引(timestamp)和(sensor_id, timestamp),和我注意到,Postgres使用这两种情况下非常不同的查询计划和索引:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

所以我的两个问题是:

这种性能差异来自哪里？我已经看到了MIN/MAX与ORDER BY和LIMIT接受的答案,但这似乎并不适用于此.任何好的资源将不胜感激.

是否有更好的方法来增加所有情况下的性能(匹配行与没有匹配的行)而不是添加EXISTS检查？

编辑以解决以下评论中的问题.我保留了上面的初始查询计划以供将来参考:

表定义:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

请注意,我说ind_station_id的sensors刚才下面@欧文的建议后.计时还没有真正改变厉害,还是>1200ms在ORDER BY DESC + LIMIT 1案件和~0.9ms在MAX情况.

查询计划:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

所以就像之前的解释一样,ORDER BY做了一个Scan using timestamp_in on data,在这种MAX情况下没有做到.

Postgres版本:来自Ubuntu repos的Postgres: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

请注意,存在NOT NULL约束,因此ORDER BY不必对空行进行排序.

另请注意,我对差异的来源非常感兴趣.虽然不理想,但我可以使用EXISTS (<1ms)然后相对快速地检索数据SELECT (~11ms).

1> Erwin Brands..：

似乎没有索引sensor.station_id,这在这里最重要.

和之间存在实际差异.很多人似乎都错过了.NULL值排第一的降序排序.因此返回一行(如果存在),而aggregate函数忽略 NULL值并返回最新的非空时间戳.max()ORDER BY DESC + LIMIT 1ORDER BY timestamp DESC LIMIT 1timestamp IS NULLmax()

对于您的情况,由于您的列d.timestamp已定义NOT NULL(如您的更新显示),因此没有有效的区别.用于查询的索引DESC NULLS LAST和相同的子句应该仍然是最好的.我建议这些索引(我的查询建立在第二个上):ORDER BYLIMIT

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

您可以删除其他索引变体~~sensor_ind_timestamp~~,~~sensor_ind_timestamp_desc~~除非您有其他查询仍然需要它们(不太可能,但可能).

更重要的是,还有另一个困难:第一个表上的过滤器sensors返回很少但仍然(可能)多行.Postgres 希望rows=2在你添加的EXPLAIN输出中找到2行().
完美的技术将是第二个表的松散索引扫描data - 目前在Postgres 9.4(或Postgres 9.5)中没有实现.您可以通过各种方式重写查询以解决此限制.细节:

优化GROUP BY查询以检索每个用户的最新记录

最好的应该是:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

由于外部查询的样式大多不相关,您还可以:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

并且max()变体现在应该以尽可能快的速度执行:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

甚至,最短的:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

请注意双括号!

额外的优势LIMIT在LATERAL加入的是,你可以检索选定行的任意列,而不仅仅是最新的时间戳(一列).

有关:

为什么在PostgreSQL查询中排序DESC时会出现NULL值？

LATERAL和PostgreSQL中的子查询有什么区别？

选择每个GROUP BY组中的第一行？

优化分组最大查询

推荐阅读

程序员
Malloc是一组函数指针

如何解决《Malloc是一组函数指针》经验，为你挑选了1个好方法。 ... [详细]
程序员
如何检测Windows Mobile转换为连续模式？

如何解决《如何检测WindowsMobile转换为连续模式？》经验，为你挑选了0个好方法。 ... [详细]
程序员
如何模式匹配scala列表的head和tail类型？

如何解决《如何模式匹配scala列表的head和tail类型？》经验，为你挑选了1个好方法。 ... [详细]
程序员
闪亮的应用程序checkboxInput和conditionalPanel

如何解决《闪亮的应用程序checkboxInput和conditionalPanel》经验，为你挑选了1个好方法。 ... [详细]
程序员
在python3.5 osx上安装pygame

如何解决《在python3.5osx上安装pygame》经验，为你挑选了1个好方法。 ... [详细]
程序员
在MySql中执行查询时与only_full_group_by相关的错误

如何解决《在MySql中执行查询时与only_full_group_by相关的错误》经验，为你挑选了12个好方法。 ... [详细]
程序员
Redis引发错误:NOAUTH需要身份验证,但没有密码设置

如何解决《Redis引发错误:NOAUTH需要身份验证,但没有密码设置》经验，为你挑选了1个好方法。 ... [详细]
程序员
如何获取MPI中的物理机数量

如何解决《如何获取MPI中的物理机数量》经验，为你挑选了1个好方法。 ... [详细]
程序员
如何使用airplay以编程方式发现iOS设备列表和Connect for mirroring？

如何解决《如何使用airplay以编程方式发现iOS设备列表和Connectformirroring？》经验，为你挑选了1个好方法。 ... [详细]
程序员
有没有正确的方法来通过其路径获取文件？

如何解决《有没有正确的方法来通过其路径获取文件？》经验，为你挑选了0个好方法。 ... [详细]
程序员
如何获得Python解释器堆栈的当前深度？

如何解决《如何获得Python解释器堆栈的当前深度？》经验，为你挑选了2个好方法。 ... [详细]
程序员
在C#中如何检查日期是否自1990年以来每隔5天一次？

如何解决《在C#中如何检查日期是否自1990年以来每隔5天一次？》经验，为你挑选了1个好方法。 ... [详细]
程序员
使用epplus C#更新现有工作簿

如何解决《使用epplusC#更新现有工作簿》经验，为你挑选了2个好方法。 ... [详细]
程序员
如何使用REST API从Nexus中删除带有分类器的工件？

如何解决《如何使用RESTAPI从Nexus中删除带有分类器的工件？》经验，为你挑选了1个好方法。 ... [详细]
程序员
片段内容在导航抽屉中的ActionBar上重叠

如何解决《片段内容在导航抽屉中的ActionBar上重叠》经验，为你挑选了1个好方法。 ... [详细]
程序员
节点在Windows 10上找不到webpack模块

如何解决《节点在Windows10上找不到webpack模块》经验，为你挑选了1个好方法。 ... [详细]
程序员
添加外部Jar库以构建Cordova插件(Ionic Framework)

如何解决《添加外部Jar库以构建Cordova插件(IonicFramework)》经验，为你挑选了1个好方法。 ... [详细]
程序员
C++虚函数行为

如何解决《C++虚函数行为》经验，为你挑选了1个好方法。 ... [详细]
程序员
如何从Swift中的可选类型获取未包装的类型？

如何解决《如何从Swift中的可选类型获取未包装的类型？》经验，为你挑选了1个好方法。 ... [详细]
程序员
Android Studio)'anim'文件夹在哪里？

如何解决《AndroidStudio)'anim'文件夹在哪里？》经验，为你挑选了3个好方法。 ... [详细]

殉情放开那只小兔子

这个屌丝很懒，什么也没留下！

关注作者

Tags | 热门标签

RankList | 热门文章