What is the "correct" query to get a cumulative sum in MySQL?
I have a table in which I store information about files; one of the columns holds the file size in bytes. (The actual files are stored on disk somewhere.)
I would like to get the cumulative file size, like this:
+------------+---------+--------+----------------+
| fileInfoId | groupId | size   | cumulativeSize |
+------------+---------+--------+----------------+
|          1 |       1 | 522120 |         522120 |
|          2 |       2 | 316042 |         316042 |
|          4 |       2 | 711084 |        1027126 |
|          5 |       2 | 697002 |        1724128 |
|          6 |       2 | 663425 |        2387553 |
|          7 |       2 | 739553 |        3127106 |
|          8 |       2 | 700938 |        3828044 |
|          9 |       2 | 695614 |        4523658 |
|         10 |       2 | 744204 |        5267862 |
|         11 |       2 | 609022 |        5876884 |
|        ... |     ... |    ... |            ... |
+------------+---------+--------+----------------+
20000 rows in set (19.2161 sec.)
Currently I use the following query to get the results above:
SELECT
  a.fileInfoId
, a.groupId
, a.size
, SUM(b.size) AS cumulativeSize
FROM fileInfo AS a
LEFT JOIN fileInfo AS b USING(groupId)
WHERE a.fileInfoId >= b.fileInfoId
GROUP BY a.fileInfoId
ORDER BY a.groupId, a.fileInfoId
However, my solution is extremely slow (around 19 seconds without caching).
EXPLAIN gives the following execution details:
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| id | select_type | table | type  | possible_keys     | key       | key_len | ref            | rows  | Extra       |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
|  1 | SIMPLE      | a     | index | PRIMARY,foreignId | PRIMARY   | 4       | NULL           | 14905 |             |
|  1 | SIMPLE      | b     | ref   | PRIMARY,foreignId | foreignId | 4       | db.a.foreignId |    36 | Using where |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
How can I optimize the above query?
Update
I have updated the question to provide the table structure and a procedure to fill the table with 20,000 records of test data.
CREATE TABLE `fileInfo` (
    `fileInfoId` int(10) unsigned NOT NULL AUTO_INCREMENT
  , `groupId` int(10) unsigned NOT NULL
  , `name` varchar(128) NOT NULL
  , `size` int(10) unsigned NOT NULL
  , PRIMARY KEY (`fileInfoId`)
  , KEY `groupId` (`groupId`)
) ENGINE=InnoDB;

delimiter $$
DROP PROCEDURE IF EXISTS autofill$$
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE gid INT DEFAULT 0;
    DECLARE nam char(20);
    DECLARE siz INT DEFAULT 0;
    WHILE i < 20000 DO
        SET gid = FLOOR(RAND() * 250);
        SET nam = CONV(FLOOR(RAND() * 10000000000000), 20, 36);
        SET siz = FLOOR((RAND() * 1024 * 1024));
        INSERT INTO `fileInfo` (`groupId`, `name`, `size`) VALUES(gid, nam, siz);
        SET i = i + 1;
    END WHILE;
END;$$
delimiter ;
CALL autofill();
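(Side note, not part of the original question: the self-join above reads every earlier row of the same group for each row, so one thing worth trying is a covering index that also includes size, letting the b side of the join be served from the index alone. The index name below is made up for illustration; the join still compares each row against all earlier rows in its group, so it remains quadratic per group.)

-- Assumption: a covering index so the self-joined b rows are read from the index only.
ALTER TABLE fileInfo ADD INDEX idx_group_file_size (groupId, fileInfoId, size);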
The question linked as a possible duplicate is not the same question. My question has an extra column, and because of that extra groupId column the accepted answer there does not work for my problem. (Maybe it could be adapted to work, but I don't know how, hence my question.)
You can use a variable - it's far quicker than any join:
SELECT id, size, @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
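A side note that is not in the original answer: the cross join with `(SELECT @total := 0)` only serves to initialise the user variable within a single statement. An equivalent, arguably clearer, two-statement sketch (assuming the table1 test table defined below) would be:

-- Initialise the running total explicitly instead of via the derived table.
SET @total := 0;

-- Same running total; rows are accumulated in the order MySQL returns them.
SELECT id, size, @total := @total + size AS cumulativeSize
FROM table1;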
Here's a quick test case on a Pentium III with 128MB RAM running Debian 5.0:
Create the table:
DROP TABLE IF EXISTS `table1`;

CREATE TABLE `table1` (
    `id` int(11) NOT NULL auto_increment,
    `size` int(11) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Fill it with 20,000 random numbers:
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 20000 DO
        INSERT INTO table1 (size) VALUES (FLOOR((RAND() * 1000)));
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;

CALL autofill();
Check the row count:
SELECT COUNT(*) FROM table1;

+----------+
| COUNT(*) |
+----------+
|    20000 |
+----------+
Run the cumulative total query:
SELECT id, size, @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total:=0) AS t;

+-------+------+----------------+
| id    | size | cumulativeSize |
+-------+------+----------------+
|     1 |  226 |            226 |
|     2 |  869 |           1095 |
|     3 |  668 |           1763 |
|     4 |  733 |           2496 |
...
| 19997 |  966 |       10004741 |
| 19998 |  522 |       10005263 |
| 19999 |  713 |       10005976 |
| 20000 |    0 |       10005976 |
+-------+------+----------------+
20000 rows in set (0.07 sec)
UPDATE
I had missed the grouping by groupId in the original question, and that certainly made things a little trickier. I then wrote a solution which used a temporary table, but I didn't like it - it was messy and overly complicated. I went away, did some more research, and came up with something far simpler and faster.
I can't claim all the credit for this - in fact, I can barely claim any at all, as it is just a modified version of Emulate row number from Common MySQL Queries.
It's beautifully simple, elegant, and very quick:
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
You can remove the outer `SELECT ... AS tmp` if you don't mind the prev_groupId column being returned. I found that it ran marginally faster without it.
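For reference (a sketch, not spelled out in the original answer), the shorter form without the outer query looks like this; the extra prev_groupId column simply comes back with the result set:

-- Same per-group running total, without the wrapping SELECT ... AS tmp.
SELECT
    fileInfoId,
    groupId,
    name,
    size,
    @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
    @prev_groupId := groupId AS prev_groupId
FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
ORDER BY groupId;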
Here's a quick test case:
INSERT INTO `fileInfo` VALUES
( 1, 3, 'name0', '10'),
( 5, 3, 'name1', '10'),
( 7, 3, 'name2', '10'),
( 8, 1, 'name3', '10'),
( 9, 1, 'name4', '10'),
(10, 2, 'name5', '10'),
(12, 4, 'name6', '10'),
(20, 4, 'name7', '10'),
(21, 4, 'name8', '10'),
(25, 5, 'name9', '10');

SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;

+------------+---------+-------+------+----------------+
| fileInfoId | groupId | name  | size | cumulativeSize |
+------------+---------+-------+------+----------------+
|          8 |       1 | name3 |   10 |             10 |
|          9 |       1 | name4 |   10 |             20 |
|         10 |       2 | name5 |   10 |             10 |
|          1 |       3 | name0 |   10 |             10 |
|          5 |       3 | name1 |   10 |             20 |
|          7 |       3 | name2 |   10 |             30 |
|         12 |       4 | name6 |   10 |             10 |
|         20 |       4 | name7 |   10 |             20 |
|         21 |       4 | name8 |   10 |             30 |
|         25 |       5 | name9 |   10 |             10 |
+------------+---------+-------+------+----------------+
And here's a sample of the last few rows from a 20,000-row table:
|      19481 |     248 | 8CSLJX22RCO | 1037469 |       51270389 |
|      19486 |     248 | 1IYGJ1UVCQE |  937150 |       52207539 |
|      19817 |     248 | 3FBU3EUSE1G |  616614 |       52824153 |
|      19871 |     248 | 4N19QB7PYT  |  153031 |       52977184 |
|        132 |     249 | 3NP9UGMTRTD |  828073 |         828073 |
|        275 |     249 | 86RJM39K72K |  860323 |        1688396 |
|        802 |     249 | 16Z9XADLBFI |  623030 |        2311426 |
...
|      19661 |     249 | ADZXKQUI0O3 |  837213 |       39856277 |
|      19870 |     249 | 9AVRTI3QK6I |  331342 |       40187619 |
|      19972 |     249 | 1MTAEE3LLEM | 1027714 |       41215333 |
+------------+---------+-------------+---------+----------------+
20000 rows in set (0.31 sec)
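A closing note that is not part of the original answer: on MySQL 8.0 and later (which did not exist at the time of writing), the same per-group running total can be expressed with a window function and no user variables at all. A minimal sketch against the fileInfo table above:

-- Assumes MySQL 8.0+: SUM() over a per-group window gives the running total.
SELECT
    fileInfoId,
    groupId,
    name,
    size,
    SUM(size) OVER (PARTITION BY groupId ORDER BY fileInfoId) AS cumulativeSize
FROM fileInfo
ORDER BY groupId, fileInfoId;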