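I found this sample interview question and answer posted on Toptal, but I don't really understand the code. How does UNION ALL turn into a UNION (distinct)? Also, why is this code faster?
Question
Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this? You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually the same as UNION) by running a query like this:
Answer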
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.
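But in that example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that is hard to optimize: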
SELECT * FROM mytable WHERE a=X OR b=Y
This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table-scan.
Hence the trick of using UNION to separate into two queries for one term each. Each subquery can use the best index for each search term. Then combine the results using UNION.
But the two subsets may overlap, because some rows where b=Y may also have a=X, in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.
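The overlap is easy to reproduce. Below is a minimal sketch using Python's built-in sqlite3 module; the table contents, the id column, and the values chosen for X and Y are assumptions made up for illustration, not part of the original question.

```python
import sqlite3

# Hypothetical demo table: columns a and b follow the post,
# the id column and the row values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 9), (3, 7, 2), (4, 7, 9)],
)

X, Y = 1, 2  # stand-ins for the X and Y placeholders in the post

# Naive split of "a = X OR b = Y" into two searches glued with UNION ALL.
naive_rows = conn.execute(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ?",
    (X, Y),
).fetchall()

naive_ids = sorted(row[0] for row in naive_rows)
print(naive_ids)  # [1, 1, 2, 3]
```

Row 1 satisfies both a=X and b=Y, so it is returned once by each half of the query.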
SELECT * FROM mytable WHERE a=X UNION DISTINCT SELECT * FROM mytable WHERE b=Y
UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as they do when you use SELECT DISTINCT ...
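To make the cost difference concrete, here is a toy model in Python of the two operations. It is only a sketch of one possible strategy (sort, then drop adjacent duplicates); real engines may use other approaches such as hashing, and the function names here are hypothetical.

```python
def union_all(left, right):
    """Concatenate both result sets without comparing any rows."""
    return list(left) + list(right)


def union_distinct(left, right):
    """Toy sort-based duplicate elimination."""
    combined = sorted(list(left) + list(right))  # the expensive O(n log n) step
    result = []
    for row in combined:
        # After sorting, equal rows are adjacent, so one look-back is enough.
        if not result or result[-1] != row:
            result.append(row)
    return result


# Made-up rows of the form (id, a, b); the first row appears in both inputs.
left = [(1, 1, 2), (2, 1, 9)]
right = [(1, 1, 2), (3, 7, 2)]

all_rows = union_all(left, right)
distinct_rows = union_distinct(left, right)
print(len(all_rows), len(distinct_rows))  # 4 3
```

The concatenation does no per-row comparison at all, which is the work the UNION ALL rewrite avoids.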
It also feels like even more "wasted" work if the two subsets of rows you are unioning have a lot of rows in common. That is a lot of rows to eliminate.
But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct, that is, if you guarantee there is no overlap. If you can rely on that, duplicate elimination would always be a no-op, so the query can skip that step and with it the costly sorting.
If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X, then there can be no row that is in both sets.
The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
So the query achieves an optimized search for two OR terms, without producing duplicates, and requiring no UNION DISTINCT operation.
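One caveat worth checking before relying on this rewrite: if column a is nullable (the post does not say whether it is), then a!=X evaluates to NULL rather than true for rows where a is NULL, and those rows are filtered out of the second search even when b=Y. The sketch below, again using sqlite3 with made-up data, shows the difference and one portable way to keep such rows.

```python
import sqlite3

# Hypothetical data: row 2 has a NULL in column a but matches b = Y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, None, 2), (3, 7, 2)],
)

X, Y = 1, 2  # stand-ins for the X and Y placeholders in the post


def ids(sql, params):
    """Return the sorted id values produced by a query."""
    return sorted(row[0] for row in conn.execute(sql, params))


# Reference result: the original OR query.
or_ids = ids("SELECT * FROM mytable WHERE a = ? OR b = ?", (X, Y))

# Rewrite from the post: row 2 is lost because NULL != X is not true.
plain_ids = ids(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
)

# Null-aware variant: explicitly keep rows where a IS NULL.
null_safe_ids = ids(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = ? AND (a != ? OR a IS NULL)",
    (X, Y, X),
)

print(or_ids, plain_ids, null_safe_ids)  # [1, 2, 3] [1, 3] [1, 2, 3]
```

If a is declared NOT NULL, the extra condition is unnecessary and the query from the post is sufficient as written.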