Question

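I have a column, score, which is an integer between 1 and 5. I'm trying to select n (in this case 2000) samples from each score. My own hacking and other SO questions led me to construct the following query: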

select * from (select text, score from data where score= 1 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 2 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 3 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 4 and LENGTH(text) > 45 limit 2000)
union
select * from (select text, score from data where score= 5 and LENGTH(text) > 45 limit 2000)

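This feels like the worst possible way to do this. What's more, when I run each query individually it gives me the expected 2k results, but when I run the union I get fewer than 10k rows. I'm looking for help optimizing this query a bit, but more importantly I want to understand why the union returns the wrong number of results.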

Answer 1

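As for why your query returns the wrong number of results, I'd bet that the rows in the result sets returned by the individual queries are not distinct. When you use union, it returns only the distinct rows across the entire combined result set.

Try changing it to union all: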

select * from (select text, score from data where score= 1 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 2 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 3 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 4 and LENGTH(text) > 45 limit 2000)
union all
select * from (select text, score from data where score= 5 and LENGTH(text) > 45 limit 2000)

Here's a condensed demo showing the difference.
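
For readers without access to the linked demo, here is a minimal sketch of the same effect using Python's built-in sqlite3 module. The table contents, the limit of 2, and the helper name count_rows are made up for illustration, and the length filter is left out to keep the rows short. It assumes a database such as SQLite that accepts an unaliased subquery in FROM, as the query in the question does.

```python
import sqlite3

# In-memory database with a tiny stand-in for the "data" table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, text TEXT, score INTEGER)")
rows = [
    ("broke after one day", 1),
    ("broke after one day", 1),        # exact duplicate of the row above
    ("it is fine", 3),
    ("great product, works well", 5),
    ("great product, works well", 5),  # exact duplicate of the row above
]
conn.executemany("INSERT INTO data (text, score) VALUES (?, ?)", rows)


def count_rows(set_operator):
    """Count the rows produced when the per-score subqueries are combined
    with the given set operator ("UNION" or "UNION ALL")."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT * FROM (SELECT text, score FROM data WHERE score = 1 LIMIT 2)
            {set_operator}
            SELECT * FROM (SELECT text, score FROM data WHERE score = 3 LIMIT 2)
            {set_operator}
            SELECT * FROM (SELECT text, score FROM data WHERE score = 5 LIMIT 2)
        )
    """
    return conn.execute(sql).fetchone()[0]


union_count = count_rows("UNION")          # duplicates collapsed
union_all_count = count_rows("UNION ALL")  # every selected row kept
print(union_count, union_all_count)        # prints: 3 5
```

The subqueries select five rows in total, but two pairs are identical in both text and score, so union keeps only three rows while union all keeps all five.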

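If you have a primary key, such as an auto-increment column, here is another way to generate a row_number for each score group (this assumes id is the primary key):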

select text, score
from (
  select text, score, 
         (select count(*) from data b 
          where a.id >= b.id and 
                a.score = b.score and 
                length(b.text) > 45) rn
  from data a
  where length(text) > 45
  ) t
where rn <= 2000
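
As a rough check of the query above, the following sketch runs it against a small in-memory SQLite table with the cap lowered from 2000 to 2. The sample rows and the names N and per_score are assumptions for illustration only. The correlated subquery counts, for each row, how many qualifying rows with the same score have an id less than or equal to its own, which acts as a per-score row number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, text TEXT, score INTEGER)")

# Hypothetical sample: three long texts for score 1 plus one that is too short,
# three long texts for score 2, and a single long text for score 3.
sample = [
    ("a" * 50, 1), ("b" * 50, 1), ("c" * 50, 1), ("short", 1),
    ("d" * 50, 2), ("e" * 50, 2), ("f" * 50, 2),
    ("g" * 50, 3),
]
conn.executemany("INSERT INTO data (text, score) VALUES (?, ?)", sample)

N = 2  # rows to keep per score (2000 in the original query)

query = """
    SELECT text, score
    FROM (
        SELECT text, score,
               (SELECT COUNT(*) FROM data b
                WHERE a.id >= b.id
                  AND a.score = b.score
                  AND LENGTH(b.text) > 45) AS rn
        FROM data a
        WHERE LENGTH(text) > 45
    ) t
    WHERE rn <= ?
"""
result = conn.execute(query, (N,)).fetchall()

# Tally how many rows came back for each score.
per_score = {}
for _text, score in result:
    per_score[score] = per_score.get(score, 0) + 1
print(per_score)
```

Each score contributes at most N rows, and a score with fewer qualifying rows simply returns what it has. Note that this version always keeps the rows with the lowest ids in each group, whereas limit without an order by leaves the choice of rows up to the database.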

Answer 2

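By default, UNION compares all rows and returns only the distinct ones. That is why you are getting fewer than 10k. As sgeddes said, use UNION ALL to get all 10k rows, including duplicates. You do want the duplicate rows, don't you?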

Select n samples from each category
