Question

I have the following data:

+-----------------+-------------+----------+-----------------+-------------+----------+
| firstgroupId    | firstCount  |  firstId | secondclusterId | secondCount | secondId |
+-----------------+-------------+----------+-----------------+-------------+----------+
| 100001          | 3           | 3000001  | 100003          | 4           | 3000001  |
| 100001          | 3           | 3000002  | 100003          | 4           | 3000002  |
| 100001          | 3           | 3000003  | 100003          | 4           | 3000003  |
| 100002          | 2           | 3000004  | 100003          | 4           | 3000004  |
| 100002          | 2           | 3000005  | 100002          | 4           | 3000005  |
| 100003          | 3           | 3000006  | 100002          | 4           | 3000006  |
| 100003          | 3           | 3000007  | 100002          | 4           | 3000007  |
| 100003          | 3           | 3000008  | 100002          | 4           | 3000008  |
| 100004          | 2           | 3000009  | 100005          | 2           | 3000009  |
| 100004          | 2           | 3000010  | 100005          | 2           | 3000010  |
+-----------------+-------------+----------+-----------------+-------------+----------+

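Here we can see:

firstId 3000001, 3000002, 3000003 are grouped together, but secondId 3000001, 3000002, 3000003, 3000004 are grouped together. Here I need 3000004 as the result, because it is the odd one out.
The same goes for 3000005: 3000005 and 3000004 are grouped together in the first part but not in the second part.

How can I find the odd ones out by comparing the two sets of IDs?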

Answer 1

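You seem to want the IDs where the overlap between the group and the cluster is not the "maximum" overlap. If I understand this correctly, I think this does what you want: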

select t.*
from (select t.*,
             -- rank() gives every row tied for the largest overlap rank 1
             rank() over (partition by firstgroupId order by overlap_count desc) as overlap_rank
      from (select t.*,
                   -- number of rows sharing this (group, cluster) pair
                   count(*) over (partition by firstgroupId, secondclusterId) as overlap_count
            from t
           ) t
     ) t
where overlap_rank > 1;

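If you know you are looking for singleton outliers, you can keep just the innermost subquery as a derived table and filter it in the outer query with where overlap_count = 1.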
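
As a rough check of the singleton variant, the sketch below loads the sample rows from the question into an in-memory SQLite database and runs the innermost subquery with the overlap_count = 1 filter. SQLite (3.25 or later, for window functions) stands in for Vertica here purely as an assumption for illustration; the helper name find_singletons and the reduced three-column table are hypothetical and not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (firstgroupId, firstId, secondclusterId).
# The count columns and secondId are left out because the query does not use them.
ROWS = [
    (100001, 3000001, 100003),
    (100001, 3000002, 100003),
    (100001, 3000003, 100003),
    (100002, 3000004, 100003),
    (100002, 3000005, 100002),
    (100003, 3000006, 100002),
    (100003, 3000007, 100002),
    (100003, 3000008, 100002),
    (100004, 3000009, 100005),
    (100004, 3000010, 100005),
]


def find_singletons(rows):
    """Return the firstIds whose (group, cluster) pair occurs exactly once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (firstgroupId integer, firstId integer, secondclusterId integer)"
    )
    conn.executemany("insert into t values (?, ?, ?)", rows)
    query = """
        select firstId
        from (select t.*,
                     count(*) over (partition by firstgroupId, secondclusterId) as overlap_count
              from t
             ) t
        where overlap_count = 1
        order by firstId
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


print(find_singletons(ROWS))  # [3000004, 3000005]
```

On this sample the two rows of group 100002 each land in a different cluster, so both have an overlap of 1 and both are reported, which matches the two IDs the question calls out.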

Vertica query to get the difference between two groups by result