Consider a table with duplicate rows:
+-----+----+---+
|asin |ctx |fo |
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
If I group the records to count all the duplicates, I get:
+-----+----+---+-----+
| asin| ctx| fo|count|
+-----+----+---+-----+
|ASIN1|CTX1|FO1|    2|
|ASIN1|CTX2|FO2|    2|
|ASIN1|CTX1|FO2|    1|
|ASIN1|CTX3|FO1|    1|
|ASIN1|CTX2|FO1|    1|
|ASIN1|CTX2|FO3|    1|
|ASIN1|CTX3|FO3|    1|
+-----+----+---+-----+
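The grouped counts can be reproduced with a plain GROUP BY over all three columns. The following is only a minimal sketch: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, and the table name `spark_df` is hypothetical, since the question does not name the raw table.

```python
import sqlite3

# The nine rows from the first table in the question.
rows = [
    ("ASIN1", "CTX1", "FO1"),
    ("ASIN1", "CTX1", "FO1"),
    ("ASIN1", "CTX1", "FO2"),
    ("ASIN1", "CTX2", "FO1"),
    ("ASIN1", "CTX2", "FO2"),
    ("ASIN1", "CTX2", "FO2"),
    ("ASIN1", "CTX2", "FO3"),
    ("ASIN1", "CTX3", "FO1"),
    ("ASIN1", "CTX3", "FO3"),
]

conn = sqlite3.connect(":memory:")
# `spark_df` is a hypothetical name for the raw table.
conn.execute("CREATE TABLE spark_df (asin TEXT, ctx TEXT, fo TEXT)")
conn.executemany("INSERT INTO spark_df VALUES (?, ?, ?)", rows)

# Count how many times each (asin, ctx, fo) combination occurs.
counts = conn.execute(
    """
    SELECT asin, ctx, fo, COUNT(*) AS "count"
    FROM spark_df
    GROUP BY asin, ctx, fo
    ORDER BY "count" DESC, ctx, fo
    """
).fetchall()

for row in counts:
    print(row)
```

The extra `ctx, fo` sort keys are only there to make the row order deterministic; the order shown in the question's table is whatever Spark happened to return.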
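Then, if we regroup the records on (asin, ctx) and try to pick the row with max(count), one popular approach (let's call it the popular approach), described in the answer at https://stackoverflow.com/a/6792744/977038, is to aggregate by count and then join the result back to the original table, something like: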
SELECT df_count.asin,
       df_count.ctx,
       max(df_count.fo) as fo,
       df_count.count as max_count
FROM db.spark_df_count as df_count
INNER JOIN (
    SELECT asin,
           ctx,
           max(count) as max_count
    FROM db.spark_df_count
    GROUP BY asin, ctx
) as df_count_max
    ON df_count_max.asin = df_count.asin
   AND df_count_max.ctx = df_count.ctx
   AND df_count_max.max_count = count
GROUP BY df_count.asin, df_count.ctx, df_count.count
ORDER BY max_count DESC
which gives:
+-----+----+---+---------+
| asin| ctx| fo|max_count|
+-----+----+---+---------+
|ASIN1|CTX2|FO2|        2|
|ASIN1|CTX1|FO1|        2|
|ASIN1|CTX3|FO3|        1|
+-----+----+---+---------+
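To check that result independently, the join-back query can be run against the grouped counts. This is a hedged sketch rather than the exact Spark setup: it uses Python's built-in sqlite3 instead of Spark SQL, drops the `db.` schema prefix, and quotes the `count` column defensively.

```python
import sqlite3

# Grouped counts from the question: (asin, ctx, fo, count).
counted = [
    ("ASIN1", "CTX1", "FO1", 2),
    ("ASIN1", "CTX2", "FO2", 2),
    ("ASIN1", "CTX1", "FO2", 1),
    ("ASIN1", "CTX3", "FO1", 1),
    ("ASIN1", "CTX2", "FO1", 1),
    ("ASIN1", "CTX2", "FO3", 1),
    ("ASIN1", "CTX3", "FO3", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE spark_df_count (asin TEXT, ctx TEXT, fo TEXT, "count" INTEGER)'
)
conn.executemany("INSERT INTO spark_df_count VALUES (?, ?, ?, ?)", counted)

# Inner query: highest count per (asin, ctx).
# Join back: keep only rows that reach that count.
# Outer max(fo): break ties between rows sharing the highest count.
popular = conn.execute(
    """
    SELECT df_count.asin,
           df_count.ctx,
           max(df_count.fo) AS fo,
           df_count."count" AS max_count
    FROM spark_df_count AS df_count
    INNER JOIN (
        SELECT asin, ctx, max("count") AS max_count
        FROM spark_df_count
        GROUP BY asin, ctx
    ) AS df_count_max
        ON df_count_max.asin = df_count.asin
       AND df_count_max.ctx = df_count.ctx
       AND df_count_max.max_count = df_count."count"
    GROUP BY df_count.asin, df_count.ctx, df_count."count"
    ORDER BY max_count DESC, df_count.ctx
    """
).fetchall()

for row in popular:
    print(row)
```

Note the tie in CTX3: both FO1 and FO3 have a count of 1, so both survive the join, and it is the outer `max(fo)` that settles on FO3.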
Now, I have come up with an alternative approach that uses the built-in array functions and gives a similar result:
SELECT asin,
       ctx,
       array_max(collect_set(struct(count, fo)))['fo'] as fo,
       max(count) as max_count
FROM db.spark_df_count
GROUP BY asin, ctx
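The alternative relies on structs being compared field by field, left to right, so the largest `struct(count, fo)` is the one with the highest count, with ties broken by the larger `fo`. The sketch below only emulates that logic in plain Python, on the assumption that Python tuple comparison mirrors Spark's struct ordering; it is not Spark code, and the variable names are illustrative.

```python
from collections import defaultdict

# Grouped counts from the question: (asin, ctx, fo, count).
counted = [
    ("ASIN1", "CTX1", "FO1", 2),
    ("ASIN1", "CTX2", "FO2", 2),
    ("ASIN1", "CTX1", "FO2", 1),
    ("ASIN1", "CTX3", "FO1", 1),
    ("ASIN1", "CTX2", "FO1", 1),
    ("ASIN1", "CTX2", "FO3", 1),
    ("ASIN1", "CTX3", "FO3", 1),
]

# Emulates collect_set(struct(count, fo)) per (asin, ctx) group.
groups = defaultdict(set)
for asin, ctx, fo, count in counted:
    groups[(asin, ctx)].add((count, fo))

# Emulates array_max(...): tuples compare on count first, then on fo.
alternative = []
for (asin, ctx), structs in sorted(groups.items()):
    max_count, fo = max(structs)
    alternative.append((asin, ctx, fo, max_count))

for row in alternative:
    print(row)
```

On this data the emulation picks the same three rows as the join-based query, including FO3 for the CTX3 tie, because the tie-break on `fo` plays the same role as the outer `max(fo)` there.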
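So I would like to ask the community: would you prefer the alternative approach over the popular one?
Note: This question is tagged SQL specifically. I am looking for SQL-specific answers, along with the pitfalls of each one.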