How to compute the most frequent value with Spark

Time: 2017-01-28 21:16:19

Tags: apache-spark

Given this DataFrame:

+---+---+
| c1| c2|
+---+---+
|  A|  1|
|  A|  2|
|  A|  1|
|  B|  3|
|  B|  4|
|  B|  4|
+---+---+

For each value of c1, I want to find the most frequent value of c2:

+---+---+
| c1| c2|
+---+---+
|  A|  1|
|  B|  4|
+---+---+

This is my current code (Spark 1.6.0):

import org.apache.spark.sql.functions.{col, max, struct}

val df = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 1), ("B", 3), ("B", 4), ("B", 4))).toDF("c1", "c2")
df.groupBy("c1", "c2")
  .count()                                               // count occurrences of each (c1, c2) pair
  .groupBy("c1")
  .agg(max(struct(col("count"), col("c2"))).as("max"))   // per c1, take the (count, c2) struct with the largest count
  .select("c1", "max.c2")

Is there a better way to do this?

2 answers:

Answer 0 (score: 1)

If you are comfortable using Spark SQL, the following implementation works. Note that window functions in Spark SQL have been available since Spark 1.4.

df.registerTempTable("temp_table")

sqlContext.sql("""
  SELECT c1, c2 FROM
    (SELECT c1, c2, RANK() OVER (PARTITION BY c1 ORDER BY cnt DESC) AS rank FROM
      (SELECT c1, c2, COUNT(*) AS cnt FROM temp_table GROUP BY c1, c2) t0) t1
  WHERE t1.rank = 1
""").show()

Answer 1 (score: 0)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, max}

val df = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 1), ("B", 3), ("B", 4), ("B", 4))).toDF("c1", "c2")

// Window over each (c1, c2) pair, used to count how often the pair occurs
val overCategory = Window.partitionBy($"c1", $"c2").orderBy($"c2".desc)

val countd = df.withColumn("count", count($"c2").over(overCategory)).dropDuplicates

// Keep the rows whose count equals the maximum count within their c1 group
val freqCategory = countd.withColumn("max", max($"count").over(Window.partitionBy($"c1"))).filter($"count" === $"max").drop("count").drop("max")
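
For completeness, a minimal check (assuming the spark-shell session above) should print one most-frequent c2 per c1:

// Expected output rows: (A, 1) and (B, 4)
freqCategory.show()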