Given this DataFrame:
+---+---+
| c1| c2|
+---+---+
| A| 1|
| A| 2|
| A| 1|
| B| 3|
| B| 4|
| B| 4|
+---+---+
I want to compute, for each value of c1, the most frequent value of c2:
+---+---+
| c1| c2|
+---+---+
| A| 1|
| B| 4|
+---+---+
This is my current code (Spark 1.6.0):
import org.apache.spark.sql.functions.{col, max, struct}

val df = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 1), ("B", 3), ("B", 4), ("B", 4))).toDF("c1", "c2")

// Count each (c1, c2) pair, then pick the c2 with the highest count per c1
// by taking the max of a (count, c2) struct, which compares count first.
df.groupBy("c1", "c2")
  .count()
  .groupBy("c1")
  .agg(max(struct(col("count"), col("c2"))).as("max"))
  .select("c1", "max.c2")
Is there a better way to do this?
Answer 0 (score: 1)
If you are comfortable with Spark SQL, the following implementation works. Note that window functions are available in Spark SQL starting from Spark 1.4.
df.registerTempTable("temp_table")

sqlContext.sql("""
  SELECT c1, c2 FROM
    (SELECT c1, c2, RANK() OVER (PARTITION BY c1 ORDER BY cnt DESC) as rank FROM (
      SELECT c1, c2, count(*) as cnt FROM temp_table GROUP BY c1, c2) t0) t1
  WHERE t1.rank = 1
""").show()
Answer 1 (score: 0)
val df = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 1), ("B", 3), ("B", 4), ("B", 4))).toDF("c1", "c2")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, max}

// Count each (c1, c2) pair over a window and keep one row per pair,
// then keep the rows whose count equals the per-c1 maximum.
val overCategory = Window.partitionBy($"c1", $"c2").orderBy($"c2".desc)
val countd = df.withColumn("count", count($"c2").over(overCategory)).dropDuplicates
val freqCategory = countd.withColumn("max", max($"count").over(Window.partitionBy($"c1"))).filter($"count" === $"max").drop("count").drop("max")
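A quick sanity check (a usage sketch, assuming the df defined above; row order in the output may vary):

freqCategory.show()
// +---+---+
// | c1| c2|
// +---+---+
// |  A|  1|
// |  B|  4|
// +---+---+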