Add a column containing the row count grouped by columns of a df

Asked: 2019-11-25 11:27:45

Tags: scala dataframe apache-spark group-by

How can I add a column holding the row count per group (as from a GROUP BY clause) to a DF?

+------------+-------+
|  Category  |  txn  | 
+------------+-------+
|  Cat1      |   A   |  
|  Cat1      |   A   |
|  Cat1      |   B   |
+------------+-------+

Desired output:

+------------+-------+-----+
|  Category  |  txn  |  n  |
+------------+-------+-----+
|  Cat1      |   A   |  2  |
|  Cat1      |   A   |  2  |   
|  Cat1      |   B   |  1  |
+------------+-------+-----+

I tried the following:

 df.withColumn("n", df.groupBy("Category", "txn").count())

It returns:

 type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.sql.Column

Then I tried:

df.withColumn("n", df.groupBy("Category", "txn").agg(count()))

It returns:

 error: overloaded method value count with alternatives:
  (columnName: String)org.apache.spark.sql.TypedColumn[Any,Long] <and>
  (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
 cannot be applied to ()
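
Both errors have the same cause: withColumn expects a Column expression, whereas df.groupBy(...).count() returns a whole DataFrame, and the count function in org.apache.spark.sql.functions has no zero-argument overload. As a minimal sketch, the aggregation itself would compile like this, but it yields one row per group rather than one per original row (the answers below show how to keep all rows):

import org.apache.spark.sql.functions.count

// one row per (Category, txn) group; does NOT preserve the original rows
val grouped = df.groupBy("Category", "txn").agg(count("*").as("n"))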

3 Answers:

Answer 0 (score: 2):

Just compute the counts and join them back:

import org.apache.spark.sql.functions._   // for col and count

val df = Seq(("C1","A"),("C1","A"),("C1","B")).toDF("Category", "Txn")

val countDf = df.groupBy(col("Category"), col("Txn")).count
countDf.show
+--------+---+-----+
|Category|Txn|count|
+--------+---+-----+
|      C1|  A|    2|
|      C1|  B|    1|
+--------+---+-----+

df.join(countDf, Seq("Category", "Txn"))
  .withColumnRenamed("count", "n")   
  .show
+--------+---+---+
|Category|Txn|  n|
+--------+---+---+
|      C1|  A|  2|
|      C1|  A|  2|
|      C1|  B|  1|
+--------+---+---+
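
A variant of the same approach, aliasing the count during aggregation so the withColumnRenamed step is not needed (a sketch, using the same df and imports as above):

val withN = df.join(
  df.groupBy("Category", "Txn").agg(count("*").as("n")),
  Seq("Category", "Txn"))
withN.show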

Hope this helps.

Answer 1 (score: 2):

scala> df.show
+--------+---+
|Category|txn|
+--------+---+
|    Cat1|  A|
|    Cat1|  A|
|    Cat1|  B|
+--------+---+

scala> import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions.Window

scala> val w = Window.partitionBy("Category","txn").orderBy(col("txn"))

scala> df.withColumn("n", dense_rank.over(w))
         .withColumn("n", sum(col("n")).over(w))
         .show
+--------+---+---+
|Category|txn|  n|
+--------+---+---+
|    Cat1|  B|  1|
|    Cat1|  A|  2|
|    Cat1|  A|  2|
+--------+---+---+
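
This works because dense_rank is always 1 within a (Category, txn) partition, and sum over an ordered window defaults to a RANGE frame in which tied rows share the same running total, so every row receives the full partition count. The same result can be sketched more directly with count over an unordered window (same imports as above):

scala> val w2 = Window.partitionBy("Category", "txn")

scala> df.withColumn("n", count("*").over(w2)).show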

Answer 2 (score: 1):

I think the simplest way to achieve what you want is the count function over a window partitioned by txn. There is no need for groupBy, since you want to keep every row of the DataFrame. Don't order the window either: ordering is useless here and would only slow the computation down.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._   // for count; spark.implicits._ (in scope in spark-shell) enables '* and toDF

val df = Seq(("C1","A"),("C1","A"),("C1","B")).toDF("Category", "Txn")
val w = Window.partitionBy("txn")
df.withColumn("n", count('*) over w).show()

which yields:

+--------+---+---+
|Category|Txn|  n|
+--------+---+---+
|      C1|  B|  1|
|      C1|  A|  2|
|      C1|  A|  2|
+--------+---+---+
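
Note that the window above is partitioned by txn alone, which matches the sample only because every row shares the same Category. If the count must follow the question's grouping by both columns, partition the window by both (a sketch under the same imports):

val wBoth = Window.partitionBy("Category", "Txn")
df.withColumn("n", count("*") over wBoth).show()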