How do I add a column containing the per-group row count to a DataFrame using a groupBy clause?
+----------+-----+
| Category | txn |
+----------+-----+
| Cat1     | A   |
| Cat1     | A   |
| Cat1     | B   |
+----------+-----+
Desired output:
+----------+-----+---+
| Category | txn | n |
+----------+-----+---+
| Cat1     | A   | 2 |
| Cat1     | A   | 2 |
| Cat1     | B   | 1 |
+----------+-----+---+
I tried the following:
df.withColumn("n", df.groupBy("Category", "txn").count())
It returns:
type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Column
Then I tried:
df.withColumn("n", df.groupBy("Category", "txn").agg(count()))
It returns:
error: overloaded method value count with alternatives:
(columnName: String)org.apache.spark.sql.TypedColumn[Any,Long] <and>
(e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
cannot be applied to ()
Answer 0: (score: 2)
Just count the groups, then join the counts back onto the original DataFrame:
val df = Seq(("C1","A"),("C1","A"),("C1","B")).toDF("Category", "Txn")
val countDf = df.groupBy(col("Category"), col("Txn")).count
countDf.show
+--------+---+-----+
|Category|Txn|count|
+--------+---+-----+
|      C1|  A|    2|
|      C1|  B|    1|
+--------+---+-----+
df.join(countDf, Seq("Category", "Txn"))
.withColumnRenamed("count", "n")
.show
+--------+---+---+
|Category|Txn|  n|
+--------+---+---+
|      C1|  A|  2|
|      C1|  A|  2|
|      C1|  B|  1|
+--------+---+---+
Hope that helps.
Answer 1: (score: 2)
scala> df.show
+--------+---+
|Category|txn|
+--------+---+
|    Cat1|  A|
|    Cat1|  A|
|    Cat1|  B|
+--------+---+
scala> import org.apache.spark.sql.expressions.Window
scala> val w = Window.partitionBy("Category","txn").orderBy(col("txn"))
scala> df.withColumn("n", dense_rank.over(w))
.withColumn("n", sum(col("n")).over(w))
.show
+--------+---+---+
|Category|txn|  n|
+--------+---+---+
|    Cat1|  B|  1|
|    Cat1|  A|  2|
|    Cat1|  A|  2|
+--------+---+---+
Answer 2: (score: 1)
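I think the simplest way to get what you want is to apply the count function over a window partitioned by Category and txn. There is no need for groupBy, because you want to keep every row of the DataFrame. Also, don't order the window: ordering is of no use here and only slows the job down.
import org.apache.spark.sql.expressions.Window
val df = Seq(("C1","A"),("C1","A"),("C1","B")).toDF("Category", "Txn")
// Partition by both grouping columns so the count is per (Category, Txn) pair.
val w = Window.partitionBy("Category", "Txn")
df.withColumn("n", count('*) over w).show()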
which yields:
+--------+---+---+
|Category|Txn|  n|
+--------+---+---+
|      C1|  B|  1|
|      C1|  A|  2|
|      C1|  A|  2|
+--------+---+---+
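For anyone who wants to check that the window approach and the count-and-join approach from the answers agree, here is a minimal self-contained sketch. It assumes a local SparkSession (inside spark-shell, `spark` already exists and the builder line can be dropped). The extra `("C2", "A")` row and the `sortedRows` helper are not from the thread; they are added purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-count-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data from the thread plus one hypothetical row in a second category.
val df = Seq(("C1", "A"), ("C1", "A"), ("C1", "B"), ("C2", "A"))
  .toDF("Category", "Txn")

// Window approach: count rows per (Category, Txn) while keeping every row.
val w = Window.partitionBy("Category", "Txn")
val viaWindow = df.withColumn("n", count(lit(1)).over(w))

// Join approach: aggregate first, then attach the counts back to each row.
val counts = df.groupBy("Category", "Txn").agg(count(lit(1)).as("n"))
val viaJoin = df.join(counts, Seq("Category", "Txn"))

// Hypothetical helper: collect and sort so the comparison ignores row order.
def sortedRows(d: DataFrame): Seq[(String, String, Long)] =
  d.select("Category", "Txn", "n").as[(String, String, Long)].collect().toSeq.sorted

val windowRows = sortedRows(viaWindow)
val joinRows = sortedRows(viaJoin)
```

With this data both `windowRows` and `joinRows` contain the two `("C1", "A", 2)` rows, `("C1", "B", 1)` and `("C2", "A", 1)`. A window partitioned by Txn alone would instead count the C2 row together with the two C1/A rows, which is why both grouping columns appear in `partitionBy`.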