Group and find count, then pivot

Date: 2018-10-12 12:56:23

Tags: scala apache-spark databricks

I have a dataframe like this:

A   B   C       D
foo one small   1
foo one large   2
foo one large   2
foo two small   3

I need to groupBy on A and B, pivot on column C, and sum column D.

I can do that with:

df.groupBy("A", "B").pivot("C").sum("D") 

However, I also need a count after the groupBy. If I try something like:

df.groupBy("A", "B").pivot("C").agg(sum("D"), count("D"))

I get output like:

A   B   large   small   large_count   small_count

Is there a way to get a single count column after the groupBy, instead of one count column per pivot value?
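To make the target shape concrete, here is a plain-Scala sketch (no Spark) of the transformation being asked for — the `large`/`small` columns and the single per-group count are illustrative, and the tuple layout is an assumption, not Spark's actual output schema:

```scala
// One row per input record: columns A, B, C, D.
case class Row(a: String, b: String, c: String, d: Int)

val rows = List(
  Row("foo", "one", "small", 1),
  Row("foo", "one", "large", 2),
  Row("foo", "one", "large", 2),
  Row("foo", "two", "small", 3))

// Group by (A, B); within each group, pivot on C summing D,
// and keep a single count of rows per group.
val result = rows.groupBy(r => (r.a, r.b)).map { case ((a, b), grp) =>
  val sums = grp.groupBy(_.c).map { case (c, g) => c -> g.map(_.d).sum }
  (a, b, sums.getOrElse("large", 0), sums.getOrElse("small", 0), grp.size)
}
```

Each output tuple is (A, B, large, small, count) — exactly one count per (A, B) group rather than one per pivot value.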

3 Answers:

Answer 0 (score: 0)

Try this on the output:

output.withColumn("count", $"large_count" + $"small_count").show

You can then drop the two count columns if you don't need them.

Alternatively, do the count before the pivot: df.groupBy("A", "B").agg(count("C"))
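As a sanity check on the first suggestion — that adding the two per-pivot counts recovers one count per group — here is a plain-Scala sketch (no Spark); the names `large`/`small` mirror the pivot values and the tuple layout is illustrative:

```scala
// Rows as (A, B, C) triples; D is irrelevant for counting.
val data = List(
  ("foo", "one", "small"), ("foo", "one", "large"),
  ("foo", "one", "large"), ("foo", "two", "small"))

// Per (A, B) group: count rows per pivot value, then add them up.
val perGroup = data.groupBy { case (a, b, _) => (a, b) }.map { case (k, g) =>
  val large = g.count(_._3 == "large")
  val small = g.count(_._3 == "small")
  k -> (large, small, large + small) // (large_count, small_count, count)
}
```

The third element equals the group size, which is what the suggested `$"large_count" + $"small_count"` column computes after the pivot.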

Answer 1 (score: 0)

Is this what you are expecting?

val df = Seq(("foo", "one", "small",   1),
("foo", "one", "large",   2),
("foo", "one", "large",   2),
("foo", "two", "small",   3)).toDF("A","B","C","D")

scala> df.show
+---+---+-----+---+
|  A|  B|    C|  D|
+---+---+-----+---+
|foo|one|small|  1|
|foo|one|large|  2|
|foo|one|large|  2|
|foo|two|small|  3|
+---+---+-----+---+

scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]

scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]

scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
|  A|  B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one|   5|    4|    1|
|foo|two|   3| null|    3|
+---+---+----+-----+-----+



Answer 2 (score: 0)

This doesn't need a join. Is this what you are looking for?

val df = Seq(("foo", "one", "small",   1),
("foo", "one", "large",   2),
("foo", "one", "large",   2),
("foo", "two", "small",   3)).toDF("A","B","C","D")

scala> df.show
+---+---+-----+---+
|  A|  B|    C|  D|
+---+---+-----+---+
|foo|one|small|  1|
|foo|one|large|  2|
|foo|one|large|  2|
|foo|two|small|  3|
+---+---+-----+---+

df.createOrReplaceTempView("dummy")

spark.sql("""
  SELECT * FROM (
    SELECT A, B, C, sum(D) AS D
    FROM dummy
    GROUP BY A, B, C GROUPING SETS ((A, B, C), (A, B))
    ORDER BY A NULLS LAST, B NULLS LAST, C NULLS LAST
  ) dummy
  PIVOT (first(D) FOR C IN ('large' large, 'small' small, null total))
""").show

+---+---+-----+-----+-----+
|  A|  B|large|small|total|
+---+---+-----+-----+-----+
|foo|one|    4|    1|    5|
|foo|two| null|    3|    3|
+---+---+-----+-----+-----+
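The GROUPING SETS step that feeds the pivot above can be sketched in plain Scala (no Spark): it emits one summed row per (A, B, C) plus a subtotal row per (A, B) whose C is null, which the pivot then maps to the `total` column. The `Option[String]` key standing in for a nullable C is an illustrative choice:

```scala
val rows = List(
  ("foo", "one", "small", 1), ("foo", "one", "large", 2),
  ("foo", "one", "large", 2), ("foo", "two", "small", 3))

// Detail rows: sum(D) grouped by (A, B, C), with C present.
val detail = rows.groupBy { case (a, b, c, _) => (a, b, Option(c)) }
                 .map { case (k, g) => k -> g.map(_._4).sum }

// Subtotal rows: sum(D) grouped by (A, B) only, with C = null (None here).
val subtotals = rows.groupBy { case (a, b, _, _) => (a, b, (None: Option[String])) }
                    .map { case (k, g) => k -> g.map(_._4).sum }

// Union of both grouping sets, as GROUPING SETS ((A,B,C), (A,B)) produces.
val groupingSets = detail ++ subtotals
```

Pivoting C over 'large', 'small', and null then yields the large/small/total columns shown in the output.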