Question

I have the following sample dataset:

groupby previous    current
A       1           1
A       0           1
A       0           0
A       1           0
A       1           1
A       0           1

I want to create the following table by summing the "previous" and "current" columns.

previous_total   current_total
3                4

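I have tried using .agg with every groupby combination I could think of to get the table above, but I couldn't get anything to run successfully.

I know how to do this in Python pandas, but not in PySpark.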

Answer 1

Use the groupBy and sum methods:

>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(col("sum(previous)").alias("previous_total"), col("sum(current)").alias("current_total")).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+

Alternatively, you can register the DataFrame as a temporary table and query it with Spark SQL, which gives the same result.

Answer 2

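You can use select and sum:

# Note: this import shadows Python's built-in sum in the current scope.
from pyspark.sql.functions import sum

# Aggregate over the whole DataFrame; no groupBy is needed for overall totals.
df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))

df_result.show()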

+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+

PySpark Pandas: Groupby an identifier column and sum two different columns to create a new 2x2 table
