Summing multiple columns in Spark

Time: 2017-06-12 14:35:17

Tags: apache-spark pyspark sparkr

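How do I sum multiple columns in Spark? For example, in SparkR the following code returns the sum of one column, but if I try to get the sum of two columns of df, I get an error.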

# Create SparkDataFrame
df <- createDataFrame(faithful)

# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works

# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work

Either SparkR or PySpark code is fine.

4 answers:

Answer 0 (score: 7)

For PySpark, if you don't want to type out the columns explicitly:

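from functools import reduce
from operator import add
from pyspark.sql import functions as F

# Row-wise total: chain the columns named in numeric_col_list (a list of column names) with +
new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))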
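
To see what reduce(add, ...) builds, here is a minimal plain-Python sketch that needs no Spark session. The Expr class is a hypothetical stand-in for a Spark Column (it is not part of PySpark) and only records the expression text. The column names are made up for illustration.

```python
from functools import reduce
from operator import add


class Expr:
    """Hypothetical stand-in for a Spark Column; it only records expression text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Mimic how adding two columns yields a new combined expression
        return Expr(f"({self.text} + {other.text})")


# Assumed example column names, not taken from the original post
numeric_col_list = ["col2", "col3", "col4"]

# reduce folds the list left to right: add(add(col2, col3), col4)
total = reduce(add, [Expr(name) for name in numeric_col_list])
print(total.text)  # ((col2 + col3) + col4)
```

With real columns, the same fold produces a single Column expression, which is why it can be passed straight to withColumn.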

Answer 1 (score: 3)

org.apache.spark.sql.functions.sum(Column e)
  

Aggregate function: returns the sum of all values in the expression.

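As you can see, sum takes only one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to add up numeric fields, you can pass a single column expression instead: sum(df("waiting") + df("eruptions")) in Scala, or sum(df$waiting + df$eruptions) in SparkR. If you instead want a separate total for each column, you can use df.agg(sum(df("waiting")), sum(df("eruptions"))).show in Scala.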
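
The difference between the two forms can be checked with ordinary arithmetic. The following is a plain-Python sketch, not Spark code. The lists stand in for two numeric columns, and the values are the same made-up numbers used in Answer 2's example.

```python
# Two lists standing in for numeric columns (illustrative values only)
col2 = [1, 2, 3, 4]
col3 = [10, 20, 30, 40]

# Analogue of sum(col2 + col3): add the columns row by row, then total the result
row_sums = [a + b for a, b in zip(col2, col3)]
combined_total = sum(row_sums)

# Analogue of agg(sum(col2), sum(col3)): one total per column
per_column_totals = (sum(col2), sum(col3))

print(combined_total)     # 110
print(per_column_totals)  # (10, 100)
```

The combined total equals the sum of the per-column totals, so which form to use depends only on whether you need one number or one number per column.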

Answer 2 (score: 2)

In PySpark you can do the following:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a",1,10), ("b",2,20), ("c",3,30), ("d",4,40)], ["col1", "col2", "col3"])
>>> df.groupBy("col1").agg(F.sum(df.col2+df.col3)).show()
+----+------------------+
|col1|sum((col2 + col3))|
+----+------------------+
|   d|                44|
|   c|                33|
|   b|                22|
|   a|                11|
+----+------------------+

Answer 3 (score: 1)

SparkR code:

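library(SparkR)
# Spark 1.x signature; on Spark 2.x and later use createDataFrame(faithful)
df <- createDataFrame(sqlContext, faithful)
# Aggregate each column separately and keep both results in a list
w <- list(agg(df, sum(df$waiting)), agg(df, sum(df$eruptions)))
head(w[[1]])
head(w[[2]])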