I'm new to Spark and need help with the following problem. I have data like this:
Country value
India [1,2,3,4,5]
US [8,9,10,11,12]
US [7,6,5,4,3]
India [8,7,6,5,4]
The desired output is the element-wise sum of the vectors for rows with the same country:
Output:
Country value
India [9,9,9,9,9]
US [15,15,15,15,15]
Answer (score: 0)
AFAIK, Spark does not provide an aggregate function for arrays. So if the arrays have a fixed size, you can create one column per array element, aggregate those columns, and then rebuild the array.
In general, that looks like this:
from pyspark.sql.functions import array, col, sum
# First, get the (fixed) size of the array from the first row
size = len(df.first()['value'])
# Then sum each element position separately, one column per index:
aggregation = df.groupBy("country")\
.agg(*[sum(df.value.getItem(i)).alias("v"+str(i)) for i in range(size)])
aggregation.show()
+-------+---+---+---+---+---+
|country| v0| v1| v2| v3| v4|
+-------+---+---+---+---+---+
| India| 9| 9| 9| 9| 9|
| US| 15| 15| 15| 15| 15|
+-------+---+---+---+---+---+
# Finally, recreate the array from the per-index sums
result = aggregation.select(col("country"),\
    array(*[col("v"+str(i)) for i in range(size)]).alias("value"))
result.show()
+-------+--------------------+
|country| value|
+-------+--------------------+
| India| [9, 9, 9, 9, 9]|
| US|[15, 15, 15, 15, 15]|
+-------+--------------------+
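The core trick above, summing position by position and then reassembling the vector, can be checked without Spark. The sketch below is plain Python (the function name `sum_vectors_by_key` is made up for illustration, not a Spark API), just to verify the arithmetic on the sample data:

```python
from collections import defaultdict

def sum_vectors_by_key(rows):
    """Element-wise sum of fixed-size vectors, grouped by key."""
    totals = defaultdict(lambda: None)
    for key, vec in rows:
        if totals[key] is None:
            totals[key] = list(vec)          # first vector for this key
        else:
            # add position by position, like the per-index columns above
            totals[key] = [a + b for a, b in zip(totals[key], vec)]
    return dict(totals)

rows = [
    ("India", [1, 2, 3, 4, 5]),
    ("US",    [8, 9, 10, 11, 12]),
    ("US",    [7, 6, 5, 4, 3]),
    ("India", [8, 7, 6, 5, 4]),
]
print(sum_vectors_by_key(rows))
# {'India': [9, 9, 9, 9, 9], 'US': [15, 15, 15, 15, 15]}
```

Note this assumes all vectors under a key have the same length, the same assumption the column-per-index approach makes.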