Below is the structure of my DataFrame. I need to group by Id, Country, and State, and sum Vector_1 and Vector_2 separately (element-wise). The constraint is that I have to use PySpark 2.3.
Id  Country  State  Vector_1               Vector_2
1   US       IL     [1.0,2.0,3.0,4.0,5.0]  [5.0,5.0,5.0,5.0,5.0]
1   US       IL     [5.0,3.0,3.0,2.0,1.0]  [5.0,5.0,5.0,5.0,5.0]
2   US       TX     [6.0,7.0,8.0,9.0,1.0]  [1.0,1.0,1.0,1.0,1.0]
The output should look like this:
Id  Country  State  Vector_1               Vector_2
1   US       IL     [6.0,5.0,6.0,6.0,6.0]  [10.0,10.0,10.0,10.0,10.0]
2   US       TX     [6.0,7.0,8.0,9.0,1.0]  [1.0,1.0,1.0,1.0,1.0]
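For Id 1, Vector_1 is summed element-wise: [1.0+5.0, 2.0+3.0, 3.0+3.0, 4.0+2.0, 5.0+1.0] = [6.0,5.0,6.0,6.0,6.0], and Vector_2 likewise.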
Here is the code I have. However, I am looking for an alternative approach that uses a lambda. Please advise.
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd
import numpy as np

# Result schema: the grouping keys plus the element-wise sums of the two vector columns.
schema = StructType([
    StructField("Id", LongType()),
    StructField("Country", StringType()),
    StructField("State", StringType()),
    StructField("Vector_1", ArrayType(DoubleType())),
    StructField("Vector_2", ArrayType(DoubleType()))
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    # Every row in df belongs to the same (Id, Country, State) group,
    # so the key values can be taken from the first row.
    gr0 = df['Id'].iloc[0]
    gr1 = df['Country'].iloc[0]
    gr2 = df['State'].iloc[0]
    # np.sum over a column of arrays adds the vectors element-wise.
    a = np.sum(df.Vector_1)
    b = np.sum(df.Vector_2)
    return pd.DataFrame([[gr0, gr1, gr2, a, b]])

df.groupby("Id", "Country", "State").apply(g).show()