PySpark: How to group a column into a list when joining two Spark dataframes?

Asked: 2016-10-14 17:59:25

Tags: python apache-spark pyspark

I want to join the following Spark dataframes on Name:

df1 = spark.createDataFrame([("Mark", 68), ("John", 59), ("Mary", 49)], ['Name', 'Weight'])

df2 = spark.createDataFrame([(31, "Mark"), (32, "Mark"), (41, "John"), (42, "John"), (43, "John")], ['Age', 'Name'])

But I want the result to be the following dataframe:

df3 = spark.createDataFrame([([31, 32], "Mark", 68), ([41, 42, 43], "John", 59), (None, "Mary", 49)], ['Age', 'Name', 'Weight'])

2 Answers:

Answer 0 (score: 2)

You can use collect_list from the pyspark.sql.functions module. It collects all the values of a given column associated with a given key. If you want a list containing only unique elements, use collect_set instead (a short sketch follows the result table below).

import pyspark.sql.functions as F

df1 = spark.createDataFrame([("Mark", 68), ("John", 59), ("Mary", 49)], ['Name', 'Weight'])
df2 = spark.createDataFrame([(31, "Mark"), (32, "Mark"), (41, "John"), (42, "John"), (43, "John")], ['Age', 'Name'])

# Group ages into a list per name, then outer-join back onto df1
df2_grouped = df2.groupBy("Name").agg(F.collect_list(F.col("Age")).alias("Age"))
df_joined = df2_grouped.join(df1, "Name", "outer")

df_joined.show()

Result:

+----+------------+------+
|Name|         Age|Weight|
+----+------------+------+
|Mary|        null|    49|
|Mark|    [32, 31]|    68|
|John|[42, 43, 41]|    59|
+----+------------+------+
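
If the Age column could contain duplicates and you only want each value once, here is a minimal sketch using collect_set instead (same df1, df2, and imports as above; the collected list is deduplicated and unordered):

# collect_set keeps only distinct Age values per Name
df2_unique = df2.groupBy("Name").agg(F.collect_set(F.col("Age")).alias("Age"))
df2_unique.join(df1, "Name", "outer").show()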

Answer 1 (score: 0)

A DataFrame is equivalent to a relational table in Spark SQL. You can group, join, and then select.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import *

sc = SparkContext()
sql = SQLContext(sc)

df1 = sql.createDataFrame([("Mark", 68), ("John", 59), ("Mary", 49)], ['Name', 'Weight'])

df2 = sql.createDataFrame([(31, "Mark"), (32, "Mark"), (41, "John"), (42, "John"), (43, "John")], ['Age', 'Name'])

# Group ages into a list per name
grouped = df2.groupBy(['Name']).agg(collect_list("Age").alias('age_list'))

# A left outer join keeps names with no matching ages (age_list becomes None)
joined_df = df1.join(grouped, df1.Name == grouped.Name, 'left_outer')
print(joined_df.select(grouped.age_list, df1.Name, df1.Weight).collect())

Result:

[Row(age_list=None, Name=u'Mary', Weight=49), Row(age_list=[31, 32], Name=u'Mark', Weight=68), Row(age_list=[41, 42, 43], Name=u'John', Weight=59)]
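
Note that collect_list does not guarantee element order after a shuffle, which is why the first answer's output shows [32, 31] rather than [31, 32]. If a deterministic order matters, one option (a sketch reusing the first answer's df1, df2, and F import) is to sort each collected array with sort_array:

# Sort each collected list so the output order is reproducible
df2_sorted = df2.groupBy("Name").agg(F.sort_array(F.collect_list("Age")).alias("Age"))
df2_sorted.join(df1, "Name", "outer").show()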