Pyspark: joining dataframes where the second dataframe has multiple rows per key

Asked: 2018-07-14 22:15:56

Tags: python apache-spark pyspark apache-spark-sql

I want to join dataframe "df_1" with "df_2" on a column named "TrackID".

 df_1:   cluster    TrackID
           1           a_1
           2           a_1
           3           a_2
           1           a_3

 df_2:   TrackID     Value
           a_1         5
           a_1         6
           a_2         7
           a_2         8
           a_3         9
Output:   
         cluster    TrackID   Value
          1           a_1    Vector(5,6)
          2           a_1    Vector(5,6)
          3           a_2    Vector(7,8)
          1           a_3    Vector(9)

I want the output of the join to look like this. Is there a way to do it?
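For reference, here is a minimal sketch of how the example dataframes could be constructed (assuming an existing SparkSession named spark; the data literals come from the tables above):

# Hypothetical setup code, assuming a SparkSession `spark`
df_1 = spark.createDataFrame(
    [(1, 'a_1'), (2, 'a_1'), (3, 'a_2'), (1, 'a_3')],
    ['cluster', 'TrackID'])
df_2 = spark.createDataFrame(
    [('a_1', 5), ('a_1', 6), ('a_2', 7), ('a_2', 8), ('a_3', 9)],
    ['TrackID', 'Value'])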

2 Answers:

Answer 0 (score: 1)

If an ArrayType is acceptable, you can first aggregate the second dataframe by TrackID and then join it with the first:

import pyspark.sql.functions as F

# collapse df_2 to one row per TrackID, gathering all Values into an array,
# then join the aggregated result back to df_1
df_2.groupBy('TrackID').agg(
    F.collect_list('Value').alias('Value')
).join(df_1, ['TrackID']).show()

+-------+------+-------+
|TrackID| Value|cluster|
+-------+------+-------+
|    a_1|[5, 6]|      1|
|    a_1|[5, 6]|      2|
|    a_2|[7, 8]|      3|
|    a_3|   [9]|      1|
+-------+------+-------+
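Note that collect_list does not guarantee any particular ordering of the collected values. If a deterministic order matters, one option (a sketch, assuming ascending order is wanted) is to wrap the collected array in sort_array:

df_2.groupBy('TrackID').agg(
    F.sort_array(F.collect_list('Value')).alias('Value')  # sort each array ascending
).join(df_1, ['TrackID']).show()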

Answer 1 (score: 0)

I'll just add a udf that converts the collected list into a vector, building on @Psidom's answer:

#importing necessary libraries
from pyspark.sql.functions import udf, collect_list, col
from pyspark.ml.linalg import Vectors, VectorUDT

#udf for changing the collected list to vector
@udf(VectorUDT())
def vectorUdf(x):
    return Vectors.dense(x)

#grouping and aggregation for collecting values and calling the above udf function
vectorDf_2 = df_2.groupBy('TrackID').agg(vectorUdf(collect_list('Value')).alias('Value'))

#joining the two dataframes
Output = df_1.join(vectorDf_2, ['TrackID'])

which should give you:

+-------+-------+---------+
|TrackID|cluster|Value    |
+-------+-------+---------+
|a_1    |1      |[5.0,6.0]|
|a_1    |2      |[5.0,6.0]|
|a_2    |3      |[7.0,8.0]|
|a_3    |1      |[9.0]    |
+-------+-------+---------+

root
 |-- TrackID: string (nullable = true)
 |-- cluster: long (nullable = true)
 |-- Value: vector (nullable = true)
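As a side note, on Spark 3.1+ the built-in pyspark.ml.functions.array_to_vector can replace the custom UDF entirely; a minimal sketch assuming that Spark version:

from pyspark.sql.functions import collect_list
from pyspark.ml.functions import array_to_vector  # available since Spark 3.1

# array_to_vector turns an array column of numeric values into a DenseVector column
vectorDf_2 = df_2.groupBy('TrackID').agg(
    array_to_vector(collect_list('Value')).alias('Value'))
Output = df_1.join(vectorDf_2, ['TrackID'])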

I hope this answer helps.