I want to join the dataframe `df_1` with `df_2` on the column named "TrackID".
df_1:
cluster TrackID
1       a_1
2       a_1
3       a_2
1       a_3

df_2:
TrackID Value
a_1     5
a_1     6
a_2     7
a_2     8
a_3     9
Desired output:
cluster TrackID Value
1       a_1     Vector(5,6)
2       a_1     Vector(5,6)
3       a_2     Vector(7,8)
1       a_3     Vector(9)
I would like the joined output to look like this. Is there a way to do it?
Answer 0 (score: 1)
If you are fine with an ArrayType, you can aggregate the second dataframe by TrackID first, then join it with the first one:
import pyspark.sql.functions as F
df_2.groupBy('TrackID').agg(
    F.collect_list('Value').alias('Value')
).join(df_1, ['TrackID']).show()
+-------+------+-------+
|TrackID| Value|cluster|
+-------+------+-------+
| a_1|[5, 6]| 1|
| a_1|[5, 6]| 2|
| a_2|[7, 8]| 3|
| a_3| [9]| 1|
+-------+------+-------+
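To make the aggregate-then-join logic above concrete, here is a minimal pure-Python sketch (no Spark required) of what `collect_list` followed by the join produces for the sample data; the lists of dicts are stand-ins for the two dataframes:

```python
# Stand-ins for df_1 and df_2 from the question.
df_1 = [{"cluster": 1, "TrackID": "a_1"},
        {"cluster": 2, "TrackID": "a_1"},
        {"cluster": 3, "TrackID": "a_2"},
        {"cluster": 1, "TrackID": "a_3"}]
df_2 = [{"TrackID": "a_1", "Value": 5},
        {"TrackID": "a_1", "Value": 6},
        {"TrackID": "a_2", "Value": 7},
        {"TrackID": "a_2", "Value": 8},
        {"TrackID": "a_3", "Value": 9}]

# Step 1: groupBy('TrackID') + collect_list('Value'):
# gather all Values for each TrackID into a list.
collected = {}
for row in df_2:
    collected.setdefault(row["TrackID"], []).append(row["Value"])

# Step 2: inner join on TrackID, attaching the collected list
# to every matching row of df_1.
output = [{**row, "Value": collected[row["TrackID"]]}
          for row in df_1 if row["TrackID"] in collected]

for row in output:
    print(row)
```

Note that the duplicated `a_1` rows in df_1 each receive the same collected list, exactly as in the Spark output above.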
Answer 1 (score: 0)
I would just add a UDF to convert the collected list from @Psidom's answer into a vector:
# importing necessary libraries
from pyspark.sql.functions import udf, collect_list, col
from pyspark.ml.linalg import Vectors, VectorUDT

# udf for converting the collected list to a vector
@udf(VectorUDT())
def vectorUdf(x):
    return Vectors.dense(x)

# grouping and aggregating to collect values, applying the above udf
vectorDf_2 = df_2.groupBy('TrackID').agg(vectorUdf(collect_list('Value')).alias('Value'))

# joining the two dataframes
Output = df_1.join(vectorDf_2, ['TrackID'])
This should give you:
+-------+-------+---------+
|TrackID|cluster|Value |
+-------+-------+---------+
|a_1 |1 |[5.0,6.0]|
|a_1 |2 |[5.0,6.0]|
|a_2 |3 |[7.0,8.0]|
|a_3 |1 |[9.0] |
+-------+-------+---------+
root
|-- TrackID: string (nullable = true)
|-- cluster: long (nullable = true)
|-- Value: vector (nullable = true)
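As a side note, on Spark 3.1+ the hand-written UDF can be replaced by the built-in `pyspark.ml.functions.array_to_vector`. The UDF itself does very little: `Vectors.dense` simply turns each collected list into a dense vector of floats, which the following pure-Python sketch imitates (no Spark needed; `to_dense` is a hypothetical helper, not a Spark API):

```python
def to_dense(values):
    """Mimic Vectors.dense on a collected list: cast every value to float."""
    return [float(v) for v in values]

# Collected lists per TrackID, as produced by the groupBy/collect_list step.
collected = {"a_1": [5, 6], "a_2": [7, 8], "a_3": [9]}

# Converting each list, mirroring what vectorUdf does row by row.
vectors = {track: to_dense(vals) for track, vals in collected.items()}
print(vectors)
```

This is why the joined Value column shows `[5.0,6.0]` rather than the original integers: the vector type is float-valued.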
I hope this answer helps.