For example, as shown at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html:
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
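I would like to know how lkey and rkey can be merged into a single column, so that the missing values on each side are filled in from the other.

Answer 0 (score: 2)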
How can this be done in PySpark?
What you are looking for is join:
A.join(other=B, on=(A['lkey'] == B['rkey']), how='outer')\
.select(A['lkey'], A['value'].alias('value_x'), B['rkey'], B['value'].alias('value_y'))\
.show(truncate=False)
This should give you (row order may differ):
+----+-------+----+-------+
|lkey|value_x|rkey|value_y|
+----+-------+----+-------+
|bar |2      |bar |6      |
|bar |2      |bar |8      |
|null|null   |qux |7      |
|foo |1      |foo |5      |
|foo |4      |foo |5      |
|baz |3      |null|null   |
+----+-------+----+-------+
To go further, how can I merge lkey and rkey into a single column, filling in the missing values from both sides?
You can rename both key columns to a common name (key here) and join on that column:
from pyspark.sql.functions import col
A.select(col('lkey').alias('key'), col('value').alias('value_x'))\
.join(other=B.select(col('rkey').alias('key'), col('value').alias('value_y')), on=['key'], how='outer')\
.show(truncate=False)
This should give you (row order may differ):
+---+-------+-------+
|key|value_x|value_y|
+---+-------+-------+
|bar|2      |6      |
|bar|2      |8      |
|qux|null   |7      |
|foo|1      |5      |
|foo|4      |5      |
|baz|3      |null   |
+---+-------+-------+
I hope this answer is helpful.