PySpark: how do I merge DataFrames like pandas?

Asked: 2018-06-29 07:26:42

Tags: python pandas apache-spark pyspark

For example, as in https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html:

>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
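For reference, the pandas example above can be reproduced with a short runnable snippet (a sketch rebuilding the question's two frames from scratch):

```python
import pandas as pd

# The two frames from the question
A = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 4]})
B = pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'], 'value': [5, 6, 7, 8]})

# Outer merge: unmatched keys from either side survive as rows with NaN
merged = A.merge(B, left_on='lkey', right_on='rkey', how='outer')
print(merged)
```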

I would like to know:

  1. How do I do this in PySpark?
  2. Going one step further, how can I merge lkey and rkey into a single column, filling in the missing values from both sides?

1 answer:

Answer 0 (score: 2)


How do I do this in PySpark?

What you are looking for is a join:

A.join(other=B, on=(A['lkey'] == B['rkey']), how='outer')\
    .select(A['lkey'], A['value'].alias('value_x'), B['rkey'], B['value'].alias('value_y'))\
    .show(truncate=False)

should give you:

+----+-------+----+-------+
|lkey|value_x|rkey|value_y|
+----+-------+----+-------+
|bar |2      |bar |6      |
|bar |2      |bar |8      |
|null|null   |qux |7      |
|foo |1      |foo |5      |
|foo |4      |foo |5      |
|baz |3      |null|null   |
+----+-------+----+-------+

Going one step further, how can I merge lkey and rkey into a single column, filling in the missing values from both sides?

You can rename the columns so both sides share the key name, and then join, as in:

from pyspark.sql.functions import col
A.select(col('lkey').alias('key'), col('value').alias('value_x'))\
    .join(other=B.select(col('rkey').alias('key'), col('value').alias('value_y')), on=['key'], how='outer')\
    .show(truncate=False)

should give you:

+---+-------+-------+
|key|value_x|value_y|
+---+-------+-------+
|bar|2      |6      |
|bar|2      |8      |
|qux|null   |7      |
|foo|1      |5      |
|foo|4      |5      |
|baz|3      |null   |
+---+-------+-------+

I hope this answer helps.