Joining multiple DataFrames in PySpark

Asked: 2019-06-12 06:50:28

Tags: python apache-spark pyspark apache-spark-sql

I have several DataFrames like the ones below. Each has the same two columns and exactly the same number of rows. How do I combine them so that I end up with a single DataFrame that has those two columns and all of the rows from every DataFrame?

For example:

DataFrame-1

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
+--------------+-------------+

DataFrame-2

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_1_URI |  1          |
| sample_1_URI |  1          |
+--------------+-------------+

DataFrame-3

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_2_URI |  2          |
| sample_2_URI |  2          |
+--------------+-------------+

DataFrame-4

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

...

The result I want from combining them is:

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
| sample_1_URI |  1          |
| sample_1_URI |  1          |
| sample_2_URI |  2          |
| sample_2_URI |  2          |
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

Now, if I want to one-hot encode the label column, it should look something like this:

from pyspark.ml.feature import OneHotEncoder

oe = OneHotEncoder(inputCol="label", outputCol="one_hot_label")
df = oe.transform(df)  # df is the combined DataFrame with the <colS, label> columns
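A note on the encoder: in Spark 3.x, OneHotEncoder became an Estimator, so it must be fitted before it can transform; the snippet above matches the older 2.x Transformer API. A minimal sketch of the 3.x usage, assuming the combined DataFrame is named df as above:

from pyspark.ml.feature import OneHotEncoder

# Spark 3.x: OneHotEncoder is an Estimator, so fit() before transform()
oe = OneHotEncoder(inputCols=["label"], outputCols=["one_hot_label"])
df = oe.fit(df).transform(df)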

1 Answer:

Answer 0 (score: 0)

What you are looking for is union.

In a case like this, what I would do is put the DataFrames in a list and use reduce:

from functools import reduce

# All the frames to stack, in the desired row order
dataframes = [df_1, df_2, df_3, df_4]

# union appends the rows of each frame after the previous one,
# matching columns by position
result = reduce(lambda first, second: first.union(second), dataframes)
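For completeness, here is a self-contained sketch; the SparkSession setup and the sample rows are assumptions that just mirror the question's tables:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames mirroring the question's layout
df_1 = spark.createDataFrame([("sample_0_URI", 0), ("sample_0_URI", 0)], ["colS", "label"])
df_2 = spark.createDataFrame([("sample_1_URI", 1), ("sample_1_URI", 1)], ["colS", "label"])
df_3 = spark.createDataFrame([("sample_2_URI", 2), ("sample_2_URI", 2)], ["col1", "label"])
df_4 = spark.createDataFrame([("sample_3_URI", 3), ("sample_3_URI", 3)], ["col1", "label"])

result = reduce(lambda first, second: first.union(second), [df_1, df_2, df_3, df_4])
result.show()

Because union resolves columns by position rather than by name, the colS/col1 mismatch between the frames is harmless; the result simply inherits the first frame's column names, so a final result.withColumnRenamed("colS", "col1") would reproduce the expected header exactly.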