PySpark: join 2 dataframes but take only the new records from the 2nd dataframe (keep history)

Asked: 2019-12-30 10:51:55

Tags: join pyspark pyspark-sql pyspark-dataframes

I have two dataframes, df1 and df2. I want a result dataframe that:

1. Keeps all records from df1.
2. Takes only the new records from df2 (records not already present in df1).
3. Is produced as a new dataframe from this logic.

Note: the primary key is "id". I only want to check the id, not the full row. If an id is not present in df1, only then take that record from df2.

df1

    +------+-------------+-----+
    |  id  |time         |other|
    +------+-------------+-----+
    |   111|  29-12-2019 |   p1|
    |   222|  29-12-2019 |   p2|
    |   333|  29-12-2019 |   p3|
    +------+-------------+-----+

df2

    +------+-------------+-----+
    |  id  |time         |other|
    +------+-------------+-----+
    |   111|  30-12-2019 |   p7|
    |   222|  30-12-2019 |   p8|
    |   444|  30-12-2019 |   p0|
    +------+-------------+-----+

Result

+------+-------------+-----+
|  id  |time         |other|
+------+-------------+-----+
|   111|  29-12-2019 |   p1|
|   222|  29-12-2019 |   p2|
|   333|  29-12-2019 |   p3|
|   444|  30-12-2019 |   p0|
+------+-------------+-----+

Can you help me do this in PySpark? I plan to use a join.
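To make the requested logic concrete before looking at Spark code, here is a plain-Python sketch (illustrative only, not Spark) using the sample rows from the tables above; a set of df1's ids plays the role of the primary-key check:

```python
# Keep every df1 row, then append only the df2 rows whose id is new.
df1_rows = [(111, '29-12-2019', 'p1'), (222, '29-12-2019', 'p2'), (333, '29-12-2019', 'p3')]
df2_rows = [(111, '30-12-2019', 'p7'), (222, '30-12-2019', 'p8'), (444, '30-12-2019', 'p0')]

known_ids = {row[0] for row in df1_rows}  # ids already present in df1
result = df1_rows + [r for r in df2_rows if r[0] not in known_ids]
# result matches the expected table: ids 111, 222, 333 from df1 plus 444 from df2
```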

2 Answers:

Answer 0 (Score: 0)

df1=spark.createDataFrame([(111,'29-12-2019','p1'),(222,'29-12-2019','p2'),(333,'29-12-2019','p3')],['id','time','other'])
df2=spark.createDataFrame([(111,'30-12-2019','p7'),(222,'30-12-2019','p8'),(444,'30-12-2019','p0')],['id','time','other'])

# Collect the ids already present in df1 into a Python list
mvv1 = df1.select("id").rdd.flatMap(lambda x: x).collect()
print(mvv1)

[111, 222, 333]

# Build a comma-separated id string and filter df2 with a SQL NOT IN clause
yy = ",".join([str(x) for x in mvv1])
df2.registerTempTable("temp_df2")  # use createOrReplaceTempView in Spark 2.0+
sqlDF2 = sqlContext.sql("select * from temp_df2 where id not in ("+yy+")")
sqlDF2.show()

+---+----------+-----+
| id|      time|other|
+---+----------+-----+
|444|30-12-2019|   p0|
+---+----------+-----+

df1.union(sqlDF2).show()

+---+----------+-----+
| id|      time|other|
+---+----------+-----+
|111|29-12-2019|   p1|
|222|29-12-2019|   p2|
|333|29-12-2019|   p3|
|444|30-12-2019|   p0|
+---+----------+-----+
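For reference, the `yy` string built above is simply the collected ids joined with commas, which is what the SQL `NOT IN (...)` clause receives. Note that collecting all ids to the driver can become expensive when df1 is very large:

```python
# Reconstructing the filter string from the collected ids
mvv1 = [111, 222, 333]  # the ids collected from df1 in the code above
yy = ",".join(str(x) for x in mvv1)
# The resulting query is:
#   select * from temp_df2 where id not in (111,222,333)
```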

Answer 1 (Score: 0)

In the end I wrote the code below; it seems to work fine on 12,000,000 rows and finished in just 5 minutes. Hope it helps someone else:

df1=spark.createDataFrame([(111,'29-12-2019','p1'),(222,'29-12-2019','p2'),(333,'29-12-2019','p3')],['id','time','other'])
df2=spark.createDataFrame([(111,'30-12-2019','p7'),(222,'30-12-2019','p8'),(444,'30-12-2019','p0')],['id','time','other'])

# This gives all records from df2 that are not present in df1 (matched on id)
new_input_df = df2.join(df1, on=['id'], how='left_anti')

# Now union df1 (historic records) with new_input_df, which contains only the new records
final_df = df1.union(new_input_df)

final_df.show()
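The left_anti join keeps only the rows of the left dataframe (df2 here) whose join key has no match on the right. The two steps above can be generalized into a small plain-Python helper (a sketch for sanity-checking the logic, assuming `id` is the first column; the function name is ours, not Spark's):

```python
def merge_keep_history(historic, incoming, key_index=0):
    """Keep all historic rows; append incoming rows whose key is unseen."""
    seen = {row[key_index] for row in historic}  # keys already present
    return historic + [r for r in incoming if r[key_index] not in seen]

df1_rows = [(111, '29-12-2019', 'p1'), (222, '29-12-2019', 'p2'), (333, '29-12-2019', 'p3')]
df2_rows = [(111, '30-12-2019', 'p7'), (222, '30-12-2019', 'p8'), (444, '30-12-2019', 'p0')]
final_rows = merge_keep_history(df1_rows, df2_rows)
```

One caveat on the Spark version: `union` resolves columns by position, not by name, so df1 and new_input_df must share the same column order.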