I have two dataframes, df1 and df2. I want a result dataframe that: 1. contains all records from df1; 2. takes from df2 only the new records (records not present in df1); 3. is a new dataframe built from this logic.
Note: the primary key is "id". I only want to compare the id, not the full row; a record should come from df2 only when its id does not exist in df1.
df1
+---+----------+-----+
| id|      time|other|
+---+----------+-----+
|111|29-12-2019|   p1|
|222|29-12-2019|   p2|
|333|29-12-2019|   p3|
+---+----------+-----+
df2
+---+----------+-----+
| id|      time|other|
+---+----------+-----+
|111|30-12-2019|   p7|
|222|30-12-2019|   p8|
|444|30-12-2019|   p0|
+---+----------+-----+
Result
+---+----------+-----+
| id|      time|other|
+---+----------+-----+
|111|29-12-2019|   p1|
|222|29-12-2019|   p2|
|333|29-12-2019|   p3|
|444|30-12-2019|   p0|
+---+----------+-----+
Can you help me do this in PySpark? I plan to use a join.
Answer 0 (score: 0)
df1 = spark.createDataFrame([(111, '29-12-2019', 'p1'), (222, '29-12-2019', 'p2'), (333, '29-12-2019', 'p3')], ['id', 'time', 'other'])
df2 = spark.createDataFrame([(111, '30-12-2019', 'p7'), (222, '30-12-2019', 'p8'), (444, '30-12-2019', 'p0')], ['id', 'time', 'other'])

# Collect the ids already present in df1 to the driver
mvv1 = df1.select("id").rdd.flatMap(lambda x: x).collect()
print(mvv1)
[111, 222, 333]

# Build a comma-separated id list and keep only the df2 rows whose id is not in it
yy = ",".join([str(x) for x in mvv1])
df2.registerTempTable("temp_df2")
sqlDF2 = sqlContext.sql("select * from temp_df2 where id not in (" + yy + ")")
sqlDF2.show()
+---+----------+-----+
| id| time|other|
+---+----------+-----+
|444|30-12-2019| p0|
+---+----------+-----+
df1.union(sqlDF2).show()
+---+----------+-----+
| id| time|other|
+---+----------+-----+
|111|29-12-2019| p1|
|222|29-12-2019| p2|
|333|29-12-2019| p3|
|444|30-12-2019| p0|
+---+----------+-----+
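As a side note (not part of the original answer), the same filter can be expressed without building a SQL string, using the DataFrame API's isin with negation. A minimal sketch, reusing mvv1 and df2 from above; it still collects df1's ids to the driver, so it scales the same way as the SQL version:

from pyspark.sql import functions as F

# Keep only the df2 rows whose id is absent from df1 (same result as the SQL NOT IN above)
sqlDF2 = df2.filter(~F.col("id").isin(mvv1))
sqlDF2.show()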
Answer 1 (score: 0)
In the end I wrote this code, and it seems to work fine: for 12,000,000 rows it takes only about 5 minutes to complete. I hope it helps someone else:
df1 = spark.createDataFrame([(111, '29-12-2019', 'p1'), (222, '29-12-2019', 'p2'), (333, '29-12-2019', 'p3')], ['id', 'time', 'other'])
df2 = spark.createDataFrame([(111, '30-12-2019', 'p7'), (222, '30-12-2019', 'p8'), (444, '30-12-2019', 'p0')], ['id', 'time', 'other'])

# The left_anti join keeps only the df2 rows whose id does not appear in df1
new_input_df = df2.join(df1, on=['id'], how='left_anti')

# Union df1 (the historic records) with new_input_df, which contains only the new records
final_df = df1.union(new_input_df)
final_df.show()
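One hedged addition, not from the original answer: union matches columns by position, so if df1 and df2 could ever list their columns in a different order, unionByName (available since Spark 2.3) is the safer choice because it matches columns by name:

# Positionally-safe variant of the union step above
final_df = df1.unionByName(new_input_df)
final_df.show()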