PySpark - update a dataframe based on a condition by comparing values in different dataframes

Asked: 2020-10-20 09:28:39

Tags: dataframe apache-spark pyspark apache-spark-sql

I have 2 dataframes:

old_df - a fixed dataframe; new_df - changes every day.

For each id, the age in old_df should be updated from the age column in new_df, and whenever there is an update the counter is incremented by 1. If the age has not changed, the counter and age columns keep their existing values (no increment).

old_df (edited to include a 5th id):

id age counter
1   12   0
2   15   0
3   22   0
4   19   0
5   11   0

new_df:

id  age 
1   20   
2   21   
3   22   
4   19 

Now the output of old_df should be:

old_df:

id age counter
1   20   1
2   21   1
3   22   0
4   19   0
5   11   0  

So far, I have tried the following:

# rename old_df's columns so they don't clash with new_df's after the join
df_old = df_old.withColumnRenamed('id', 'id_old') \
    .withColumnRenamed('age', 'age_old')

joinedDF = df_old.join(df_new, df_new["id"] == df_old["id_old"], "outer")

# intended: update age and increment counter only where the age differs
if(joinedDF.select(joinedDF.age_old != joinedDF.age)):
        joinedDF = joinedDF.withColumn("age_old", joinedDF['age'])
        joinedDF = joinedDF.withColumn("counter", joinedDF['counter'] + 1)

joinedDF[['id_old', 'age_old', 'counter']].toPandas()


id_old  age_old  counter
1       20       1
2       21       1
3       22       1
4       19       1

As you can see from the output for id_old = 3 and 4, the counter should stay 0, but it is set to 1. Any help is appreciated.
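
Worth noting why the attempt fails: joinedDF.select(joinedDF.age_old != joinedDF.age) returns a DataFrame object, and any DataFrame is truthy in Python, so the if block runs unconditionally and rewrites age_old and counter for every row. A per-row condition has to be expressed as a column expression instead, e.g. with when/otherwise. A minimal sketch of that approach, reusing the column names from the attempt above:

from pyspark.sql import functions as F

# column expression, evaluated per row at execution time
changed = F.col("age_old") != F.col("age")

joinedDF = joinedDF \
    .withColumn("counter", F.when(changed, F.col("counter") + 1).otherwise(F.col("counter"))) \
    .withColumn("age_old", F.when(changed, F.col("age")).otherwise(F.col("age_old")))

The counter is updated first, while age_old still holds the old value. With the outer join, an unmatched id has a NULL age, the comparison evaluates to NULL, and when treats that as false, so such rows keep their original age and counter.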

1 Answer:

Answer 0 (score: 0):

This achieves the same result:

from pyspark.sql import Row

row = Row('id', 'age', 'counter')
old_df = spark.createDataFrame([row(1, 12, 0), row(2, 15, 0), row(3, 22, 0), row(4, 19, 0)])
old_df.show()

row2 = Row('id', 'age')
new_df = spark.createDataFrame([row2(1, 20), row2(2, 21), row2(3, 22), row2(4, 19)])
new_df.show()

# join on id, take the new age, and bump the counter only where the age changed
old_df = old_df.alias("old_df") \
    .join(new_df.alias("new_df"), old_df.id == new_df.id, "inner") \
    .selectExpr("old_df.id as id",
                "new_df.age as age",
                "if(old_df.age != new_df.age, old_df.counter + 1, old_df.counter) as counter") \
    .sort("id")

old_df.show()

Output:

+---+---+-------+
| id|age|counter|
+---+---+-------+
|  1| 12|      0|
|  2| 15|      0|
|  3| 22|      0|
|  4| 19|      0|
+---+---+-------+

+---+---+
| id|age|
+---+---+
|  1| 20|
|  2| 21|
|  3| 22|
|  4| 19|
+---+---+

+---+---+-------+
| id|age|counter|
+---+---+-------+
|  1| 20|      1|
|  2| 21|      1|
|  3| 22|      0|
|  4| 19|      0|
+---+---+-------+
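
One caveat: the inner join drops the 5th id from the edited question, since it exists only in old_df. If unmatched rows should be carried over unchanged, a left join works, because the SQL if treats the NULL comparison on unmatched rows as false. A sketch of that variation, assuming old_df still holds its original five rows:

old_df = old_df.alias("old_df") \
    .join(new_df.alias("new_df"), old_df.id == new_df.id, "left") \
    .selectExpr("old_df.id as id",
                "coalesce(new_df.age, old_df.age) as age",  # keep the old age when there is no match
                "if(old_df.age != new_df.age, old_df.counter + 1, old_df.counter) as counter") \
    .sort("id")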