I have two data frames:
old_df - a fixed data frame; new_df - changes every day.
For each id, the age in old_df should be updated with the value of the age column from new_df, and whenever an update happens the counter increases by 1. If the age has not changed, both the age column and the counter keep their current values (no increment).
old_df: (edited to include a 5th id)
id age counter
1 12 0
2 15 0
3 22 0
4 19 0
5 11 0
new_df:
id age
1 20
2 21
3 22
4 19
Now the output old_df should be:
old_df:
id age counter
1 20 1
2 21 1
3 22 0
4 19 0
5 11 0
So far I have tried the following:
df_old = df_old.withColumnRenamed('id', 'id_old')\
               .withColumnRenamed('age', 'age_old')
joinedDF = df_old.join(df_new, df_new["id"] == df_old["id_old"], "outer")
if(joinedDF.select(joinedDF.age_old != joinedDF.age)):
    joinedDF = joinedDF.withColumn("age_old", joinedDF['age'])
    joinedDF = joinedDF.withColumn("counter", joinedDF['counter'] + 1)
joinedDF[['id_old', 'age_old', 'counter']].toPandas()
id_old age_old counter
1 20 1
2 21 1
3 22 1
4 19 1
As you can see in the output for id_old = 3 and 4, the counter should stay at 0, but it is set to 1. Thanks for any help.
Answer 0 (score: 0)
The following achieves the same result:
from pyspark.sql import Row

# Assumes an existing SparkSession bound to `spark` (as in the pyspark shell).
row = Row('id', 'age', 'counter')
old_df = spark.createDataFrame([row(1, 12, 0), row(2, 15, 0), row(3, 22, 0), row(4, 19, 0)])
old_df.show()

row2 = Row('id', 'age')
new_df = spark.createDataFrame([row2(1, 20), row2(2, 21), row2(3, 22), row2(4, 19)])
new_df.show()

old_df = old_df.alias("old_df").join(new_df.alias("new_df"), old_df.id == new_df.id, "inner") \
    .selectExpr("old_df.id as id",
                "new_df.age as age",
                "if(old_df.age != new_df.age, old_df.counter + 1, old_df.counter) as counter") \
    .sort("id")
old_df.show()
Output:
+---+---+-------+
| id|age|counter|
+---+---+-------+
| 1| 12| 0|
| 2| 15| 0|
| 3| 22| 0|
| 4| 19| 0|
+---+---+-------+
+---+---+
| id|age|
+---+---+
| 1| 20|
| 2| 21|
| 3| 22|
| 4| 19|
+---+---+
+---+---+-------+
| id|age|counter|
+---+---+-------+
| 1| 20| 1|
| 2| 21| 1|
| 3| 22| 0|
| 4| 19| 0|
+---+---+-------+
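
For completeness, here is a self-contained sketch (not part of the original answer) that uses a left join together with when/otherwise instead of an inner join, so ids that exist only in old_df (id 5 in the edited question) keep their age and counter unchanged, which matches the expected output in the question:

from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

row = Row('id', 'age', 'counter')
old_df = spark.createDataFrame([row(1, 12, 0), row(2, 15, 0), row(3, 22, 0),
                                row(4, 19, 0), row(5, 11, 0)])

row2 = Row('id', 'age')
new_df = spark.createDataFrame([row2(1, 20), row2(2, 21), row2(3, 22), row2(4, 19)])

# Left join keeps ids that exist only in old_df (id 5 here).
joined = old_df.alias("o").join(new_df.alias("n"), on="id", how="left")

updated = joined.select(
    F.col("id"),
    # Take the new age when one exists, otherwise keep the old age.
    F.coalesce(F.col("n.age"), F.col("o.age")).alias("age"),
    # Increment the counter only when a different age arrives.
    F.when(F.col("n.age").isNotNull() & (F.col("n.age") != F.col("o.age")),
           F.col("o.counter") + 1)
     .otherwise(F.col("o.counter"))
     .alias("counter"),
).sort("id")

updated.show()

With the sample data this gives ids 1 and 2 the new age and counter 1, leaves ids 3 and 4 unchanged, and keeps id 5 at age 11 with counter 0.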