I have two DataFrames, say dfA and dfB.
dfA:
IdCol | Col2 | Col3
id1 | val2 | val3
dfB:
IdCol | Col2 | Col3
id1 | val2 | val4
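The two DataFrames are joined on IdCol. I want to compare them row by row, keep only the columns that differ, and save their values in another DataFrame. For example, from the two DataFrames above I would like to get this result: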
dfChanges:
RowId | Col | dfA_value | dfB_value
id1 | Col3 | val3 | val4
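I am a bit stuck on how to do this. Can anyone point me in the right direction? Thanks in advance.
Edit
This is my attempt, but it is neither very clear nor very fast. Is there a better way to do it?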
from pyspark.sql.functions import col, lit

dfChanges = None
# For every column except the id
for colName in dfA.columns[1:]:
    # Select the id and the target column from both datasets
    # and subtract to find the rows that differ
    changedRows = dfA.select(['IdCol', colName]).subtract(dfB.select(['IdCol', colName]))
    # Join with dfB on the id to take the value of the target column from there
    temp = changedRows.join(
        dfB.select(col('IdCol'), col(colName).alias("dfB_value")), 'IdCol', 'inner')
    # Rename columns to the target layout
    temp = temp.withColumnRenamed(colName, "dfA_value")
    temp = temp.withColumn("Col", lit(colName))
    # Append to a single dataframe
    if dfChanges is None:
        dfChanges = temp
    else:
        dfChanges = dfChanges.union(temp)
Answer 0 (score: 2)
Join the two DataFrames by id:
dfA = spark.createDataFrame(
    [("id1", "val2", "val3")], ("Idcol1", "Col2", "Col3")
)
dfB = spark.createDataFrame(
    [("id1", "val2", "val4")], ("Idcol1", "Col2", "Col3")
)
dfAB = dfA.alias("dfA").join(dfB.alias("dfB"), "Idcol1")
Reshape, pairing each column's dfA and dfB values in a struct:
from pyspark.sql.functions import col, struct

ids = ["Idcol1"]
vals = [struct(
    col("dfA.{}".format(c)).alias("dfA_value"),
    col("dfB.{}".format(c)).alias("dfB_value")
).alias(c) for c in dfA.columns if c not in ids]
And melt (defined here), keeping only the rows where the two values differ:
(melt(dfAB.select(ids + vals), ids, [c for c in dfA.columns if c not in ids])
    .where(col("value.dfA_value") != col("value.dfB_value"))
    .select(ids + ["variable", "value.dfA_value", "value.dfB_value"])
    .show())
+------+--------+---------+---------+
|Idcol1|variable|dfA_value|dfB_value|
+------+--------+---------+---------+
|   id1|    Col3|     val3|     val4|
+------+--------+---------+---------+