Compare two data frames and create a report giving the names of the differing fields and the old/new values

Time: 2016-02-22 14:02:45

Tags: r dataframe compare

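I have two data frames, df.old and df.new. df.old contains extra columns that df.new does not have. I want to compare each cell in df.new with the cell in df.old that belongs to the same row (same ID_KEY) and the same column. I then want to create a separate data frame that is a report of all differences, giving the ID_KEY, the field name, and the old and new values. For example: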

df.old:
ID_KEY | Date of Valuation | Original LTV | Tenure    | Valuation in Current Condition | Comment
1      | 22/02/2016        | 76%          | Leasehold | £151,000                       |
2      | 22/02/2016        | 75%          | Leasehold | £151,000                       |
3      | 23/02/2016        | 76%          | Leasehold | £150,000                       |
4      | 24/02/2016        | 76%          | Freehold  | £151,000                       |

df.new:
ID_KEY | Date of Valuation | Original LTV | Tenure    | Valuation in Current Condition
1      | 21/02/2016        | 76%          | Leasehold | £151,000
2      | 22/02/2016        | 73%          | Leasehold | £151,000
3      | 23/02/2016        | 76%          | Leasehold | £153,000
4      | 24/02/2016        | 76%          | Leasehold | £151,000

Report:

ID_KEY | Fieldname                      | df.old_value | df.new_value
1      | Date of Valuation              | 22/02/2016   | 21/02/2016
2      | Original LTV                   | 75%          | 73%
3      | Valuation in Current Condition | £150,000     | £153,000
4      | Tenure                         | Freehold     | Leasehold

I could manage to write this in VBA, but my R is a bit rusty. I know there is a simpler way to write this in R using split-apply-combine, but I can't figure it out.

1 Answer:

Answer 0 (score: 1)

Option 1: One possible approach is to melt both data frames into long format, merge them, and then filter on the values that don't match:

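library(reshape2)

# reshape both data frames to long format: one row per ID_KEY/field pair
df.old2 <- melt(df.old, id.vars = "ID_KEY", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "ID_KEY", value.name = "df.new_value")

# join on ID_KEY and field name, then keep only the rows where the values differ
df.merged <- merge(df.old2, df.new2, by = c("ID_KEY","variable"))
df.merged[df.merged$df.old_value != df.merged$df.new_value,]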

which gives:

   ID_KEY                    variable df.old_value df.new_value
1       1           Date.of.Valuation   22/02/2016   21/02/2016
6       2                Original.LTV          75%          73%
12      3 Valuation.Current.Condition     £150,000     £153,000
15      4                      Tenure     Freehold    Leasehold
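
The same melt, merge, filter idea can also be sketched in base R without reshape2. This is only a minimal sketch under stated assumptions: the helper names to_long and diff_report are hypothetical (they are not part of the original answer or of any package), all values are compared as character strings, only columns present in both data frames are considered, and the small example data frames are made up for illustration.

```r
# Hypothetical helper: reshape a data frame to long format,
# one row per ID/field pair, with all values coerced to character.
to_long <- function(df, id, fields, value_name) {
  out <- data.frame(
    id    = rep(df[[id]], times = length(fields)),
    field = rep(fields, each = nrow(df)),
    value = unlist(lapply(df[fields], as.character), use.names = FALSE),
    stringsAsFactors = FALSE
  )
  names(out) <- c(id, "fieldname", value_name)
  out
}

# Hypothetical helper: report every field whose value differs between old and new.
diff_report <- function(old, new, id = "ID_KEY") {
  # compare only the columns that exist in both data frames
  fields <- setdiff(intersect(names(old), names(new)), id)
  long_old <- to_long(old, id, fields, "df.old_value")
  long_new <- to_long(new, id, fields, "df.new_value")
  merged <- merge(long_old, long_new, by = c(id, "fieldname"))
  # which() drops comparisons that evaluate to NA
  merged[which(merged$df.old_value != merged$df.new_value), ]
}

# Made-up illustration data (assumption, not the data from the question)
old <- data.frame(ID_KEY = 1:3,
                  Tenure = c("Leasehold", "Leasehold", "Freehold"),
                  Original.LTV = c("76%", "75%", "76%"),
                  Comment = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
new <- data.frame(ID_KEY = 1:3,
                  Tenure = c("Leasehold", "Leasehold", "Leasehold"),
                  Original.LTV = c("76%", "73%", "76%"),
                  stringsAsFactors = FALSE)

report <- diff_report(old, new)
print(report)
```

For this illustration data the report should contain two rows: Original.LTV for ID_KEY 2 and Tenure for ID_KEY 3. The extra Comment column is ignored because it exists only in the old data frame.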

Option 2: Another approach is to merge the data frames together first and then melt the result into long format with the melt function from the data.table package, which is able to give multiple value columns in the output based on patterns:

# create a vector with the common fieldnames
fnames <- names(df.new)[-1]
# or:
fnames <- names(df.old)[names(df.old) %in% names(df.new)][-1]

# merge the dataframes together based on "ID_KEY"
df1 <- merge(df.old, df.new, by = "ID_KEY")

# melt 'df1' into long format and check where the two value columns don't match
# (the patterns are anchored so that only the ".x"/".y" suffixes added by merge match)
library(data.table)
melt(setDT(df1), "ID_KEY", 
     measure.vars = patterns("\\.x$","\\.y$"),
     variable.name = "fieldname",
     value.name = c("df.old_value","df.new_value"))[, fieldname := fnames[fieldname]
                                                    ][df.old_value!=df.new_value][]

which gives:

   ID_KEY                   fieldname df.old_value df.new_value
1:      1           Date.of.Valuation   22/02/2016   21/02/2016
2:      2                Original.LTV          75%          73%
3:      4                      Tenure     Freehold    Leasehold
4:      3 Valuation.Current.Condition     £150,000     £153,000

Note: the data I used also has an ID_KEY in df.old (5) that has no match in df.new:

df.old <- read.table(text="ID_KEY  Date.of.Valuation  Original.LTV  Tenure  Valuation.Current.Condition  Comment
1       22/02/2016         76%     Leasehold     £151,000  Comment
2       22/02/2016         75%     Leasehold     £151,000  Comment
3       23/02/2016         76%     Leasehold     £150,000  Comment
4       24/02/2016         76%     Freehold     £151,000  Comment
5       24/02/2016         76%     Freehold     £151,000  Comment", header=TRUE)

df.new <- read.table(text="ID_KEY  Date.of.Valuation  Original.LTV  Tenure  Valuation.Current.Condition
1       21/02/2016         76%     Leasehold     £151,000
2       22/02/2016         73%     Leasehold     £151,000
3       23/02/2016         76%     Leasehold     £153,000
4       24/02/2016         76%     Leasehold     £151,000", header=TRUE)

Update for the new example data:

Applying the reshape2 method to the new example data:

df.old2 <- melt(df.old, id.vars = "Loan Identifier", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "Loan Identifier", value.name = "df.new_value")

df.m <- merge(df.old2, df.new2, by = c("Loan Identifier","variable"))
df.r <- df.m[which(df.m$df.old_value!=df.m$df.new_value),]

which gives:

> head(df.r)
   Loan Identifier                       variable df.old_value df.new_value
1        960959610 Advance Amount (Gross Advance)       172499       166000
8        960959610                Completion date   1446422400   1447286400
11       960959610                      Income B1        22800        47211
12       960959610                      Income B2        22000        19461
13       960959610                  Interest Rate       0.0309       0.0409
21       960959610                  Original Term          420          240

With data.table, the method used on the first example dataset doesn't work for this data. A working solution that is similar to the reshape2 method:

# making copies, not necessarily needed
df.o <- as.data.table(df.old)
df.n <- as.data.table(df.new)

df.o2 <- melt(df.o, id.vars = "Loan Identifier", value.name = "df.old_value")
df.n2 <- melt(df.n, id.vars = "Loan Identifier", value.name = "df.new_value")

# join on both key columns, then filter inside the joined table
# (referring to df.j here would fail, because df.j does not exist yet)
df.j <- df.n2[df.o2, on = c("Loan Identifier","variable")
              ][df.old_value != df.new_value]
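
One caveat with filtering on `!=`: a comparison that involves NA evaluates to NA, so a field that changed from a value to missing (or the other way round) is dropped by which() and by data.table's logical filter, and shows up as an all-NA row with plain logical indexing of a data frame. Below is a minimal sketch of an NA-aware comparison in base R. The helper name is_different is hypothetical and not part of the original answer, and the two example vectors are made up for illustration.

```r
# Hypothetical helper: TRUE where old and new differ, treating NA as a value.
is_different <- function(old, new) {
  both_na <- is.na(old) & is.na(new)      # missing on both sides: no change
  one_na  <- xor(is.na(old), is.na(new))  # missing on exactly one side: a change
  # plain comparison where both values are present; NA results are overwritten below
  neq <- old != new
  neq[both_na] <- FALSE
  neq[one_na]  <- TRUE
  neq
}

# Made-up illustration vectors (assumption, not data from the question)
old_values <- c("76%", NA, "Freehold", NA)
new_values <- c("76%", "73%", NA, NA)

changed <- is_different(old_values, new_values)
print(changed)
# expected: FALSE TRUE TRUE FALSE
```

With the merged data frame from the reshape2 approach, a filter such as df.m[is_different(df.m$df.old_value, df.m$df.new_value), ] would then also keep changes to or from missing values.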