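I have two data frames, df.old and df.new. df.old contains extra columns that df.new does not have. I want to compare each cell in df.new with the cell in df.old that sits in the same row (same ID_KEY) and the corresponding column. Then I want to create a separate data frame that is a report of all the differences, giving the ID_KEY, the field name, and the old and new values. For example: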
df.old:
ID_KEY | Date of Valuation | Original LTV | Tenure | Valuation in Current Condition | Comment
1 | 22/02/2016 | 76% | Leasehold | £151,000 |
2 | 22/02/2016 | 75% | Leasehold | £151,000 |
3 | 23/02/2016 | 76% | Leasehold | £150,000 |
4 | 24/02/2016 | 76% | Freehold | £151,000 |
df.new:
ID_KEY | Date of Valuation | Original LTV | Tenure | Valuation in Current Condition
1 | 21/02/2016 | 76% | Leasehold | £151,000
2 | 22/02/2016 | 73% | Leasehold | £151,000
3 | 23/02/2016 | 76% | Leasehold | £153,000
4 | 24/02/2016 | 76% | Leasehold | £151,000
Report:
ID_KEY | Fieldname | df.old_value | df.new_value
1 | Date of Valuation | 22/02/2016 | 21/02/2016
2 | Original LTV | 75% | 73%
3 | Valuation in Current Condition | £150,000 | £153,000
4 | Tenure | Freehold | Leasehold
I could manage to write this in VBA, but my R is a bit rusty. I know there is an easier way to write this in R using split-apply-combine, but I can't figure it out.
Answer 0 (score: 1)
Option 1: One possible approach is to melt both data frames into long format, merge them, and then filter on the values that do not match:
library(reshape2)
df.old2 <- melt(df.old, id.vars = "ID_KEY", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "ID_KEY", value.name = "df.new_value")
df.merged <- merge(df.old2, df.new2, by = c("ID_KEY","variable"))
df.merged[df.merged$df.old_value!=df.merged$df.new_value,]
which gives:
ID_KEY variable df.old_value df.new_value
1 1 Date.of.Valuation 22/02/2016 21/02/2016
6 2 Original.LTV 75% 73%
12 3 Valuation.Current.Condition £150,000 £153,000
15 4 Tenure Freehold Leasehold
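For readers who would rather avoid extra packages, the same merge-then-compare idea can be sketched in base R. This is only an illustration under stated assumptions: the helper name compare_frames is hypothetical (it is not part of the original answer), the data frames are typed in by hand from the question's example, and all values are compared as character strings.

```r
# Example data typed in from the question (assumption: all value columns are character)
df.old <- data.frame(
  ID_KEY = 1:4,
  Date.of.Valuation = c("22/02/2016", "22/02/2016", "23/02/2016", "24/02/2016"),
  Original.LTV = c("76%", "75%", "76%", "76%"),
  Tenure = c("Leasehold", "Leasehold", "Leasehold", "Freehold"),
  Valuation.Current.Condition = c("£151,000", "£151,000", "£150,000", "£151,000"),
  Comment = "Comment",
  stringsAsFactors = FALSE
)
df.new <- data.frame(
  ID_KEY = 1:4,
  Date.of.Valuation = c("21/02/2016", "22/02/2016", "23/02/2016", "24/02/2016"),
  Original.LTV = c("76%", "73%", "76%", "76%"),
  Tenure = c("Leasehold", "Leasehold", "Leasehold", "Leasehold"),
  Valuation.Current.Condition = c("£151,000", "£151,000", "£153,000", "£151,000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report every cell that differs between 'old' and 'new'
compare_frames <- function(old, new, key = "ID_KEY") {
  # only columns present in both data frames can be compared
  fields <- setdiff(intersect(names(old), names(new)), key)
  # inner join on the key; shared columns get the suffixes .old and .new
  both <- merge(old, new, by = key, suffixes = c(".old", ".new"))
  # one small report per field, stacked afterwards
  pieces <- lapply(fields, function(f) {
    old_val <- as.character(both[[paste0(f, ".old")]])
    new_val <- as.character(both[[paste0(f, ".new")]])
    changed <- which(old_val != new_val)  # which() skips NA comparisons
    out <- data.frame(both[[key]][changed], rep(f, length(changed)),
                      old_val[changed], new_val[changed],
                      stringsAsFactors = FALSE)
    names(out) <- c(key, "Fieldname", "df.old_value", "df.new_value")
    out
  })
  report <- do.call(rbind, pieces)
  report <- report[order(report[[key]]), ]
  rownames(report) <- NULL
  report
}

report <- compare_frames(df.old, df.new)
print(report)
```

On the example data this should report four rows, one per ID_KEY. The Comment column is ignored because it exists only in df.old, and keys present in just one of the data frames are dropped by the inner join.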
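Option 2: Another approach is to merge the data frames together first and then melt the result into long format, using the melt function from the data.table package, which is able to return multiple value columns in the output based on patterns: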
# create a vector with the common field names
fnames <- names(df.new)[-1]
# or, more generally:
fnames <- names(df.old)[names(df.old) %in% names(df.new)][-1]
# merge the data frames on "ID_KEY"; shared columns get the suffixes .x (old) and .y (new)
df1 <- merge(df.old, df.new, by = "ID_KEY")
# melt 'df1' into long format and keep the rows where the two value columns differ
library(data.table)
# the patterns are anchored so they match only the merge suffixes
melt(setDT(df1), "ID_KEY",
     measure.vars = patterns("\\.x$", "\\.y$"),
     variable.name = "fieldname",
     value.name = c("df.old_value","df.new_value"))[, fieldname := fnames[fieldname]
     ][df.old_value != df.new_value][]
which gives:
ID_KEY fieldname df.old_value df.new_value
1: 1 Date.of.Valuation 22/02/2016 21/02/2016
2: 2 Original.LTV 75% 73%
3: 4 Tenure Freehold Leasehold
4: 3 Valuation.Current.Condition £150,000 £153,000
Note: the data I used also contains an ID_KEY in df.old that has no match in df.new:
df.old <- read.table(text="ID_KEY Date.of.Valuation Original.LTV Tenure Valuation.Current.Condition Comment
1 22/02/2016 76% Leasehold £151,000 Comment
2 22/02/2016 75% Leasehold £151,000 Comment
3 23/02/2016 76% Leasehold £150,000 Comment
4 24/02/2016 76% Freehold £151,000 Comment
5 24/02/2016 76% Freehold £151,000 Comment", header=TRUE)
df.new <- read.table(text="ID_KEY Date.of.Valuation Original.LTV Tenure Valuation.Current.Condition
1 21/02/2016 76% Leasehold £151,000
2 22/02/2016 73% Leasehold £151,000
3 23/02/2016 76% Leasehold £153,000
4 24/02/2016 76% Leasehold £151,000", header=TRUE)
Update for the new example data:
Applying the approach with reshape2:
df.old2 <- melt(df.old, id.vars = "Loan Identifier", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "Loan Identifier", value.name = "df.new_value")
df.m <- merge(df.old2, df.new2, by = c("Loan Identifier","variable"))
# which() drops the rows where the comparison is NA
df.r <- df.m[which(df.m$df.old_value!=df.m$df.new_value),]
which gives:
> head(df.r)
Loan Identifier variable df.old_value df.new_value
1 960959610 Advance Amount (Gross Advance) 172499 166000
8 960959610 Completion date 1446422400 1447286400
11 960959610 Income B1 22800 47211
12 960959610 Income B2 22000 19461
13 960959610 Interest Rate 0.0309 0.0409
21 960959610 Original Term 420 240
With data.table, the method used on the first example dataset does not work. A working solution, similar to the reshape2 approach:
# making copies, not strictly needed
df.o <- as.data.table(df.old)
df.n <- as.data.table(df.new)
df.o2 <- melt(df.o, id.vars = "Loan Identifier", value.name = "df.old_value")
df.n2 <- melt(df.n, id.vars = "Loan Identifier", value.name = "df.new_value")
# join on key and field name, then filter inside the chain
# (df.j cannot be referenced here because it does not exist until the assignment completes)
df.j <- df.n2[df.o2, on = c("Loan Identifier","variable")
              ][df.old_value != df.new_value]
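One thing worth keeping in mind with these filters: in R, comparing anything with NA gives NA, so a cell that changed from missing to a value (or the other way round) does not show up as a difference. The which() wrapper in the reshape2 code simply drops those rows. If such changes should be reported as well, a small NA-aware comparison could look like the sketch below. The helper name differs and the example vectors are made up for illustration; they are not part of the original answer.

```r
# Hypothetical helper: TRUE where two values differ, treating NA as a comparable value
differs <- function(a, b) {
  neq <- a != b                  # NA wherever either side is NA
  neq[is.na(neq)] <- FALSE       # neutralise the NA results first
  neq | xor(is.na(a), is.na(b))  # exactly one side missing counts as a change
}

# Made-up example values
old_value <- c("76%", NA, "Freehold", NA)
new_value <- c("76%", "73%", "Leasehold", NA)

plain_idx <- which(old_value != new_value)  # positions found by the plain comparison
na_aware  <- differs(old_value, new_value)  # logical vector, never NA

print(plain_idx)
print(na_aware)
```

With these vectors the plain comparison finds only the third position, while differs() also flags the second one, where the old value was missing. The result can be used directly as a row filter, for example df.m[differs(df.m$df.old_value, df.m$df.new_value), ].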