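I have two data frames from different years. The second year's df has additional rows as well as updated values in some cells. My goal is a new data frame that shows only what was added or changed; everything else can be 0, NA, or dropped.
Here is df 1 (y1):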
project_ID sequence item q1 q2 q3 q4
NA NA NA NA NA NA NA
NA 207 period 201h 202h 203h 204h
NA 222 prepayment 1202 202.3 99 2922
2455 271 prepayment_2 1000 1000 1000 1000
2929 780 UPS 50 51 52 53
NA NA NA NA NA NA NA
A year later I have this new dataset (y2); note the extra row and the changed values.
project_ID sequence item q1 q2 q3 q4
NA NA NA NA NA NA NA
NA 207 period 201h 202h 203h 204h
NA 222 prepayment 1202 202.3 99 2922
2455 271 prepayment_2 999 999 1002 1000
3002 299 payment 500 500 500 500
2929 780 UPS 50 51 52 53
NA NA NA NA NA NA NA
I tried the compare() function from library(compare), but as far as I can tell it does not do what I am looking for.
cmp <- compare(df1, df2)
cmp$tM
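But this does not really help, especially since the two data frames have different numbers of rows. It also only tells me which values differ, without computing the differences.
What I would like to see is a new data frame that looks like this: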
project_ID sequence item q1 q2 q3 q4
2455 271 prepayment_2 -1 -1 2
3002 299 payment 500 500 500 500
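This is the best layout I could come up with, but fundamentally I just need a new df containing what changed and the difference for each changed value; the spacing is not that important, and if a different layout is easier I would be happy with that.
Edit: here are the two dfs for R.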
y1<- structure(list(project_ID = c("NA", "NA", "NA", "2455", "2929",
"NA"), sequence = c("NA", "207", "222", "271", "780", "NA"),
item = c("NA", "period", "prepayment", "prepayment_2", "UPS",
"NA"), q1 = c("NA", "201h", "1202", "1000", "50", "NA"),
q2 = c("NA", "202h", "202.3", "1000", "51", "NA"), q3 = c("NA",
"203h", "99", "1000", "52", "NA"), q4 = c("NA", "204h", "2922",
"1000", "53", "NA")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
y2 <- structure(list(project_ID = c("NA", "NA", "NA", "2455", "3002",
"2929", "NA"), sequence = c("NA", "207", "222", "271", "299",
"780", "NA"), item = c("NA", "period", "prepayment", "prepayment_2",
"payment", "UPS", "NA"), q1 = c("NA", "201h", "1202", "999",
"500", "50", "NA"), q2 = c("NA", "202h", "202.3", "999", "500",
"51", "NA"), q3 = c("NA", "203h", "99", "1002", "500", "52", "NA"
), q4 = c("NA", "204h", "2922", "1000", "500", "53", "NA")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
Answer 0 (score: 3)
As suggested, the *_join family of functions is useful here, together with reshaping from wide to long and back to wide.
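Note: I am assuming that anything that looks like a number is a number. With the active as.numeric line, a value such as "201h" becomes NA (as in the output below); the commented-out gsub line would instead strip the non-digit characters and convert "201h" to 201. (If that assumption is wrong, please update your sample data.)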
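To make the difference between the two conversions concrete, here is a minimal sketch in base R. The vector `x` is an illustrative sample taken from the q1 and q2 columns of the question's data; `strict` and `lenient` are names I made up for this example.

```r
# Sample values taken from the question's data (stored as character)
x <- c("NA", "201h", "1202", "202.3")

# Option 1: plain coercion; anything that is not a clean number becomes NA
strict <- suppressWarnings(as.numeric(x))

# Option 2: drop every character except digits and "." before coercing
lenient <- as.numeric(gsub("[^.[:digit:]]", "", x))

# strict:  NA, NA,  1202, 202.3
# lenient: NA, 201, 1202, 202.3
print(strict)
print(lenient)
```

Which option is right depends on whether "201h" is meant to be a label or a quantity; the pipeline below uses the first option.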
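library(dplyr)
library(tidyr)

full_join(
  # reshape each year to long form: one row per (project_ID, sequence, item, q)
  gather(y1, q, val1, -project_ID, -sequence, -item) %>% mutate(in1 = TRUE),
  gather(y2, q, val2, -project_ID, -sequence, -item) %>% mutate(in2 = TRUE),
  by = c("project_ID", "sequence", "item", "q")
) %>%
  # alternative: strip non-numeric characters first, so "201h" becomes 201
  # mutate_at(vars(val1, val2), ~ as.numeric(gsub("[^.[:digit:]]", "", .))) %>%
  # coerce to numeric; values such as "201h" or "NA" become NA
  mutate_at(vars(val1, val2), ~ suppressWarnings(as.numeric(.))) %>%
  mutate(
    # valdiff = val2 - val1
    valdiff = case_when(
      is.na(val1) ~ val2,        # no usable y1 value: keep the y2 value
      is.na(val2) ~ val1,        # no usable y2 value: keep the y1 value
      TRUE ~ val2 - val1         # both present: year-over-year difference
    )
  ) %>%
  select(-val1, -val2) %>%
  distinct() %>%
  # back to wide form: one column per quarter
  spread(q, valdiff)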
# # A tibble: 6 x 9
# project_ID sequence item in1 in2 q1 q2 q3 q4
# <chr> <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl> <dbl>
# 1 2455 271 prepayment_2 TRUE TRUE -1 -1 -898 0
# 2 2929 780 UPS TRUE TRUE 0 0 0 0
# 3 3002 299 payment NA TRUE 500 500 500 500
# 4 NA 207 period TRUE TRUE NA NA NA NA
# 5 NA 222 prepayment TRUE TRUE 0 0 0 0
# 6 NA NA NA TRUE TRUE NA NA NA NA
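(I assume the difference between my output and the expected output comes from a copy/paste issue in the data: perhaps the 102 in y2 should be 1002? With 1002, q3 for project 2455 would be 1002 - 1000 = 2, as in the expected output.)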
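The output above still contains the unchanged rows, while the question asks for only what was added or changed. Here is a hedged sketch of one way to filter them out. It is not the exact pipeline above: it swaps in `pivot_longer()`/`pivot_wider()` (tidyr 1.0 or later) for `gather()`/`spread()` and uses `across()` (dplyr 1.0 or later). The helper `to_long()`, the `keys` vector, and the reduced copies of `y1` and `y2` are illustrative assumptions, with q3 set to 1002 as in the posted dput.

```r
library(dplyr)
library(tidyr)

# Reduced copies of the question's data (the all-"NA" and "period" rows are
# left out); values are stored as character, as in the original dput output.
y1 <- tibble(
  project_ID = c("NA", "2455", "2929"),
  sequence   = c("222", "271", "780"),
  item       = c("prepayment", "prepayment_2", "UPS"),
  q1 = c("1202", "1000", "50"),
  q2 = c("202.3", "1000", "51"),
  q3 = c("99", "1000", "52"),
  q4 = c("2922", "1000", "53")
)
y2 <- tibble(
  project_ID = c("NA", "2455", "3002", "2929"),
  sequence   = c("222", "271", "299", "780"),
  item       = c("prepayment", "prepayment_2", "payment", "UPS"),
  q1 = c("1202", "999", "500", "50"),
  q2 = c("202.3", "999", "500", "51"),
  q3 = c("99", "1002", "500", "52"),
  q4 = c("2922", "1000", "500", "53")
)

# Columns that identify a row in both years
keys <- c("project_ID", "sequence", "item")

# Hypothetical helper: wide -> long, with the quarter values coerced to numeric
to_long <- function(df) {
  df %>%
    pivot_longer(cols = !all_of(keys), names_to = "q", values_to = "val") %>%
    mutate(val = suppressWarnings(as.numeric(val)))
}

changed <- full_join(to_long(y1), to_long(y2),
                     by = c(keys, "q"), suffix = c("1", "2")) %>%
  mutate(valdiff = case_when(
    is.na(val1) ~ val2,        # no usable y1 value: keep the y2 value
    is.na(val2) ~ val1,        # no usable y2 value: keep the y1 value
    TRUE        ~ val2 - val1  # both present: year-over-year difference
  )) %>%
  # keep a row only if at least one quarter has a non-zero difference
  group_by(across(all_of(keys))) %>%
  filter(any(!is.na(valdiff) & valdiff != 0)) %>%
  ungroup() %>%
  select(-val1, -val2) %>%
  pivot_wider(names_from = q, values_from = valdiff)

print(changed)
```

On this reduced data only the rows for projects 2455 and 3002 should remain. One limitation of this filter: a newly added row whose quarters are all 0 would be dropped, so a separate "added" flag would be needed if that case matters.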