Compare two data frames with different numbers of rows in R to find differences in cells

Time: 2019-07-29 21:08:25

Tags: r dataframe dplyr

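I have two data frames from different years. The second year's df has additional rows as well as updated values in some cells. My goal is a new data frame that shows only what was added or changed; everything else can be 0, NA, or dropped.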

Here is df 1 (y1):

project_ID  sequence  item         q1    q2    q3   q4
NA          NA        NA           NA    NA    NA   NA
NA          207       period       201h  202h  203h 204h     
NA          222       prepayment   1202  202.3 99   2922
2455        271       prepayment_2 1000  1000  1000 1000
2929        780       UPS          50    51    52   53
NA          NA        NA           NA    NA    NA   NA

A year has now passed and I have this new dataset (y2). Note the additional row and the changed values.

project_ID  sequence  item         q1    q2    q3   q4
NA          NA        NA           NA    NA    NA   NA
NA          207       period       201h  202h  203h 204h     
NA          222       prepayment   1202  202.3 99   2922
2455        271       prepayment_2 999   999   1002 1000
3002        299       payment      500   500   500  500
2929        780       UPS          50    51    52   53 
NA          NA        NA           NA    NA    NA   NA

I tried the compare() function from library(compare), but as far as I can tell it does not do what I am looking for.

cmp <- compare(df1, df2)
cmp$tM

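This doesn't really help me, especially because the number of rows differs. Besides, it only tells me which values differ without computing the difference.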

What I would like to see is a new data frame that looks like this:

project_ID  sequence  item         q1    q2    q3   q4



2455        271       prepayment_2 -1    -1     2   
3002        299       payment      500   500   500  500


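This is the best layout I could come up with, but fundamentally I just need a new df containing what has changed and, for changed values, the difference. The spacing is not that important, and if it is easier to lay it out differently, I would be happy with that.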

Edit: here are the two dfs for R.

y1<- structure(list(project_ID = c("NA", "NA", "NA", "2455", "2929", 
"NA"), sequence = c("NA", "207", "222", "271", "780", "NA"), 
    item = c("NA", "period", "prepayment", "prepayment_2", "UPS", 
    "NA"), q1 = c("NA", "201h", "1202", "1000", "50", "NA"), 
    q2 = c("NA", "202h", "202.3", "1000", "51", "NA"), q3 = c("NA", 
    "203h", "99", "1000", "52", "NA"), q4 = c("NA", "204h", "2922", 
    "1000", "53", "NA")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))


y2 <- structure(list(project_ID = c("NA", "NA", "NA", "2455", "3002", 
"2929", "NA"), sequence = c("NA", "207", "222", "271", "299", 
"780", "NA"), item = c("NA", "period", "prepayment", "prepayment_2", 
"payment", "UPS", "NA"), q1 = c("NA", "201h", "1202", "999", 
"500", "50", "NA"), q2 = c("NA", "202h", "202.3", "999", "500", 
"51", "NA"), q3 = c("NA", "203h", "99", "1002", "500", "52", "NA"
), q4 = c("NA", "204h", "2922", "1000", "500", "53", "NA")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

1 Answer:

Answer 0 (score: 3)

As suggested, the *_join family of functions is useful here, together with reshaping from wide to long and then back to wide.
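
To illustrate why a full join copes with the differing row counts, here is a minimal base-R sketch using merge() with all = TRUE. The data frames `old` and `new` are hypothetical miniature stand-ins for y1 and y2, keyed only by project_ID; this is not the code used in this answer.

```r
# Hypothetical miniature versions of the two years, keyed by project_ID.
old <- data.frame(project_ID = c("2455", "2929"), q1 = c(1000, 50))
new <- data.frame(project_ID = c("2455", "3002", "2929"), q1 = c(999, 500, 50))

# all = TRUE makes this a full (outer) join: every key from either side is kept.
both <- merge(old, new, by = "project_ID", all = TRUE, suffixes = c("_y1", "_y2"))

# Rows that exist only in the newer data have NA on the older side.
both$added <- is.na(both$q1_y1)

# Difference for rows present in both years (NA where one side is missing).
both$q1_diff <- both$q1_y2 - both$q1_y1
both
```

The row for project 3002 has no partner in `old`, so its `q1_y1` is NA and `added` is TRUE. The same idea carries over to full_join() once the quarter columns are in long form.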

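Note: I am assuming that anything that looks like a number should be a number, converting "201h" to 201. (If that is not correct, please update your sample data.) The commented-out mutate_at line below does that stripping; the active line instead turns non-numeric strings such as "201h" into NA, which is what the output shows.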
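
The two conversions that appear in the code below behave differently on values such as "201h". A small base-R sketch on a hypothetical vector `x` (values borrowed from the question) shows the difference:

```r
# Illustrative input: a mix of clean numbers, a suffixed value, and the string "NA".
x <- c("201h", "1202", "202.3", "NA")

# Option 1: strip every character that is not a digit or a dot, then convert.
# "201h" becomes 201; "NA" becomes an empty string and therefore NA.
strip_then_convert <- as.numeric(gsub("[^.[:digit:]]", "", x))

# Option 2: convert directly; anything that is not a clean number becomes NA.
direct_convert <- suppressWarnings(as.numeric(x))

strip_then_convert
direct_convert
```

Note that the stripping pattern would also remove a leading minus sign, so it is only appropriate if the columns never hold negative values.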

library(dplyr)
library(tidyr)
full_join(
  gather(y1, q, val1, -project_ID, -sequence, -item) %>% mutate(in1 = TRUE),
  gather(y2, q, val2, -project_ID, -sequence, -item) %>% mutate(in2 = TRUE),
  by = c("project_ID", "sequence", "item", "q")
) %>%
  # mutate_at(vars(val1, val2), ~ as.numeric(gsub("[^.[:digit:]]", "", .))) %>%
  mutate_at(vars(val1, val2), ~ suppressWarnings(as.numeric(.))) %>%
  mutate(
    # valdiff = val2 - val1
    valdiff = case_when(
      is.na(val1) ~ val2,
      is.na(val2) ~ val1,
      TRUE ~ val2 - val1
    )
  ) %>%
  select(-val1, -val2) %>%
  distinct() %>%
  spread(q, valdiff)
# # A tibble: 6 x 9
#   project_ID sequence item         in1   in2      q1    q2    q3    q4
#   <chr>      <chr>    <chr>        <lgl> <lgl> <dbl> <dbl> <dbl> <dbl>
# 1 2455       271      prepayment_2 TRUE  TRUE     -1    -1  -898     0
# 2 2929       780      UPS          TRUE  TRUE      0     0     0     0
# 3 3002       299      payment      NA    TRUE    500   500   500   500
# 4 NA         207      period       TRUE  TRUE     NA    NA    NA    NA
# 5 NA         222      prepayment   TRUE  TRUE      0     0     0     0
# 6 NA         NA       NA           TRUE  TRUE     NA    NA    NA    NA

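(I assume the difference between my output and your expected output comes from a copy/paste problem in the data; perhaps the 102 in y2 should be 1002? The -898 in q3 is 102 - 1000; with 1002 it would be 2.)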
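
In current tidyr and dplyr, gather(), spread(), and mutate_at() still work but are superseded by pivot_longer(), pivot_wider(), and across(). Below is a sketch of the same approach with those functions, plus a final filter that keeps only rows with at least one non-zero difference, which is closer to what the question asks for. The data frames `old` and `new` and the helper `to_long()` are hypothetical, and treating a value missing on one side as 0 is an assumption that differs slightly from the case_when() logic above.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature data modeled on the question's y1 and y2.
old <- tibble(
  project_ID = c("2455", "2929"),
  sequence   = c("271", "780"),
  item       = c("prepayment_2", "UPS"),
  q1         = c("1000", "50"),
  q2         = c("1000", "51")
)
new <- tibble(
  project_ID = c("2455", "3002", "2929"),
  sequence   = c("271", "299", "780"),
  item       = c("prepayment_2", "payment", "UPS"),
  q1         = c("999", "500", "50"),
  q2         = c("1000", "500", "51")
)

keys <- c("project_ID", "sequence", "item")

# Reshape one year to long form: one row per key combination and quarter.
to_long <- function(df, value_name) {
  pivot_longer(df, cols = -all_of(keys), names_to = "q", values_to = value_name)
}

changed <- full_join(to_long(old, "val1"), to_long(new, "val2"),
                     by = c(keys, "q")) %>%
  mutate(
    across(c(val1, val2), ~ suppressWarnings(as.numeric(.x))),
    # Assumption: a value missing on one side counts as 0.
    valdiff = coalesce(val2, 0) - coalesce(val1, 0)
  ) %>%
  select(-val1, -val2) %>%
  distinct() %>%
  pivot_wider(names_from = q, values_from = valdiff) %>%
  # Keep only rows where at least one quarter differs.
  filter(if_any(-all_of(keys), ~ .x != 0))

changed
```

With this miniature data, the unchanged UPS row is dropped, leaving the modified prepayment_2 row and the newly added payment row. The distinct() call matters on the real data, where the all-"NA" rows are duplicated and would otherwise give pivot_wider() more than one value per cell.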