Identifying the columns that don't match when comparing two rows that should be identical

Time: 2017-09-11 17:05:58

Tags: r

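I'm comparing two data frames that contain the same information but were entered by different people. If there are any discrepancies, I'll need to go back to the physical records and verify which answer is correct.

My goal is to identify the columns in which rows with the same ID don't match, and then to build a data frame that gives me each row ID together with the columns that don't match for that ID. That will make the work much easier when I go back to the physical documents. I've already cleaned the data, so only rows that I know contain inconsistencies remain. In case you're wondering how I did that, I used the dupsBetweenGroups function found here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/

I've made an example below of the kind of situation I'm dealing with: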

df1 <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"),
                  rep("AAA", 4),
                  c("Yes", "No", "Yes", "No"),
                  c("", "family present", "present", ""),
                  c(NA, NA, "hey", "hey"),
                  as.Date(c("1jan2017", "2jan2017", "31mar2017", "30jul2017"), "%d%b%Y"),
                  c(0, 2, 3, 4))
names(df1) <- c("ID", "Coder", "y/n", "string", "NAs", "Dates", "num")

      ID Coder y/n         string  NAs      Dates num
1 T-A1-1   AAA Yes                <NA> 2017-01-01   0
2 T-A1-2   AAA  No family present <NA> 2017-01-02   2
3 T-A1-3   AAA Yes        present  hey 2017-03-31   3
4 T-A1-4   AAA  No                 hey 2017-07-30   4

df2 <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"),
                  rep("BBB", 4),
                  c("Yes", "Yes", "No", "No"),
                  c("", "family is present", "present", "random"),
                  c(NA, "hey", NA, "hey"),
                  as.Date(c("1jan2017", "3jan2017", "31mar2017", "29jul2017"), "%d%b%Y"),
                  c(1, 2, 5, 6))
names(df2) <- c("ID", "Coder", "y/n", "string", "NAs", "Dates", "num")

      ID Coder y/n            string  NAs      Dates num
1 T-A1-1   BBB Yes                   <NA> 2017-01-01   1
2 T-A1-2   BBB Yes family is present  hey 2017-01-03   2
3 T-A1-3   BBB  No           present <NA> 2017-03-31   5
4 T-A1-4   BBB  No            random  hey 2017-07-29   6

Because of how I cleaned the data, both data frames are combined and ordered by ID. Each data frame does, however, have a Coder column that lets me see which data frame a row originally came from (it will also make it easier to separate the two data frames if I need to). The Coder column doesn't need to be compared: rows with the same ID will always have different values there because they come from two different data frames. That said, the data frame I'm starting with looks more like this:

dfboth <- rbind(df1, df2)
dfboth <- dfboth[with(dfboth, order(ID)), ]

      ID Coder y/n            string  NAs      Dates num
1 T-A1-1   AAA Yes                   <NA> 2017-01-01   0
5 T-A1-1   BBB Yes                   <NA> 2017-01-01   1
2 T-A1-2   AAA  No    family present <NA> 2017-01-02   2
6 T-A1-2   BBB Yes family is present  hey 2017-01-03   2
3 T-A1-3   AAA Yes           present  hey 2017-03-31   3
7 T-A1-3   BBB  No           present <NA> 2017-03-31   5
4 T-A1-4   AAA  No                    hey 2017-07-30   4
8 T-A1-4   BBB  No            random  hey 2017-07-29   6

The ideal outcome would be a data frame that tells me each row ID and the columns in which its rows don't match. Something similar to this (note: I'm flexible about how the result is displayed, so it doesn't need to look exactly like this):

results <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"),
                      c("num", "y/n; string; NAs; Dates", "y/n; NAs, num", "string; Dates; num"))
names(results) <- c("ID", "col")

      ID                     col
1 T-A1-1                     num
2 T-A1-2 y/n; string; NAs; Dates
3 T-A1-3           y/n; NAs, num
4 T-A1-4      string; Dates; num

I've been combing through different forums for ways to tackle this, but to no avail. I've been thinking about a nested if function, but with 83 columns it quickly got out of hand. Any ideas on how to solve this would be greatly appreciated.

R version 3.4.1

4 answers:

Answer 0 (score: 2)

You can do this with gather from tidyr, lead from dplyr, and paste.

library(tidyr)
library(dplyr)

results <- dfboth %>%
  gather(key, value, -ID, -Coder) %>%
  group_by(ID, key) %>%
  mutate(next.value = lead(value, order_by=Coder)) %>%
  filter(Coder == "AAA") %>%
  filter(value != next.value | ((is.na(value) + is.na(next.value)) ==1)) %>%
  select(ID, key) %>%
  group_by(ID) %>%
  summarise(col = paste(key, collapse = ";")) %>%
  arrange(ID)


ID    col
T-A1-1  num         
T-A1-2  y/n;string;NAs;Dates            
T-A1-3  y/n;NAs;num         
T-A1-4  string;Dates;num
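
The second filter condition in the answer matters because != returns NA whenever either side is NA, and filter() drops rows whose condition evaluates to NA. The following is a minimal base R sketch of that behaviour. The vectors a and b are made-up values for illustration, not the question's data, and xor() is used here as an equivalent way of writing the answer's (is.na(value) + is.na(next.value)) == 1 test.

```r
# Hypothetical values for one column as entered by two coders
a <- c("x", NA, NA, "y")
b <- c("x", "z", NA, "w")

# Plain comparison: NA wherever either side is missing
plain <- a != b

# NA-aware mismatch: the values differ, or exactly one side is missing.
# Two NAs count as a match, as in the answer above.
mismatch <- (!is.na(plain) & plain) | xor(is.na(a), is.na(b))

print(plain)
print(mismatch)
```

Without the extra condition, the second pair (NA against "z") would be silently dropped instead of being reported as a mismatch.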

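Answer 1 (score: 1)

We can use two lists (one per coder) whose elements correspond to the record IDs, and then iterate over the lists to detect which columns don't match for a given record ID.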

library(tidyverse)

# build the lists
df1_list <- split(df1, df1$ID)
df2_list <- split(df2, df1$ID) # split by the same factor in case a level is not present in both; assumes df1 and df2 list the IDs in the same row order

# a custom function to test two dfs with same ID
columns_mismatched <- function(df1, df2) {

    df <- bind_rows(df1, df2) %>% select(-Coder)
    matches <- map_lgl(df, ~ length(unique(.)) != 1) # logical test if the length of unique values is not equal to 1 (meaning the values are mismatched)
    mis_matches <- matches[matches == TRUE] # keep only mismatches

    # return a tibble or df for easy binding in next step
    return(tibble(bad_cols = names(mis_matches) %>% paste(collapse = ";")))
}

map2_dfr(df1_list, df2_list, # similar to mapply() followed by do.call(rbind, list)
     ~ columns_mismatched(.x, .y), .id = "ID") # .x is an element of df1_list, .y the matching element of df2_list

# A tibble: 4 x 2
      ID             bad_cols
   <chr>                <chr>
1 T-A1-1                  num
2 T-A1-2 y/n;string;NAs;Dates
3 T-A1-3          y/n;NAs;num
4 T-A1-4     string;Dates;num

Answer 2 (score: 0)

Here is a solution that works by merging the data frames of the two coders:

temp = merge(dfboth[dfboth$Coder == "AAA",], dfboth[dfboth$Coder == "BBB",],
             by = "ID", all=TRUE)
# Get names of the columns that need to be checked for matches
cols_to_match = names(dfboth)[3:ncol(dfboth)]

# Convert NA to character to allow check of matching NA values
temp$NAs.x[is.na(temp$NAs.x)] = "<N/A>"
temp$NAs.y[is.na(temp$NAs.y)] = "<N/A>"


# Get matches, TRUE if match, FALSE if not match
results = data.frame(ID = temp$ID,
           temp[,c(paste0(cols_to_match,".x"))] == temp[,c(paste0(cols_to_match,".y"))])
names(results)[2:ncol(results)] = cols_to_match

# Column to indicate whether all fields match
results$all_match = apply(results[2:ncol(results)], 1, all)

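# List the mismatched columns per ID; index by name rather than position so this scales beyond five columns
results$col = apply(results[, cols_to_match], 1, function(x){ paste0(cols_to_match[!unlist(x)], collapse="; ")})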
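
The answer above replaces NA only in the NAs column. With 83 columns, it may be easier to use a comparison that handles NA in every column. The following is a rough sketch under stated assumptions: same_value is a hypothetical helper (not part of the answer or of any package), wide is a made-up stand-in for the merged temp data frame, and comparing values as character strings is assumed to be acceptable for the data at hand.

```r
# Hypothetical helper: TRUE when both values are equal or both are NA.
# Comparing as character sidesteps factor level mismatches (an assumption
# that string comparison is good enough for these columns).
same_value <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  (is.na(x) & is.na(y)) | (!is.na(x) & !is.na(y) & x == y)
}

# Made-up merged data in the same .x / .y layout that merge() produces
wide <- data.frame(
  ID    = c("T-1", "T-2"),
  NAs.x = c(NA, "hey"),
  NAs.y = c(NA, NA),
  num.x = c(0, 2),
  num.y = c(1, 2),
  stringsAsFactors = FALSE
)
cols <- c("NAs", "num")

# One logical column per compared field, with no NA in the result
matches <- sapply(cols, function(cl) {
  same_value(wide[[paste0(cl, ".x")]], wide[[paste0(cl, ".y")]])
})

# Collapse the names of the mismatched fields for each row
col <- apply(matches, 1, function(m) paste(cols[!m], collapse = "; "))
print(data.frame(ID = wide$ID, col = col))
```

Because same_value never returns NA, the later all() and paste0() steps don't need any special handling for missing values.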

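Answer 3 (score: 0)

Just for variety, here is a data.table solution. The first result is in a form better suited to further processing than to human reading, although it is still quite readable.

Edited: NA is now treated as matching NA.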

library(data.table)

setDT(dfboth)

dfboth[
  ,
  {
    is_different <- vapply(
      .SD,
      function(x) !identical(x[1], x[2]),
      logical(1)
    )
    list(mismatch = names(.SD)[is_different])
  },
  by = "ID"
][
  mismatch != "Coder"
]
#         ID mismatch
#  1: T-A1-1      num
#  2: T-A1-2      y/n
#  3: T-A1-2   string
#  4: T-A1-2      NAs
#  5: T-A1-2    Dates
#  6: T-A1-3      y/n
#  7: T-A1-3      NAs
#  8: T-A1-3      num
#  9: T-A1-4   string
# 10: T-A1-4    Dates
# 11: T-A1-4      num
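
The NA handling in this answer comes from identical(), which, unlike ==, always returns a single TRUE or FALSE and treats two NAs of the same type as equal. The following is a small sketch with made-up values, not the question's data:

```r
# == propagates missingness; identical() gives a definite answer
eq_na   <- NA == NA                          # NA
same_na <- identical(NA, NA)                 # TRUE
one_na  <- identical(NA_character_, "hey")   # FALSE

# The same test as in the answer, on a hypothetical two-row column
x <- c(NA, "hey")
is_different <- !identical(x[1], x[2])       # TRUE: one value is missing

print(c(eq_na = eq_na, same_na = same_na, one_na = one_na,
        is_different = is_different))
```

Note that identical() also compares type and attributes, which is fine here because both values always come from the same column.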

If you want the condensed form, just add some extra formatting code in the second subset.

dfboth[
  ,
  {
    is_different <- vapply(
      .SD,
      function(x) !identical(x[1], x[2]),
      logical(1)
    )
    list(mismatch = names(.SD)[is_different])
  },
  by = "ID"
][
  mismatch != "Coder",
  list(col = paste0(mismatch, collapse = "; ")),
  by = "ID"
]
#        ID                     col
# 1: T-A1-1                     num
# 2: T-A1-2 y/n; string; NAs; Dates
# 3: T-A1-3           y/n; NAs; num
# 4: T-A1-4      string; Dates; num