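I am comparing two data frames that contain the same information but were entered by different people. Wherever they disagree, I need to go back to the physical record and verify which answer is correct.
My goal is to identify the columns in which rows with the same ID do not match, and then build a data frame that gives me each row ID together with the columns that disagree for that ID. That will make going back to the physical documents much easier. I have already cleaned the data so that only the rows I know contain inconsistencies remain. In case you are wondering how I did that, I used the dupsBetweenGroups function found here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/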
I have put together an example of the situation I'm working with below:
df1 <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"), rep("AAA", 4), c("Yes", "No", "Yes", "No"), c("", "family present", "present", ""), c(NA, NA, "hey", "hey"), as.Date(c("1jan2017", "2jan2017", "31mar2017", "30jul2017"), "%d%b%Y"), c(0, 2, 3, 4))
names(df1) <- c("ID", "Coder", "y/n", "string","NAs", "Dates", "num")
      ID Coder y/n         string  NAs      Dates num
1 T-A1-1   AAA Yes                <NA> 2017-01-01   0
2 T-A1-2   AAA  No family present <NA> 2017-01-02   2
3 T-A1-3   AAA Yes        present  hey 2017-03-31   3
4 T-A1-4   AAA  No                 hey 2017-07-30   4
df2 <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"), rep("BBB", 4), c("Yes", "Yes", "No", "No"), c("", "family is present", "present", "random"), c(NA, "hey", NA, "hey"), as.Date(c("1jan2017", "3jan2017", "31mar2017", "29jul2017"), "%d%b%Y"), c(1, 2, 5, 6))
names(df2) <- c("ID", "Coder", "y/n", "string","NAs", "Dates", "num")
      ID Coder y/n            string  NAs      Dates num
1 T-A1-1   BBB Yes                   <NA> 2017-01-01   1
2 T-A1-2   BBB Yes family is present  hey 2017-01-03   2
3 T-A1-3   BBB  No           present <NA> 2017-03-31   5
4 T-A1-4   BBB  No            random  hey 2017-07-29   6
Because of the way I cleaned the data, the two data frames are combined and ordered by ID. Each data frame still has a Coder column that lets me see which one a row originally came from (it also makes the two data frames easy to separate again if I need to). The Coder column does not need to be compared: rows with the same ID will always differ there because they come from two different data frames. That said, the data frame I'm actually starting with looks more like this:
dfboth <- rbind(df1, df2)
dfboth <- dfboth[with(dfboth, order(ID)), ]
      ID Coder y/n            string  NAs      Dates num
1 T-A1-1   AAA Yes                   <NA> 2017-01-01   0
5 T-A1-1   BBB Yes                   <NA> 2017-01-01   1
2 T-A1-2   AAA  No    family present <NA> 2017-01-02   2
6 T-A1-2   BBB Yes family is present  hey 2017-01-03   2
3 T-A1-3   AAA Yes           present  hey 2017-03-31   3
7 T-A1-3   BBB  No           present <NA> 2017-03-31   5
4 T-A1-4   AAA  No                    hey 2017-07-30   4
8 T-A1-4   BBB  No            random  hey 2017-07-29   6
Ideally I would get a data frame that tells me each row ID and the columns in which its two rows do not match. Something similar to this (note: I'm flexible about how the results are displayed, so it doesn't need to look exactly like this):
results <- data.frame(c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"), c("num", "y/n; string; NAs; Dates", "y/n; NAs; num", "string; Dates; num"))
names(results) <- c("ID", "col")
      ID                     col
1 T-A1-1                     num
2 T-A1-2 y/n; string; NAs; Dates
3 T-A1-3           y/n; NAs; num
4 T-A1-4      string; Dates; num
I have been combing through different forums for a way to do this, to no avail. I have been thinking about a nested if function, but with 83 columns it got out of hand quickly. Any ideas on how to tackle this would be much appreciated.
R version 3.4.1
Answer 0 (score: 2)
You can achieve this with tidyr's gather, dplyr's lead, and paste.
library(tidyr)
library(dplyr)
results <- dfboth %>%
  gather(key, value, -ID, -Coder) %>%
  group_by(ID, key) %>%
  mutate(next.value = lead(value, order_by = Coder)) %>%
  filter(Coder == "AAA") %>%
  filter(value != next.value | ((is.na(value) + is.na(next.value)) == 1)) %>%
  select(ID, key) %>%
  group_by(ID) %>%
  summarise(col = paste(key, collapse = ";")) %>%
  arrange(ID)
ID col
T-A1-1 num
T-A1-2 y/n;string;NAs;Dates
T-A1-3 y/n;NAs;num
T-A1-4 string;Dates;num
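In current tidyr, gather() is superseded by pivot_longer(), so the same idea can be sketched with the newer verbs. The version below is only an illustration under a few assumptions: a reasonably recent dplyr and tidyr are installed, the coders are labelled exactly AAA and BBB, and the small dfboth (with a made-up yn column) stands in for the real data. All values are converted to character first so that columns of different types can share one value column.

```r
library(dplyr)
library(tidyr)

# Small stand-in for dfboth; the values here are illustrative only
dfboth <- data.frame(
  ID    = c("T-A1-1", "T-A1-1", "T-A1-2", "T-A1-2"),
  Coder = c("AAA", "BBB", "AAA", "BBB"),
  yn    = c("Yes", "Yes", "No", "Yes"),
  NAs   = c(NA, NA, NA, "hey"),
  num   = c(0, 1, 2, 2),
  stringsAsFactors = FALSE
)

results <- dfboth %>%
  # one common type so every column fits in a single "value" column
  mutate(across(-c(ID, Coder), as.character)) %>%
  # long format: one row per ID / Coder / column name
  pivot_longer(-c(ID, Coder), names_to = "key", values_to = "value") %>%
  # put the two coders side by side: columns AAA and BBB
  pivot_wider(names_from = Coder, values_from = value) %>%
  # keep a row if exactly one side is NA, or both are present and differ
  filter(is.na(AAA) != is.na(BBB) |
           (!is.na(AAA) & !is.na(BBB) & AAA != BBB)) %>%
  group_by(ID) %>%
  summarise(col = paste(key, collapse = ";"))

print(results)
```

Putting the coders side by side avoids relying on row order within each group, at the cost of hard-coding the two coder labels in the filter.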
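Answer 1 (score: 1)
We can build two lists (one per coder) whose elements correspond to the record IDs, and then iterate over the lists to detect which columns do not match for a given record ID.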
library(tidyverse)
# build the lists
df1_list <- split(df1, df1$ID)
df2_list <- split(df2, df1$ID) # using the same factor to split in case one level is not present in both (assumes df2's rows are in the same ID order as df1's)
# a custom function to test two dfs with the same ID
columns_mismatched <- function(df1, df2) {
  df <- bind_rows(df1, df2) %>% select(-Coder)
  mismatched <- map_lgl(df, ~ length(unique(.)) != 1) # TRUE when a column holds more than one unique value, i.e. the two rows disagree
  mis_matches <- mismatched[mismatched] # keep only the mismatches
  # return a tibble or df for easy binding in the next step
  return(tibble(bad_cols = names(mis_matches) %>% paste(collapse = ";")))
}
map2_dfr(df1_list, df2_list, # similar to mapply() followed by do.call(rbind, ...)
         ~ columns_mismatched(.x, .y), .id = "ID") # .x is the element from df1_list, .y the element from df2_list
# A tibble: 4 x 2
ID bad_cols
<chr> <chr>
1 T-A1-1 num
2 T-A1-2 y/n;string;NAs;Dates
3 T-A1-3 y/n;NAs;num
4 T-A1-4 string;Dates;num
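Pairing the two lists element by element only works when both coders cover exactly the same IDs. A quick check along the following lines could be run first; it is a sketch only, and the ID vectors are hypothetical stand-ins for df1$ID and df2$ID.

```r
# Hypothetical ID vectors standing in for df1$ID and df2$ID
ids1 <- c("T-A1-1", "T-A1-2", "T-A1-3")
ids2 <- c("T-A1-2", "T-A1-3", "T-A1-4")

only_in_1 <- setdiff(ids1, ids2)    # IDs entered by the first coder only
only_in_2 <- setdiff(ids2, ids1)    # IDs entered by the second coder only
common    <- intersect(ids1, ids2)  # IDs that can actually be compared

# anyDuplicated() returns 0 when every ID occurs once within a coder
has_dups <- anyDuplicated(ids1) > 0 || anyDuplicated(ids2) > 0

print(only_in_1)
print(only_in_2)
print(common)
print(has_dups)
```

Restricting both data frames to the common IDs before splitting keeps each pair of list elements aligned on the same record.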
Answer 2 (score: 0)
Here is a solution that works by merging the rows from the two coders side by side:
temp = merge(dfboth[dfboth$Coder == "AAA",], dfboth[dfboth$Coder == "BBB",],
             by = "ID", all = TRUE)
# Get names of the columns that need to be checked for matches
cols_to_match = names(dfboth)[3:ncol(dfboth)]
# Convert NA to character to allow check of matching NA values
temp$NAs.x[is.na(temp$NAs.x)] = "<N/A>"
temp$NAs.y[is.na(temp$NAs.y)] = "<N/A>"
# Get matches, TRUE if match, FALSE if not match
results = data.frame(ID = temp$ID,
                     temp[, c(paste0(cols_to_match, ".x"))] == temp[, c(paste0(cols_to_match, ".y"))])
names(results)[2:ncol(results)] = cols_to_match
# Column to indicate whether all fields match
results$all_match = apply(results[2:ncol(results)], 1, all)
# Select the compared columns by name rather than by position, so this also works with more than five columns
results$col = apply(results[, cols_to_match], 1, function(x){ paste0(cols_to_match[!unlist(x)], collapse = "; ")})
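The code above replaces NA only in the NAs column, so an NA in any other column would still produce an NA comparison. The following base R sketch shows one way to apply an NA-aware test to every compared column. It is an illustration under stated assumptions, not part of the answer: differs() is a hypothetical helper, values are compared as character strings, each ID occurs once per coder, and the data frames are rebuilt with stringsAsFactors = FALSE and ISO dates so the block runs on its own.

```r
df1 <- data.frame(
  ID = c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"), Coder = "AAA",
  `y/n` = c("Yes", "No", "Yes", "No"),
  string = c("", "family present", "present", ""),
  NAs = c(NA, NA, "hey", "hey"),
  Dates = as.Date(c("2017-01-01", "2017-01-02", "2017-03-31", "2017-07-30")),
  num = c(0, 2, 3, 4),
  check.names = FALSE, stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c("T-A1-1", "T-A1-2", "T-A1-3", "T-A1-4"), Coder = "BBB",
  `y/n` = c("Yes", "Yes", "No", "No"),
  string = c("", "family is present", "present", "random"),
  NAs = c(NA, "hey", NA, "hey"),
  Dates = as.Date(c("2017-01-01", "2017-01-03", "2017-03-31", "2017-07-29")),
  num = c(1, 2, 5, 6),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a and b differ; NA vs NA counts as a match
differs <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
}

a <- df1[order(df1$ID), ]
b <- df2[match(a$ID, df2$ID), ]              # align df2's rows to df1's IDs
cols <- setdiff(names(a), c("ID", "Coder"))  # every column except ID and Coder

# Logical matrix: one row per ID, one column per compared variable
mism <- do.call(cbind, lapply(cols, function(cn) differs(a[[cn]], b[[cn]])))

result <- data.frame(
  ID  = a$ID,
  col = apply(mism, 1, function(r) paste(cols[r], collapse = "; ")),
  stringsAsFactors = FALSE
)
print(result)
```

Because cols is derived from the column names, the same code covers 83 columns without listing them.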
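Answer 3 (score: 0)
Just for variety, here is a data.table solution. The first result is in a form better suited to further processing than to human reading, although it is still quite readable.
Edited: NA is now treated as matching NA.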
library(data.table)
setDT(dfboth)
dfboth[
  ,
  {
    is_different <- vapply(
      .SD,
      function(x) !identical(x[1], x[2]),
      logical(1)
    )
    list(mismatch = names(.SD)[is_different])
  },
  by = "ID"
][
  mismatch != "Coder"
]
# ID mismatch
# 1: T-A1-1 num
# 2: T-A1-2 y/n
# 3: T-A1-2 string
# 4: T-A1-2 NAs
# 5: T-A1-2 Dates
# 6: T-A1-3 y/n
# 7: T-A1-3 NAs
# 8: T-A1-3 num
# 9: T-A1-4 string
# 10: T-A1-4 Dates
# 11: T-A1-4 num
If you need the condensed form, just add some extra formatting code in the second subset.
dfboth[
  ,
  {
    is_different <- vapply(
      .SD,
      function(x) !identical(x[1], x[2]),
      logical(1)
    )
    list(mismatch = names(.SD)[is_different])
  },
  by = "ID"
][
  mismatch != "Coder",
  list(col = paste0(mismatch, collapse = "; ")),
  by = "ID"
]
# ID col
# 1: T-A1-1 num
# 2: T-A1-2 y/n; string; NAs; Dates
# 3: T-A1-3 y/n; NAs; num
# 4: T-A1-4 string; Dates; num
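The reason identical() treats NA as matching NA, while == does not, can be seen in a small base R sketch. The vectors here are illustrative only.

```r
x <- c(NA_character_, NA_character_)  # both coders left the field empty
y <- c("hey", NA_character_)          # only one coder filled it in

eq_result <- x[1] == x[2]             # NA: == cannot decide whether two NAs are equal
same_na   <- identical(x[1], x[2])    # TRUE: identical() compares the objects themselves
diff_na   <- !identical(y[1], y[2])   # TRUE: a value versus NA is a mismatch

print(eq_result)
print(same_na)
print(diff_na)
```

Note that identical() is also strict about type, so identical(1L, 1) is FALSE. That does not matter here because both values come from the same column.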