Identifying specific differences between two data sets in R

Time: 2014-12-11 18:09:20

Tags: r

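I want to compare two data sets and identify the specific instances of discrepancies between them (i.e., which variables differ).

While I have found out how to identify which records are not identical between the two data sets (using the function detailed here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/), I'm not sure how to flag which variables are different.

For example:

Data set A: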

id      name        dob       vaccinedate  vaccinename  dose
100000  John Doe    1/1/2000  5/20/2012    MMR          4
100001  Jane Doe    7/3/2011  3/14/2013    VARICELLA    1

Data set B:

id      name        dob       vaccinedate  vaccinename  dose
100000  John Doe    1/1/2000  5/20/2012    MMR          3
100001  Jane Doee   7/3/2011  3/24/2013    VARICELLA    1
100002  John Smith  2/5/2010  7/13/2013    HEPB         3

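I want to identify which records are different and which specific variables have discrepancies. For example, the John Doe record has 1 discrepancy, in dose, and the Jane Doe record has 2 discrepancies: name and vaccinedate. Data set B also has one additional record that is not in data set A, and I want to identify those instances as well.

In the end, the goal is to find the frequency of each "type" of error, e.g., how many records have a discrepancy in vaccine date, vaccine name, dose, etc.

Thanks!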

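4 answers:

Answer 0 (score: 7)

This should get you started, though there may be more elegant solutions.

First, set up df1 and df2 so that others can reproduce this quickly: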

df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))

df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))

Next, get the discrepancies between df1 and df2 with mapply and setdiff, that is, the values in the first set that are not in the second:

discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
# 
# $name
# [1] "Jane Doe"
# 
# $dob
# character(0)
# 
# $vaccinedate
# [1] "3/14/2013"
# 
# $vaccinename
# character(0)
# 
# $dose
# [1] 4

To count them, we can use sapply:

num.discrep <- sapply(discrep, length)
num.discrep
# id        name         dob vaccinedate vaccinename        dose 
# 0           1           0           1           0           1 
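One thing to keep in mind is that setdiff compares each column as a set of values, not row by row. The sketch below uses two small made-up data frames (a and b are hypothetical, not the df1 and df2 above) in which the dose values are swapped between records: the set-based count reports nothing, while a row-aligned comparison flags both rows.

```r
# Hypothetical toy data: same ids, dose values swapped between the two records
a <- data.frame(id = 1:2, dose = c(1L, 3L))
b <- data.frame(id = 1:2, dose = c(3L, 1L))

# Set-based view: every value of a$dose also occurs somewhere in b$dose
set_based <- lengths(Map(setdiff, a, b))

# Row-aligned view: compare cell by cell (assumes rows are in the same order)
row_based <- colSums(a != b)

set_based
row_based
```

So the setdiff counts are a quick first check, but a per-record answer needs the rows to be matched on id first.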

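As for your question about finding the IDs that are in the second set but not in the first, you can reverse the process with mapply(setdiff, df2, df1), or, if you only care about the ids, you can simply use setdiff(df2$id, df1$id).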
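As a minimal sketch of the id-only check in both directions, using plain vectors that mirror the ids in the question (ids_a and ids_b are illustrative names, not part of the original code):

```r
# ids as they appear in data sets A and B in the question
ids_a <- c(100000L, 100001L)
ids_b <- c(100000L, 100001L, 100002L)

# Records present in B but missing from A
only_in_b <- setdiff(ids_b, ids_a)

# Records present in A but missing from B (none in this example)
only_in_a <- setdiff(ids_a, ids_b)

only_in_b
only_in_a
```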

For more on R's functional functions (e.g., mapply, sapply, lapply, etc.), see this post.

Update with a purrr solution:

library(purrr)  # provides map2() and map_int(), and re-exports %>%

map2(df1, df2, setdiff) %>% 
  map_int(length)

Answer 1 (score: 2)

One possibility. First, find out which IDs the two data sets have in common. The simplest way to do that is:

commonID<-intersect(A$id,B$id)

Then you can determine which rows are missing from A:

B[!B$id %in% commonID,]
#       id       name      dob vaccinedate vaccinename dose
# 3 100002 John Smith 2/5/2010   7/13/2013        HEPB    3

Next, you can restrict both data sets to the IDs they have in common.

Acommon<-A[A$id %in% commonID,]
Bcommon<-B[B$id %in% commonID,]

If you can't assume that the IDs are in the same order, sort them:

Acommon<-Acommon[order(Acommon$id),]
Bcommon<-Bcommon[order(Bcommon$id),]
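One more preparation step may be needed if the text columns are stored as factors, as they are in the df1 and df2 built with structure() in the other answer. As far as I know, comparing two factors whose level sets differ raises an error in R, so converting factor columns to character first is a safe default. The helper below is a hypothetical sketch (to_char is my own name, not a base R function):

```r
# Hypothetical helper: convert every factor column of a data frame to character
to_char <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

# Toy example: the name factors have different level sets in f1 and f2
f1 <- data.frame(name = factor(c("John Doe", "Jane Doe")), dose = c(4L, 1L))
f2 <- data.frame(name = factor(c("John Doe", "Jane Doee")), dose = c(3L, 1L))

# After conversion, the cell-by-cell comparison works on plain strings
d <- to_char(f1) != to_char(f2)
d
```

If the data were read with stringsAsFactors = FALSE (the default since R 4.0), this step should not be necessary.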

Now you can see which fields differ, like this:

diffs<-Acommon != Bcommon
diffs
#      id  name   dob vaccinedate vaccinename  dose
# 1 FALSE FALSE FALSE       FALSE       FALSE  TRUE
# 2 FALSE  TRUE FALSE        TRUE       FALSE FALSE

This is a logical matrix, and you can do whatever you like with it. For example, to find the total number of errors in each column:

colSums(diffs)
#         id        name         dob vaccinedate vaccinename        dose 
#          0           1           0           1           0           1 

To find all the IDs where the name differs:

Acommon$id[diffs[,"name"]]
# [1] 100001

And so on.
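Putting the steps together, here is a self-contained sketch that lists the differing variables per record and tallies each error type. It rebuilds a reduced version of the question's data with character columns (the dob and vaccinename columns are left out for brevity), and it aligns rows with match() instead of sorting; both are my own choices for the illustration.

```r
# Reduced copies of the question's data, with strings kept as character
A <- data.frame(
  id = c(100000L, 100001L),
  name = c("John Doe", "Jane Doe"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  dose = c(4L, 1L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Records that appear in only one of the data sets
only_in_B <- setdiff(B$id, A$id)
only_in_A <- setdiff(A$id, B$id)

# Align the shared records by id, then compare cell by cell
common <- intersect(A$id, B$id)
Ac <- A[match(common, A$id), ]
Bc <- B[match(common, B$id), ]
diffs <- Ac != Bc

# Names of the differing variables for each shared record, keyed by id
diff_vars <- lapply(seq_len(nrow(diffs)), function(i) colnames(diffs)[diffs[i, ]])
names(diff_vars) <- Ac$id

# How many records differ in each variable (the error "type" frequency)
error_type_freq <- colSums(diffs)

diff_vars[["100000"]]  # "dose"
diff_vars[["100001"]]  # "name" "vaccinedate"
error_type_freq
```

Note that != returns NA where either value is missing, so real data with NAs would need extra handling before counting.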

Answer 2 (score: 0)

There is a newer package called waldo:

install.packages("waldo")
library(waldo)

# construct the data frames


df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))

df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))

# compare them
compare(df1,df2)

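The report has the following form (note that this sample output comes from a different pair of data frames, with columns X, Y and Z, not from df1 and df2 above):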

`old` is length 2
`new` is length 3

`names(old)`: "X" "Y"    
`names(new)`: "X" "Y" "Z"

`attr(old, 'row.names')`: 1 2 3  
`attr(new, 'row.names')`: 1 2 3 4

`old$X`: 1 2 3  
`new$X`: 1 2 3 4

`old$Y`: "a" "b" "c"    
`new$Y`: "A" "b" "c" "d"

`old$Z` is absent
`new$Z` is a character vector ('k', 'l', 'm', 'n')

Answer 3 (score: -1)

library(compareDF)

compare_df(dataframe1, dataframe2, c("columnname"))