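I want to compare two datasets and identify the specific instances of discrepancies between them (i.e., which variables differ).
Although I have found out how to identify which records are not identical between the two datasets (using the function detailed here: http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/), I am not sure how to flag which variables differ.
E.g.:
Dataset A: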
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 4
100001 Jane Doe 7/3/2011 3/14/2013 VARICELLA 1
Dataset B:
id name dob vaccinedate vaccinename dose
100000 John Doe 1/1/2000 5/20/2012 MMR 3
100001 Jane Doee 7/3/2011 3/24/2013 VARICELLA 1
100002 John Smith 2/5/2010 7/13/2013 HEPB 3
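I want to identify which records differ and which specific variables have discrepancies. For example, the John Doe record has 1 discrepancy, in dose, and the Jane Doe record has 2 discrepancies: name and vaccinedate. In addition, dataset B has one additional record that is not in dataset A, and I want to identify those instances as well.
Ultimately, the goal is to find the frequency of each "type" of error, e.g. how many records have a discrepancy in vaccinedate, vaccinename, dose, etc.
Thanks!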
Answer 0 (score: 7)
This should get you started, but there may be more elegant solutions.
First, set up df1 and df2 so that others can reproduce this quickly:
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
Next, use mapply with setdiff to get the differences between df1 and df2, that is, the values in the first set that are not in the second:
discrep <- mapply(setdiff, df1, df2)
discrep
# $id
# integer(0)
#
# $name
# [1] "Jane Doe"
#
# $dob
# character(0)
#
# $vaccinedate
# [1] "3/14/2013"
#
# $vaccinename
# character(0)
#
# $dose
# [1] 4
To count them, we can use sapply:
num.discrep <- sapply(discrep, length)
num.discrep
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
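As for your question about finding the IDs that are in the second set but not in the first, you can reverse the process with mapply(setdiff, df2, df1), or, if you only care about the ids, simply run setdiff(df2$id, df1$id).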
For more information on R's functional functions (e.g., mapply, sapply, lapply, etc.), see this post.
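The reversed comparison described above can be sketched as follows. This is a minimal illustration that rebuilds the question's data with character columns instead of factors (an assumption made for readability); the names rev_discrep and extra_ids are hypothetical. Note that setdiff treats each column as a set of values, so the counts reflect distinct values found only in df2, not the number of affected rows.

```r
# Rebuild the example data with plain character columns (assumption: no factors)
df1 <- data.frame(
  id = c(100000L, 100001L),
  name = c("John Doe", "Jane Doe"),
  dob = c("1/1/2000", "7/3/2011"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  vaccinename = c("MMR", "VARICELLA"),
  dose = c(4L, 1L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  dob = c("1/1/2000", "7/3/2011", "2/5/2010"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  vaccinename = c("MMR", "VARICELLA", "HEPB"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Values present in df2 but not in df1, column by column.
# SIMPLIFY = FALSE guarantees a list even if every column yields one value.
rev_discrep <- mapply(setdiff, df2, df1, SIMPLIFY = FALSE)

# Number of distinct values per column that appear only in df2
lengths(rev_discrep)

# IDs that exist only in df2
extra_ids <- setdiff(df2$id, df1$id)
extra_ids
```

Because the comparison is column-wise, it does not say which record a given value belongs to; a row-wise comparison keyed on id is needed for that.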
Update with a purrr solution:
library(purrr)
map2(df1, df2, setdiff) %>%
  map_int(length)
Answer 1 (score: 2)
Here is one possibility. First, find which IDs the two datasets have in common. The simplest way is:
commonID<-intersect(A$id,B$id)
Then you can determine which rows are missing from A with:
B[!B$id %in% commonID,]
# id name dob vaccinedate vaccinename dose
# 3 100002 John Smith 2/5/2010 7/13/2013 HEPB 3
Next, restrict both datasets to the IDs they have in common.
Acommon<-A[A$id %in% commonID,]
Bcommon<-B[B$id %in% commonID,]
If you cannot assume the IDs are in the same order, sort them:
Acommon<-Acommon[order(Acommon$id),]
Bcommon<-Bcommon[order(Bcommon$id),]
Now you can see which fields differ like this:
diffs<-Acommon != Bcommon
diffs
# id name dob vaccinedate vaccinename dose
# 1 FALSE FALSE FALSE FALSE FALSE TRUE
# 2 FALSE TRUE FALSE TRUE FALSE FALSE
This is a logical matrix, and you can do whatever you like with it. For example, to find the total number of errors in each column:
colSums(diffs)
# id name dob vaccinedate vaccinename dose
# 0 1 0 1 0 1
To find all the IDs where the name differs:
Acommon$id[diffs[,"name"]]
# [1] 100001
And so on.
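Putting the steps above together, here is a minimal end-to-end sketch on the question's data. It assumes A and B hold character columns rather than factors, since comparing factor columns whose level sets differ raises an error in R. The names n_diff, which_diff and only_in_B are hypothetical, introduced here for illustration.

```r
# Question data with character columns (assumption: stringsAsFactors = FALSE)
A <- data.frame(
  id = c(100000L, 100001L),
  name = c("John Doe", "Jane Doe"),
  dob = c("1/1/2000", "7/3/2011"),
  vaccinedate = c("5/20/2012", "3/14/2013"),
  vaccinename = c("MMR", "VARICELLA"),
  dose = c(4L, 1L),
  stringsAsFactors = FALSE
)
B <- data.frame(
  id = c(100000L, 100001L, 100002L),
  name = c("John Doe", "Jane Doee", "John Smith"),
  dob = c("1/1/2000", "7/3/2011", "2/5/2010"),
  vaccinedate = c("5/20/2012", "3/24/2013", "7/13/2013"),
  vaccinename = c("MMR", "VARICELLA", "HEPB"),
  dose = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Records present only in B
only_in_B <- B$id[!B$id %in% A$id]

# Align both datasets on their shared IDs, sorted by id
commonID <- intersect(A$id, B$id)
Acommon <- A[A$id %in% commonID, ]
Bcommon <- B[B$id %in% commonID, ]
Acommon <- Acommon[order(Acommon$id), ]
Bcommon <- Bcommon[order(Bcommon$id), ]

# Logical matrix: TRUE where a field differs between A and B
diffs <- Acommon != Bcommon

# Number of differing fields per record, named by id
n_diff <- setNames(rowSums(diffs), Acommon$id)

# Names of the differing variables per record, as a list named by id
which_diff <- setNames(
  lapply(seq_len(nrow(diffs)), function(i) colnames(diffs)[diffs[i, ]]),
  Acommon$id
)

# Frequency of each error "type" (records differing per variable)
colSums(diffs)
```

Here which_diff answers "which variables differ for each record", n_diff gives the per-record count, and colSums(diffs) gives the per-variable frequency the question asks about.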
Answer 2 (score: 0)
There is a new package called waldo:
install.packages("waldo")
library(waldo)
# construct the data frames
df1 <- structure(list(id = 100000:100001, name = structure(c(2L, 1L), .Label = c("Jane Doe","John Doe"), class = "factor"), dob = structure(1:2, .Label = c("1/1/2000", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L), .Label = c("3/14/2013", "5/20/2012"), class = "factor"), vaccinename = structure(1:2, .Label = c("MMR", "VARICELLA"), class = "factor"), dose = c(4L, 1L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = 100000:100002, name = structure(c(2L, 1L, 3L), .Label = c("Jane Doee", "John Doe", "John Smith"), class = "factor"), dob = structure(c(1L, 3L, 2L), .Label = c("1/1/2000", "2/5/2010", "7/3/2011"), class = "factor"), vaccinedate = structure(c(2L, 1L, 3L), .Label = c("3/24/2013", "5/20/2012", "7/13/2013"), class = "factor"), vaccinename = structure(c(2L, 3L, 1L), .Label = c("HEPB", "MMR", "VARICELLA"), class = "factor"), dose = c(3L, 1L, 3L)), .Names = c("id", "name", "dob", "vaccinedate", "vaccinename", "dose"), class = "data.frame", row.names = c(NA, -3L))
# compare them
compare(df1,df2)
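We get output of the following form (the sample below comes from a different pair of data frames, old and new, with columns X, Y and Z, not from df1 and df2):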
`old` is length 2
`new` is length 3
`names(old)`: "X" "Y"
`names(new)`: "X" "Y" "Z"
`attr(old, 'row.names')`: 1 2 3
`attr(new, 'row.names')`: 1 2 3 4
`old$X`: 1 2 3
`new$X`: 1 2 3 4
`old$Y`: "a" "b" "c"
`new$Y`: "A" "b" "c" "d"
`old$Z` is absent
`new$Z` is a character vector ('k', 'l', 'm', 'n')
Answer 3 (score: -1)
library(compareDF)
compare_df(dataframe1, dataframe2, c("columnname"))