I have two datasets, Dataset1 and Dataset2, shown below.
Dataset1:
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
Dataset2:
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
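The number of members recorded in dataset1 does not always match the number of individual records in dataset2. For example, for family ID 2, dataset1 says there are 3 members, but dataset2 contains entries for only 2. How can I identify these kinds of mismatches between the two datasets?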
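Answer 0 (score: 1)
We can use count to count the number of family members and create a new data frame, df3, then use setequal to compare df1 and df3. Here df1 and df2 stand for Dataset1 and Dataset2.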
library(dplyr)
df3 <- df2 %>%
count(family_id, house_id) %>%
rename(number_family_member = n)
setequal(df1, df3)
# FALSE: Rows in x but not y: 2, 3. Rows in y but not x: 2, 3.
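setequal only reports whether the two data frames differ. As a minimal sketch of how to list which families differ, the same df3 can be passed to dplyr's anti_join. The data frames are rebuilt here from the question's tables so the block runs on its own; the name mismatched is only illustrative.

```r
library(dplyr)

# Sample data from the question (df1 = Dataset1, df2 = Dataset2)
df1 <- data.frame(
  family_id = c(1L, 2L, 3L),
  house_id = c(1052L, 5042L, 1111L),
  number_family_member = c(2L, 3L, 2L)
)
df2 <- data.frame(
  family_id = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  house_id = c(1052L, 1052L, 5042L, 5042L, 1111L, 1111L, 1111L),
  age = c(24L, 25L, 23L, 20L, 1L, 20L, 21L),
  gender = c("male", "female", "male", "female", "male", "female", "female")
)

# Count the records per family actually present in df2
df3 <- df2 %>%
  count(family_id, house_id) %>%
  rename(number_family_member = n)

# Keep the rows of df1 that have no identical row in df3,
# i.e. families whose declared size differs from the counted size
mismatched <- anti_join(
  df1, df3,
  by = c("family_id", "house_id", "number_family_member")
)
mismatched
```

With the sample data, mismatched keeps the rows for families 2 and 3.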
Answer 1 (score: 1)
This can be done with aggregate and merge.
agg <- aggregate(family_id ~ factor(family_id), dataset2, length)
mrg <- merge(agg, dataset1[c(1, 3)], by.x = "factor(family_id)", by.y = "family_id")
result <- data.frame(family_id = dataset1$family_id)
result$Match <- ifelse(dataset1$number_family_member == mrg$family_id, "match", "mismatch")
result
# family_id Match
#1 1 match
#2 2 mismatch
#3 3 mismatch
rm(agg, mrg) # final clean up
Data:
dataset1 <- read.table(text = "
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
", header = TRUE)
dataset2 <- read.table(text = "
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
", header = TRUE)
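The comparison above lines up dataset1 and mrg by row position, so it appears to assume both are in the same family_id order. As a hedged base-R alternative, the sketch below looks up each family's count by ID with table and match instead. The names counts, idx, observed, and check are illustrative, and treating a family with no rows in dataset2 as having 0 members is an assumption.

```r
dataset1 <- read.table(text = "
family_id house_id number_family_member
1 1052 2
2 5042 3
3 1111 2
", header = TRUE)

dataset2 <- read.table(text = "
family_id house_id age gender
1 1052 24 male
1 1052 25 female
2 5042 23 male
2 5042 20 female
3 1111 1 male
3 1111 20 female
3 1111 21 female
", header = TRUE)

# Number of records per family_id in dataset2
counts <- table(dataset2$family_id)

# Look up each dataset1 family by ID rather than by row position
idx <- match(as.character(dataset1$family_id), names(counts))
observed <- as.vector(counts)[idx]

# Assumption: a family missing from dataset2 counts as 0 members
observed[is.na(observed)] <- 0L

check <- data.frame(
  family_id = dataset1$family_id,
  expected = dataset1$number_family_member,
  observed = observed,
  match = dataset1$number_family_member == observed
)
check
```

Because the lookup is keyed on family_id, the result does not depend on how either data set is sorted.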
Answer 2 (score: 0)
Either of these two views may help. The first keeps one row per person, the second one row per family:
library(dplyr)
dataset2 %>%
add_count(family_id) %>%
inner_join(dataset1) %>%
mutate(match= n ==number_family_member)
# # A tibble: 7 x 7
# family_id house_id age gender n number_family_member match
# <int> <int> <int> <fctr> <int> <int> <lgl>
# 1 1 1052 24 male 2 2 TRUE
# 2 1 1052 25 female 2 2 TRUE
# 3 2 5042 23 male 2 3 FALSE
# 4 2 5042 20 female 2 3 FALSE
# 5 3 1111 1 male 3 2 FALSE
# 6 3 1111 20 female 3 2 FALSE
# 7 3 1111 21 female 3 2 FALSE
dataset2 %>%
count(family_id) %>%
inner_join(dataset1) %>%
mutate(match= n ==number_family_member)
# # A tibble: 3 x 5
# family_id n house_id number_family_member match
# <int> <int> <int> <int> <lgl>
# 1 1 2 1052 2 TRUE
# 2 2 2 5042 3 FALSE
# 3 3 3 1111 2 FALSE
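Building on the per-family view, the sketch below keeps only the mismatching families. It swaps inner_join for full_join so that a family present in only one of the two data sets would also be reported; that swap, the column name n_observed, and treating a family with no rows in dataset2 as 0 members are assumptions, not part of the answer above.

```r
library(dplyr)

# Sample data from the question
dataset1 <- data.frame(
  family_id = c(1L, 2L, 3L),
  house_id = c(1052L, 5042L, 1111L),
  number_family_member = c(2L, 3L, 2L)
)
dataset2 <- data.frame(
  family_id = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  house_id = c(1052L, 1052L, 5042L, 5042L, 1111L, 1111L, 1111L),
  age = c(24L, 25L, 23L, 20L, 1L, 20L, 21L),
  gender = c("male", "female", "male", "female", "male", "female", "female")
)

mismatches <- dataset2 %>%
  # One row per family with the number of records found
  count(family_id, name = "n_observed") %>%
  # full_join keeps families that appear in only one data set
  full_join(dataset1, by = "family_id") %>%
  # Assumption: no records in dataset2 means 0 observed members
  mutate(n_observed = coalesce(n_observed, 0L)) %>%
  # Keep families whose declared and observed sizes disagree,
  # or that are missing from dataset1 altogether
  filter(is.na(number_family_member) | n_observed != number_family_member)

mismatches
```

With the sample data this returns the rows for families 2 and 3.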