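Check whether values in one data frame are in another (larger) data frame

Time: 2013-06-13 10:49:23

Tags: r dataframe

I'm struggling to come up with a vectorized solution to the following problem. I have two data frames: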

> people <- data.frame(name = c('Fred', 'Bob'), profession = c('Builder', 'Baker'))
> people
  name profession
1 Fred    Builder
2  Bob      Baker

> allowed <- data.frame(name = c('Fred', 'Fred', 'Bob', 'Bob'), profession = c('Builder', 'Baker', 'Barman', 'Biker'))
> allowed
  name profession
1 Fred    Builder
2 Fred      Baker
3  Bob     Barman
4  Bob      Biker

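That is, I want to check that every person has an allowed profession, and return anyone who does not.

For example, Fred can be a Builder or a Baker, so he is fine. Bob, however, can be a Barman or a Biker, but not a Baker (note: in my use case there are only ever two allowed professions per person).

I want to return a data frame of the names that do not have an allowed profession, together with what they are permitted to be: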

  name profession permitted
1  Bob      Baker     Biker
2  Bob      Baker    Barman

Thanks for your help.

4 answers:

Answer 0: (score: 1)

A simple base R solution. I'm sure someone can come up with something better.

out <- allowed[!allowed$name %in% merge(people, allowed)$name, ]

This gets you the people in question along with their permitted professions. If you also want their actual profession:

names(out)[2] <- "permitted"
out <- merge(people, out, all.y=TRUE)
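
For reference, the two steps above can be run end to end as a minimal sketch. It assumes character columns (`stringsAsFactors = FALSE`, which is not stated in the question) and that each name appears only once in `people`; the intermediate name `ok` is illustrative and not part of the original answer.

```r
people <- data.frame(name = c("Fred", "Bob"),
                     profession = c("Builder", "Baker"),
                     stringsAsFactors = FALSE)  # assumption: plain character columns
allowed <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                      profession = c("Builder", "Baker", "Barman", "Biker"),
                      stringsAsFactors = FALSE)

# People whose (name, profession) pair appears in `allowed`;
# merge() joins on both shared columns by default
ok <- merge(people, allowed)

# Keep the allowed rows of everyone who is NOT in that set
out <- allowed[!allowed$name %in% ok$name, ]
names(out)[2] <- "permitted"

# Attach each offending person's actual profession (joined on `name` only)
out <- merge(people, out, all.y = TRUE)
out
```

The result has one row per permitted profession for each person whose current profession is not allowed, with the columns `name`, `profession`, and `permitted`.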

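Answer 1: (score: 1)

Here is a slightly more readable data.table solution. If you find it readable, you can do the last step on the same line and turn it into a one-liner.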

# load library, convert people to a data.table and set a key
library(data.table)
people = data.table(people, key = "name,profession")

# compute
result = data.table(allowed, key = "name")[people[!allowed]]
setnames(result, "profession.1", "permitted")

result
#   name profession permitted
#1:  Bob     Barman     Baker
#2:  Bob      Biker     Baker
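
On more recent data.table releases the join above may name the clashing column differently (for example with an `i.` prefix rather than `profession.1`), in which case the `setnames` call would fail. The following is a minimal sketch that sidesteps the naming question by renaming before the join. It assumes a data.table version that supports the `on=` argument, and the names `people_dt`, `allowed_dt`, `bad`, and `permitted` are illustrative rather than taken from the answer.

```r
library(data.table)

people_dt <- data.table(name = c("Fred", "Bob"),
                        profession = c("Builder", "Baker"))
allowed_dt <- data.table(name = c("Fred", "Fred", "Bob", "Bob"),
                         profession = c("Builder", "Baker", "Barman", "Biker"))

# Anti-join: keep people whose (name, profession) pair is NOT in allowed_dt
bad <- people_dt[!allowed_dt, on = c("name", "profession")]

# Rename on a copy before joining, so no column names clash
permitted <- copy(allowed_dt)
setnames(permitted, "profession", "permitted")

# Attach every permitted profession to each offending person
result <- merge(bad, permitted, by = "name")
result
```

In this version `profession` holds the person's actual profession and `permitted` holds the allowed ones, which matches the layout requested in the question.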

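Answer 2: (score: 0)

There may be another way, but this should work. I added a third person with a disallowed profession to show you how to apply the function to the whole data set.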

currentprof <-structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Bob", 
"Fred", "Jan"), class = "factor"), profession = structure(c(3L, 
2L, 1L), .Label = c("Analyst", "Baker", "Builder"), class = "factor")), .Names = c("name", 
"profession"), class = "data.frame", row.names = c(NA, -3L))

allowed <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Bob", 
"Fred", "Jan"), class = "factor"), profession = structure(c(4L, 
1L, 2L, 3L, 6L, 5L), .Label = c("Baker", "Barman", "Biker", "Builder", 
"Driver", "Teacher"), class = "factor")), .Names = c("name", 
"profession"), class = "data.frame", row.names = c(NA, -6L))

# For one person, return name / current profession / permitted professions
# if the current profession is not allowed; otherwise an empty data frame.
checkprof <- function(name) {
  allowedn <- allowed[allowed$name == name, ]
  currentprofn <- currentprof[currentprof$name == name, ]
  if (!currentprofn$profession %in% allowedn$profession) {
    result <- merge(currentprofn, allowedn, by = "name", all.x = TRUE)
  } else {
    result <- data.frame(col1 = character(),
                         col2 = character(),
                         col3 = character(),
                         stringsAsFactors = FALSE)
  }
  colnames(result) <- c("name", "profession", "permitted")
  return(result)
}


do.call(rbind,lapply(levels(allowed$name),checkprof))

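Answer 3: (score: 0)

Here is my take. It may need more testing, and I would welcome suggestions myself. It works for your example, but I'm not sure whether it generalizes.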

people$check <- ifelse(people$profession %in% allowed[which(allowed$name == people$name),"profession"], TRUE,FALSE)

people_select <- people[people$check == TRUE,]

Edit: just to clarify, so that you don't downvote: ifelse is vectorized and runs very fast.
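
On the question of whether this generalizes: the comparison `allowed$name == people$name` relies on recycling, so it depends on how the rows happen to line up. One way to make the check independent of row order is to compare combined name/profession keys. This is only a sketch under stated assumptions: the helper `pair_key` and the `"|"` separator are hypothetical, the separator is assumed not to occur in any name or profession, and the columns are assumed to be character.

```r
people <- data.frame(name = c("Fred", "Bob"),
                     profession = c("Builder", "Baker"),
                     stringsAsFactors = FALSE)  # assumption: plain character columns
allowed <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                      profession = c("Builder", "Baker", "Barman", "Biker"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: one key per row, assuming "|" never appears in the data
pair_key <- function(df) paste(df$name, df$profession, sep = "|")

# TRUE where the person's (name, profession) pair is listed in `allowed`
people$check <- pair_key(people) %in% pair_key(allowed)

# People without an allowed profession, with their permitted options attached
not_allowed <- people[!people$check, c("name", "profession")]
result <- merge(not_allowed,
                setNames(allowed, c("name", "permitted")),
                by = "name")
result
```

Both `paste` and `%in%` are vectorized, so no explicit loop or `ifelse` is needed here.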