How do I compare a data frame against a list and return the values in the data frame that match the list?

Time: 2018-02-06 00:17:32

Tags: r grep grepl

Total R newbie question. I have a data frame df with ID and Notes columns:

ID    Notes
1     dogs are friendly
2     dogs and cats are pets
3     cows live on farms
4     cats and cows start with c

I have another list of values, "animals":

cats
cows

I want to add another column, "Matches", to the data frame containing all the animals found in Notes, e.g.

ID    Notes                        Matches
1     dogs are friendly            
2     dogs and cats are pets       cats
3     cows live on farms           cows
4     cats and cows start with c   cats, cows

So far the only luck I've had is using grepl, which only tells me whether there is any match:

grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)

How do I return the matched values themselves?

Update
Some rows in my data frame contain multiple instances of "cats" in Notes, for example:

ID    Notes                             Matches
1     dogs are friendly            
2     dogs and cats are pets            cats
3     cows live on farms                cows
4     cats and cats cows start with c   cats, cows

I only want to return one instance of each match. @LachlanO got me very close with his solution, but I get:

[1] "NA, NA"                      "cats, NA"                    "NA, cows"                    "c(\"cats\", \"cats\"), cows"

How can I return only the distinct matches?

3 Answers:

Answer 0: (score: 1)

Edit: added a unique operation to handle duplicate matches.

I can get you started and then point you in the right direction :)

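The code below uses stringr::str_extract_all to pull out the relevant bits we need, but unfortunately it also leaves us with some bits we don't want, most notably the empty results where nothing matched. The unique function in the middle of our custom function just ensures we keep only the unique matches for each element.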

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

animals = c("cats", "cows")

matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA

apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA"     "cats, NA"   "NA, cows"   "cats, cows"

You could set this as your extra column, but it isn't great because of those NAs. If there were a paste function that ignored NAs, we'd be set.

Fortunately, another user has already solved this problem :) Check out this answer here.

Combining that with the above should give you a suitable solution!
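As a rough illustration of the "paste that ignores NAs" idea, here is a minimal base-R sketch. The helper name paste_na_rm and the small matrix m are hypothetical stand-ins made up for this example (m just mimics the shape of matches above); this is not the code from the linked answer.

```r
# Hypothetical helper (an assumption, not the linked answer's code):
# drop NA values before collapsing, so they never reach the output string.
paste_na_rm <- function(x, sep = ", ") {
  paste(x[!is.na(x)], collapse = sep)
}

# Stand-in for `matches`: one row per note, one column per animal.
# matrix() fills column by column, so the first four values are the "cats" column.
m <- matrix(
  c(NA, "cats", NA, "cats",
    NA, NA, "cows", "cows"),
  ncol = 2,
  dimnames = list(NULL, c("cats", "cows"))
)

# Apply the helper row by row, as with apply(matches, 1, paste, ...) above.
result <- apply(m, 1, paste_na_rm)
print(result)
```

Rows with no match come out as an empty string rather than "NA, NA", which is closer to the Matches column asked for in the question.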

Answer 1: (score: 0)

Here's how I would do it:

animals = c("cats", "cows")
reg = paste(animals, collapse = "|")

library(stringr)
matches = str_extract_all(Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")

df$matches = matches
df
#   ID                       Notes   matches
# 1  1           dogs are friendly          
# 2  2      dogs and cats are pets      cats
# 3  3          cows live on farms      cows
# 4  4 cats and cows start with c  cats,cows

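If you want to get fancy, paste word boundaries onto the regex, e.g. reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting the middle of words. (Use paste0 rather than paste here: paste would insert spaces between the boundary markers and the words, and those spaces would become part of the pattern.)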
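To see what the word boundaries buy you, here is a minimal sketch. It uses base R (gregexpr with regmatches) instead of stringr so that it runs without extra packages; the extract_all helper and the sample notes are made up for the illustration.

```r
animals <- c("cats", "cows")
# Made-up notes: the first one only contains the animals inside longer words.
notes <- c("scats and cowshed", "cats and cows")

# Without boundaries: "cats|cows"
reg_plain <- paste(animals, collapse = "|")
# With boundaries: paste0() adds no separator, giving "\\bcats\\b|\\bcows\\b"
reg_bounded <- paste0("\\b", animals, "\\b", collapse = "|")

# Hypothetical helper: return every match of `pattern` for each element of x.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

plain <- extract_all(notes, reg_plain)
bounded <- extract_all(notes, reg_bounded)

print(plain[[1]])    # matches found inside "scats" and "cowshed"
print(bounded[[1]])  # no whole-word matches in the first note
print(bounded[[2]])  # whole words in the second note still match
```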

Using the data provided by LachlanO:

ID = seq(1,4)
Notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cows start with c "
)
df <- data.frame(ID, Notes)

Answer 2: (score: 0)

You can use gsub to get all the animals in one go:

gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T)
[1] ""          "cats "     "cows"      "cats cows"

So, written in one pass:

transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = T))
  ID                       Notes   matches
1  1           dogs are friendly          
2  2      dogs and cats are pets     cats 
3  3          cows live on farms      cows
4  4 cats and cows start with c  cats cows
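Note that the gsub approach keeps every occurrence it finds, so for the row from the question's update that mentions "cats" twice it would appear to return the word twice. A minimal base-R sketch of one way to keep only distinct matches, assuming the animals vector from the question and a notes vector copied from the updated example:

```r
# Notes from the question's update; row 4 mentions "cats" twice.
notes <- c(
  "dogs are friendly",
  "dogs and cats are pets",
  "cows live on farms",
  "cats and cats cows start with c"
)
animals <- c("cats", "cows")
pattern <- paste(animals, collapse = "|")

# gregexpr() finds every match per note; regmatches() extracts them as a list.
found <- regmatches(notes, gregexpr(pattern, notes))

# unique() drops repeats within each note before collapsing to one string.
matches <- vapply(
  found,
  function(m) paste(unique(m), collapse = ", "),
  character(1)
)
print(matches)
```

The result can be assigned as a new column, e.g. df$Matches <- matches, provided notes is taken from df$Notes.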