Searching for text in a data frame in R

Date: 2018-05-24 05:22:41

Tags: r

I have two different data frames in R.

A - the dataset contains the following data:

cat
dog
Rat
Parrot
Tiger

B - the dataset contains the following data:

Give milk to cat
dog bites
life span of dog is 10 years
Cow gives us milk
Tiger have huge Jaws

Now the R code must search all of dataset B for each value in dataset A.

2 answers:

Answer 0: (score: 1)

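One option is to use apply and look up each word from df_A in every row of df_B. The OP has not clearly specified the expected format. The words from df_A that are found can be listed in the final output using unlist and unique.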

library(dplyr)
apply(df_B,1, function(x){
  df_A$Word[(df_A$Word %in% strsplit(x, split=" ")[[1]])]
}) %>% unlist() %>% unique()
#[1] "cat"   "dog"   "Tiger"

# If the objective is to find which rows of df_B contain at least one word from df_A, then:
df_B$Have_A <- mapply(function(x){
  any(df_A$Word %in% strsplit(x, split=" ")[[1]])
}, df_B$Text)

df_B
#                           Text Have_A
# 1             Give milk to cat   TRUE
# 2                    dog bites   TRUE
# 3 life span of dog is 10 years   TRUE
# 4            Cow gives us milk  FALSE
# 5         Tiger have huge Jaws   TRUE
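
The same idea can be sketched in base R without dplyr. This is only an illustration under a few assumptions: the data frames are rebuilt inline so the block runs on its own, the names tokens, has_word and found are made up for this sketch, and matching is made case-insensitive with tolower, which the code above does not do.

```r
# Rebuild the example data so the sketch is self-contained.
df_A <- data.frame(Word = c("cat", "dog", "Rat", "Parrot", "Tiger"),
                   stringsAsFactors = FALSE)
df_B <- data.frame(Text = c("Give milk to cat", "dog bites",
                            "life span of dog is 10 years",
                            "Cow gives us milk", "Tiger have huge Jaws"),
                   stringsAsFactors = FALSE)

# Split every sentence into lower-case tokens once, on runs of whitespace.
tokens <- strsplit(tolower(df_B$Text), "\\s+")

# TRUE when a row shares at least one token with df_A$Word (case ignored).
has_word <- vapply(tokens,
                   function(tok) any(tolower(df_A$Word) %in% tok),
                   logical(1))

# Words from df_A that occur anywhere in df_B.
found <- df_A$Word[tolower(df_A$Word) %in% unlist(tokens)]

print(has_word)
print(found)
```

vapply is used instead of mapply here because it checks that every result is a single logical value, so has_word can be assigned straight to a new column such as df_B$Have_A.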

Data:

df_B <- read.table(text =
"Text 
'Give milk to cat'
'dog bites'
'life span of dog is 10 years'
'Cow gives us milk'
'Tiger have huge Jaws'",
header = TRUE, stringsAsFactors = FALSE)



df_A <- read.table(text =
"Word 
cat
dog
Rat
Parrot
Tiger",
header = TRUE, stringsAsFactors = FALSE)

Answer 1: (score: 1)

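We can paste the elements of the column in the 'A' dataset together and use the result as the pattern in grepl, which returns a logical vector by checking the strings in the column of the 'B' dataset. The \\b word boundaries restrict the match to whole words, and ignore.case = TRUE makes it case-insensitive.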

i1 <- grepl(paste0("\\b(", paste(A$col, collapse="|"), ")\\b"),
      B$col, ignore.case = TRUE)
i1
#[1]  TRUE  TRUE  TRUE FALSE  TRUE

B$col[i1]
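
The combined pattern tells us which rows of B match, but not which word from A matched. If that is what is needed, one possible sketch is to run grepl once per word and collect the results in a logical matrix. The names hits and matched are invented for this illustration, the data is rebuilt inline so the block runs on its own, and it assumes the words in A contain no regular-expression metacharacters.

```r
# Rebuild the example data so the sketch is self-contained.
A <- data.frame(col = c("cat", "dog", "Rat", "Parrot", "Tiger"),
                stringsAsFactors = FALSE)
B <- data.frame(col = c("Give milk to cat", "dog bites",
                        "life span of dog is 10 years",
                        "Cow gives us milk", "Tiger have huge Jaws"),
                stringsAsFactors = FALSE)

# One column per word in A, one row per string in B.
# Each cell is TRUE when that word occurs as a whole word in that string.
hits <- sapply(A$col, function(w) {
  grepl(paste0("\\b", w, "\\b"), B$col, ignore.case = TRUE)
})
rownames(hits) <- B$col

# Number of rows in B that mention each word from A.
print(colSums(hits))

# Words from A that appear in at least one row of B.
matched <- names(which(colSums(hits) > 0))
print(matched)
```

Row sums of the same matrix (rowSums(hits) > 0) give back the per-row logical vector that the single combined pattern produces.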

Data:

A <- structure(list(col = c("cat", "dog", "Rat", "Parrot", "Tiger"
)), .Names = "col", class = "data.frame", row.names = c(NA, -5L
))

B <- structure(list(col = c("Give milk to cat", "dog bites", 
  "life span of dog is 10 years", 
 "Cow gives us milk", "Tiger have huge Jaws")), .Names = "col",
 class = "data.frame", row.names = c(NA, -5L))