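I am working with text data and am looking for a solution to a filtering problem.
I managed to find a solution that filters for rows containing "Word 1" OR "Word 2".
Here is a reproducible example: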
df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
"long live the king",
"I love my dog a lot",
"Tomorrow will be a rainy day",
"Tomorrow will be a sunny day"))
#Filter for rows that contain "brown" OR "dog"
filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))
However, when I try to filter for rows that contain both "Word 1" AND "Word 2", it does not work.
#Filter for rows that contain "brown" AND "dog"
filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))
I can't work out the correct syntax; any help would be appreciated.
Answer 0 (score: 4)
You can use stringr::str_count:
dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
#   UID                                         Text test
# 1   1 the quick brown fox jumped over the lazy dog    2
# 2   2                           long live the king    0
# 3   3                          I love my dog a lot    1
# 4   4                 Tomorrow will be a rainy day    0
# 5   5                 Tomorrow will be a sunny day    0
dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
# UID Text
# 1 1 the quick brown fox jumped over the lazy dog
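Note, though, that this counts the total number of times dog or brown occurs, so a text that mentions one of the two words twice would also give a count of 2.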
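To make that caveat concrete, here is a minimal base R sketch. The vector `texts` is hypothetical example data (not from the question), and `regmatches`/`gregexpr` are used here only as a base R stand-in for counting matches; counting the distinct matched words is one possible way around the false positive.

```r
# Hypothetical example data: the second text repeats "dog" and never mentions "brown".
texts <- c("the quick brown fox jumped over the lazy dog",
           "my dog chased another dog")

# Extract every match of 'brown|dog' from each string.
matches <- regmatches(texts, gregexpr("brown|dog", texts))

# Total number of matches per string (what a plain match count looks at).
total_count <- lengths(matches)

# Number of distinct matched words per string.
distinct_count <- vapply(matches, function(m) length(unique(m)), integer(1))

total_count == 2     # both texts pass, although the second lacks "brown"
distinct_count == 2  # only the first text passes
```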
The following is more general and less elegant, but it lets you conveniently put the words you are searching for into a vector:
dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
~sum(unique(.) %in% c("brown","dog"))) == 2)
# UID Text
# 1 1 the quick brown fox jumped over the lazy dog
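The same "words in a vector" idea can also be sketched in base R without splitting the text. The helper name `contains_all` is hypothetical, and the sketch assumes the search words contain no regex metacharacters:

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each element of `text` that contains
# every word in `words` as a whole word.
contains_all <- function(text, words) {
  # One logical vector per word, using word boundaries around each word.
  hits <- lapply(words, function(w) grepl(paste0("\\b", w, "\\b"), text))
  # Combine the per-word results with an element-wise AND.
  Reduce(`&`, hits)
}

result <- df[contains_all(df$Text, c("brown", "dog")), ]
result
```

Adding a third word only means extending the vector, for example `c("brown", "dog", "fox")`.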
Answer 1 (score: 3)
We can use two grepl calls:
dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))
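Or use a single pattern that checks for the word "brown" followed by the word "dog" (note the word boundaries (\\b), which ensure that it does not match as part of a longer word), or "dog" followed by "brown":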
dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
# UID Text
#1 1 the quick brown fox jumped over the lazy dog
Note: this checks for word boundaries around "brown" and "dog" and that both words are present in the string.
The same can be done in base R:
subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
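As a small illustration of why the word boundaries matter, the sketch below compares the patterns with and without \\b. The vector `x` is hypothetical example data, not part of the question:

```r
# Hypothetical example data: the first string contains "brown" and "dog"
# only as parts of longer words.
x <- c("brownies and dogma", "a brown dog")

# Without word boundaries, substrings inside longer words also match.
loose <- grepl("brown", x) & grepl("dog", x)

# With word boundaries, only whole words match.
strict <- grepl("\\bbrown\\b", x) & grepl("\\bdog\\b", x)

loose   # both strings match
strict  # only the second string matches
```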
Answer 2 (score: 1)
Try this solution:
filtered_results_2=dplyr::filter(df, grepl('brown.*dog|dog.*brown', Text))
filtered_results_2
UID Text
1 1 the quick brown fox jumped over the lazy dog
Answer 3 (score: 1)
Using sqldf:
library(sqldf)
sqldf("select * from df where Text like '%dog%' AND Text like '%brown%'")
Output:
UID Text
1 1 the quick brown fox jumped over the lazy dog
Answer 4 (score: 1)
Similar to the previous answers, but using base R:
df[grepl("(?=.*dog)(?=.*brown)", df$Text, perl = TRUE),]
UID Text
1 1 the quick brown fox jumped over the lazy dog
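Each (?=.*word) in that pattern is a lookahead that asserts the word occurs somewhere later in the string without consuming any characters, so chaining two of them requires both words in any order. A minimal sketch of building such a pattern from a vector follows; the names `words` and `pattern` are illustrative, and it assumes the words contain no regex metacharacters:

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Words that must all be present (illustrative).
words <- c("dog", "brown")

# Build one lookahead per word and concatenate them:
# "(?=.*dog)(?=.*brown)"
pattern <- paste0("(?=.*", words, ")", collapse = "")

# Lookaheads require the PCRE engine, hence perl = TRUE.
keep <- grepl(pattern, df$Text, perl = TRUE)
df[keep, ]
```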