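I have a list of items and a list of search terms, and I am trying to do two things: (1) flag each item that matches any of the search terms, and (2) for each matching item, return the search term that was matched.
So, given the following data frame: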
items
1 alex
2 alex is a person
3 this is a test
4 false
5 this is cathy
And the following list of search terms:
"alex" "bob" "cathy" "derrick" "erica" "ferdinand"
I would like to create the following output:
             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
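Step 1 is fairly straightforward, but I am having trouble with step (2). To create the "matches" variable, I use grepl(), which gives TRUE if the row in d$items matches one of the search terms and FALSE otherwise.
For step 2, my thinking was that I should be able to use grep() with value = T, as in the code below. However, this returns the wrong value: instead of the original search term that grep matched, it returns the value of the item that matched. So I get the following output:
             items matches         original
1             alex    TRUE             alex
2 alex is a person    TRUE alex is a person
3   this is a test   FALSE             <NA>
4            false   FALSE             <NA>
5    this is cathy    TRUE    this is cathy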
# Dummy data and search terms
d = data.frame(items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"))
searchTerms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")
# Return TRUE iff a search term is found in the items column as a whole word, not embedded between letters
d$matches = grepl(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), d[,1], ignore.case = TRUE
)
# Subset data to the rows that matched
dMatched = d[d$matches==T,]
# This is where the problem is: return the value that was originally matched with grepl above
dMatched$original = grep(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), dMatched[,1], ignore.case = TRUE, value = TRUE
)
# Copy the result back into the full data frame; unmatched rows stay NA
d$original[d$matches==T] = dMatched$original
This is the code I am currently using. Any ideas would be much appreciated!
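The behavior can be reproduced in isolation. In this small sketch (the vector items and the pattern are cut down for illustration and are not the full data above), grep() with value = TRUE returns the matching elements of its input, which is why the whole item comes back instead of the search term:

```r
# Cut-down, illustrative inputs (not the full data from the question)
items <- c("alex is a person", "this is a test", "this is cathy")
pattern <- "alex|cathy"

# Without value = TRUE, grep() returns the positions of the matching elements
idx <- grep(pattern, items)

# With value = TRUE, grep() returns the matching elements themselves,
# i.e. whole strings from `items`, never the part of the pattern that matched
vals <- grep(pattern, items, value = TRUE)

print(idx)
print(vals)
```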
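Answer 0 (score: 3)
Thanks to Dason for the helpful hint! I was able to solve my problem using regmatches(). Here is my code, picking up from where the original question left off: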
# Locate the first match in each matched item, then extract the matched text
# Note: the delimiter groups are part of the pattern, so the extracted text can include an adjacent space
m = regexpr(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), dMatched[,1], ignore.case = TRUE
)
dMatched$original = regmatches(dMatched[,1], m)
# Copy the result back into the full data frame; unmatched rows stay NA
d$original[d$matches==T] = dMatched$original
This returns the following output, which is exactly what I wanted:
             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
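As a variation on the approach above, here is a minimal sketch that swaps the explicit delimiter groups for the word-boundary anchor \b and fills the column without subsetting first. With the delimiter groups, the extracted text can include an adjacent space (for example " cathy"); \b is zero-width, so only the term itself is extracted. The names pattern and hit are illustrative, and stringsAsFactors = FALSE and perl = TRUE are assumptions made here, not part of the original code.

```r
# Same dummy data as in the question; stringsAsFactors = FALSE is an assumption
# made here so that `items` is a plain character column
d <- data.frame(
  items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"),
  stringsAsFactors = FALSE
)
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# One alternation of all terms, wrapped in zero-width word boundaries
pattern <- paste0("\\b(", paste(searchTerms, collapse = "|"), ")\\b")

# regexpr() gives the position of the first match per item, or -1 if none
m <- regexpr(pattern, d$items, ignore.case = TRUE, perl = TRUE)
hit <- as.integer(m) != -1L

d$matches <- hit
d$original <- NA_character_
# regmatches() returns one string per matched item, in row order
d$original[hit] <- regmatches(d$items, m)

print(d)
```

Because ignore.case = TRUE is set, the extracted text keeps the capitalization used in the item, which may differ from the capitalization in searchTerms.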
Answer 1 (score: 2)
This is not exactly what you asked for, but you could use the termco function from qdap, which also helps when two names appear in the same sentence. With qdap you can use:
library(qdap)
termco(d$items, 1:nrow(d), searchTerms)
## > termco(d$items, 1:nrow(d), searchTerms)
##   nrow(d word.count       alex bob     cathy derrick erica ferdinand
## 1      1          1 1(100.00%)   0         0       0     0         0
## 2      2          4  1(25.00%)   0         0       0     0         0
## 3      3          4          0   0         0       0     0         0
## 4      4          1          0   0         0       0     0         0
## 5      5          3          0   0 1(33.33%)       0     0         0
## 5 5 3 0 0 1(33.33%) 0 0 0