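I am looking for a way to scan a character vector using another character vector. I have spent a lot of time on this but cannot get it to work, and I cannot find a function that does what I intend. I am sure there is a simple solution to this problem.
So, let's say I have the following vector: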
c <- c("bread", "milk", "oven", "salt")
On the other hand, I have a vector containing sentences.
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
Now I would like to scan the text using the contents of the vector c. The output should look like this:
  text                                       bread milk oven salt
1 The BREAD is in the oven. Wonderful!!          1    0    1    0
2 We don't only need Milk... as well.            1    1    0    1
3 Oven, oven, oven, why not just eat it raw.     0    0    3    0
Another thing I would like to do is search for combinations of words rather than just single words.
c <- c("need milk", "oven oven", "eat it")
and get the same kind of output:
  text                                       need milk oven oven eat it
1 The BREAD is in the oven. Wonderful!!              0         0      0
2 We don't only need Milk... as well.                1         0      0
3 Oven, oven, oven, why not just eat it raw.         0         2      1
It would be great if someone could help! :) Thank you very much!
Answer 0 (score: 4)
We can use str_count to count the number of occurrences of each pattern in the string.
library(stringr)
data.frame(text, sapply(c, str_count, string = tolower(text)))
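For the multi-word case in the question, punctuation ("Oven, oven, oven") and overlapping matches get in the way of a plain substring count. What follows is only a base-R sketch, not part of the answer above: `count_terms` is a hypothetical helper name, `terms` stands in for the question's `c`, and it assumes the search terms are plain words without regex metacharacters. It replaces punctuation with spaces and uses a zero-width lookahead so that overlapping phrase matches are counted separately.

```r
# Example data from the question; `terms` replaces the question's `c`
# so that the base function c() is not shadowed.
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")

# Hypothetical helper, not from any package.
# Assumption: terms are plain words/phrases without regex metacharacters.
count_terms <- function(text, terms) {
  # lower-case, turn punctuation into spaces, collapse runs of whitespace
  clean <- gsub("[[:punct:]]", " ", tolower(text))
  clean <- gsub("\\s+", " ", clean)
  counts <- sapply(terms, function(term) {
    # zero-width lookahead: each starting position is its own match,
    # so overlapping occurrences are counted separately
    hits <- gregexpr(paste0("(?=", term, ")"), clean, perl = TRUE)
    # gregexpr() reports -1 when there is no match, so count positive positions
    sapply(hits, function(h) sum(h > 0))
  })
  # check.names = FALSE keeps column names such as "need milk" intact
  data.frame(text = text, counts, check.names = FALSE)
}

word_counts   <- count_terms(text, c("bread", "milk", "oven", "salt"))
phrase_counts <- count_terms(text, c("need milk", "oven oven", "eat it"))
```

Like str_count, this matches substrings rather than whole words, so a term such as "eat" would also be found inside "great"; adding word boundaries (`\\b`) to the pattern would be one way to tighten that.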
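Answer 1 (score: 1)
Here is another solution using the stringi package, which beats the other approaches at least in terms of speed (though not in simplicity). Of course, what "beats" means here depends on how you weigh speed against simplicity and sticking to base R.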
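It is also worth mentioning that the grepl solution does not return actual counts but a binary indicator (present or absent), as pointed out in the comments above, so it is not directly comparable. Depending on your needs, however, that may be enough.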
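To make the difference concrete, here is a small base-R illustration on one sentence from the question. The variable names `sentence`, `present` and `n` are only for this sketch; it is not part of the benchmark that follows.

```r
sentence <- "oven, oven, oven, why not just eat it raw."

# grepl() only reports whether the pattern occurs at all;
# the unary + converts TRUE/FALSE to 1/0
present <- +grepl("oven", sentence, fixed = TRUE)

# gregexpr() locates every match, so the number of extracted
# matches is the actual count
n <- lengths(regmatches(sentence, gregexpr("oven", sentence, fixed = TRUE)))
```

Here `present` is 1 while `n` is 3, which is why the grepl timings answer a slightly different question than the two counting approaches.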
library(stringi)
library(stringr)
library(microbenchmark)
c <- c("bread", "milk", "oven", "salt")
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
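stringi_approach <- function() {
  # count case-insensitive fixed-pattern matches of each term in each sentence
  matches <- sapply(c, function(w) stri_count_fixed(text, w, case_insensitive = TRUE))
  rownames(matches) <- text
  # return the count matrix; ending on the assignment above would return `text` instead
  matches
}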
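grepl_approach <- function() {
  # grepl() gives TRUE/FALSE per sentence; the unary + converts that to 1/0
  df <- data.frame(text, +(sapply(c, grepl, tolower(text))))
}
stringr_approach <- function() {
  # str_count() gives the number of matches per sentence
  df <- data.frame(text, sapply(c, str_count, string = tolower(text)))
}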
microbenchmark(
grepl_approach(),
stringr_approach(),
stringi_approach()
)
# Unit: microseconds
#                expr     min      lq     mean   median       uq     max neval
#    grepl_approach() 309.091 338.500 351.3017 347.5790 352.7105 565.679   100
#  stringr_approach() 380.541 418.634 437.7599 429.2925 441.7275 814.767   100
#  stringi_approach() 101.057 113.492 126.9763 129.4790 133.8215 217.903   100
Answer 2 (score: 0)
You can use the corpus library:
library(corpus)
library(Matrix)
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
term_matrix(text, select = c("bread", "milk", "oven", "salt"))
## 3 x 4 sparse Matrix of class "dgCMatrix"
##      bread milk oven salt
## [1,]     1    .    1    .
## [2,]     1    1    .    1
## [3,]     .    .    3    .
term_matrix(text, select = c("need milk", "oven oven", "eat it"), drop_punct = TRUE)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##      need milk oven oven eat it
## [1,]         .         .      .
## [2,]         1         .      .
## [3,]         .         2      1
Alternatively, you can modify one of Manuel Bickel's answers by using text_count in place of str_count.