Question

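I am looking for a way to scan a character vector using another character vector. I have spent a lot of time on this but can't get it to work, and I have not found a function that does what I intend. I am sure there is a simple way to solve this, though.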

So, let's say I have the following vector:

    c <- c("bread", "milk", "oven", "salt")

On the other hand, I have a vector containing sentences.

    text <- c("The BREAD is in the oven. Wonderful!!",
    "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
    well.", "Oven, oven, oven, why not just eat it raw.")

Now I want to scan the text using the contents of the c vector. The output should look like this:

                                             text bread milk oven salt
    1       The BREAD is in the oven. Wonderful!!    1    0    1    0
    2        We don't only need Milk... as well."    0    1    0    1
    3 Oven, oven, oven, why not just eat it raw.     0    0    3    0

Another thing I would like to do is search for word combinations rather than just single words.

    c <- c("need milk", "oven oven", "eat it")

And get the same kind of output:

                                             text need milk oven oven eat it
    1       The BREAD is in the oven. Wonderful!!     0         0        0
    2        We don't only need Milk... as well."     1         0        1
    3 Oven, oven, oven, why not just eat it raw.      0         2        1

It would be great if someone could help! :) Thank you very much!

Answer 1

We can use str_count to count the number of occurrences of each pattern in the string:

    library(stringr)
    data.frame(text, sapply(c, str_count, string = tolower(text)))
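For the multi-word part of the question, here is a minimal base R sketch rather than anything taken from the answer above. It assumes that punctuation should be ignored and that overlapping occurrences should each be counted (so "Oven, oven, oven" contains "oven oven" twice). The names `phrases`, `normalise` and `count_overlapping` are made up for this illustration, and the second sentence is written on one line here.

```r
# Sentences and phrases from the question; `phrases` is a hypothetical name
# used here instead of `c` to avoid shadowing base::c.
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
phrases <- c("need milk", "oven oven", "eat it")

# Normalise: lower-case, replace punctuation with spaces, collapse whitespace.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Count (possibly overlapping) occurrences of a literal phrase in each string
# by matching a zero-width lookahead at every position where the phrase starts.
count_overlapping <- function(phrase, x) {
  pattern <- paste0("(?=\\b\\Q", phrase, "\\E\\b)")
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() reports "no match" as -1, so only positive positions are counted.
  vapply(hits, function(m) sum(m > 0), integer(1))
}

counts <- sapply(phrases, count_overlapping, x = normalise(text))
result <- data.frame(text, counts, check.names = FALSE)
print(result)
```

The lookahead is what lets the two occurrences of "oven oven" in the third sentence share the middle word; a plain pattern would consume it and report only one match.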

Answer 2

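Here is another solution using the stringi package, which beats the other approaches at least in terms of speed (though not necessarily simplicity). Of course, it depends on what "beats" means here, for example if you weigh speed against simplicity and staying in base R.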

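It is also worth mentioning that the grepl solution does not return actual counts but only a binary indicator (present or absent), as noted in the comments above, so it is not directly comparable. Depending on your needs, though, that may be enough.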

    library(stringi)
    library(stringr)
    library(microbenchmark)

    c <- c("bread", "milk", "oven", "salt")
    text <- c("The BREAD is in the oven. Wonderful!!",
              "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
              well.", "Oven, oven, oven, why not just eat it raw.")

    # stringi: case-insensitive fixed-string counts, one column per word
    stringi_approach <- function() {
      matches <- sapply(c, function(w) {stri_count_fixed(text, w, case_insensitive = TRUE)})
      rownames(matches) <- text
      matches  # return the count matrix rather than the value of the assignment
    }

    # grepl: presence/absence only (1/0), not a count
    grepl_approach <- function() {
      data.frame(text, +(sapply(c, grepl, tolower(text))))
    }

    # stringr: counts on the lower-cased text
    stringr_approach <- function() {
      data.frame(text, sapply(c, str_count, string = tolower(text)))
    }

    microbenchmark(
      grepl_approach(),
      stringr_approach(),
      stringi_approach()
    )

    # Unit: microseconds
    #                expr     min      lq     mean   median       uq     max neval
    #    grepl_approach() 309.091 338.500 351.3017 347.5790 352.7105 565.679   100
    #  stringr_approach() 380.541 418.634 437.7599 429.2925 441.7275 814.767   100
    #  stringi_approach() 101.057 113.492 126.9763 129.4790 133.8215 217.903   100
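To make the point about grepl concrete, here is a small base R sketch (not part of the benchmark, and using only two of the sentences) that contrasts a presence flag with an actual count. The variable names `present` and `counted` are chosen just for this illustration.

```r
text <- c("The BREAD is in the oven. Wonderful!!",
          "Oven, oven, oven, why not just eat it raw.")

# grepl() only reports whether the pattern occurs at all; unary + turns
# TRUE/FALSE into 1/0.
present <- +grepl("oven", tolower(text), fixed = TRUE)

# gregexpr() returns every match position, so counting the positions gives
# the number of occurrences (a no-match is reported as -1).
hits <- gregexpr("oven", tolower(text), fixed = TRUE)
counted <- vapply(hits, function(m) sum(m > 0), integer(1))

print(data.frame(text, present, counted))
```

Both sentences are flagged as containing "oven", but only the count shows that the second one contains it three times.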

Answer 3

You can use the corpus library:

    library(corpus)
    library(Matrix)

    text <- c("The BREAD is in the oven. Wonderful!!",
        "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
        well.", "Oven, oven, oven, why not just eat it raw.")

    term_matrix(text, select = c("bread", "milk", "oven", "salt"))
    ## 3 x 4 sparse Matrix of class "dgCMatrix"
    ##      bread milk oven salt
    ## [1,]     1    .    1    .
    ## [2,]     1    1    .    1
    ## [3,]     .    .    3    .

    term_matrix(text, select = c("need milk", "oven oven", "eat it"), drop_punct = TRUE)
    ## 3 x 3 sparse Matrix of class "dgCMatrix"
    ##      need milk oven oven eat it
    ## [1,]         .         .      .
    ## [2,]         1         .      .
    ## [3,]         .         2      1

Alternatively, you can adapt Manuel Bickel's answer by using text_count in place of str_count.
