Find in R

Time: 2017-11-21 09:03:07

Tags: r vector find text-mining

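I am looking for a way to scan a character vector using another character vector. I have already spent a lot of time on this, but I can't get it to work and I can't find a function that does what I intend. I'm sure there is a simple solution to this problem, though.

So, let's say I have the following vector: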

    c <- c("bread", "milk", "oven", "salt")

On the other hand, I have a vector containing sentences:

    text <- c("The BREAD is in the oven. Wonderful!!",
    "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
    well.", "Oven, oven, oven, why not just eat it raw.")

Now I want to scan the block of text using the contents of the vector c. The output should look something like this:

                                            text bread milk oven salt
    1      The BREAD is in the oven. Wonderful!!     1    0    1    0
    2        We don't only need Milk... as well.     1    1    0    1
    3 Oven, oven, oven, why not just eat it raw.     0    0    3    0

Another thing I would like to do is search for combinations of words rather than only single words:

    c <- c("need milk", "oven oven", "eat it")

and get the same kind of output:

                                            text need milk oven oven eat it
    1      The BREAD is in the oven. Wonderful!!         0         0      0
    2        We don't only need Milk... as well.         1         0      0
    3 Oven, oven, oven, why not just eat it raw.         0         2      1

It would be great if someone could help! :) Thank you very much!

3 answers:

Answer 0: (score: 4)

We can use str_count to count the occurrences of each pattern in the string:

    library(stringr)
    data.frame(text, sapply(c, str_count, string = tolower(text)))
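
For the word combinations in the question, a plain substring count on the lower-cased text would miss "oven oven" in "Oven, oven, oven", because of the commas, and non-overlapping matching would report at most one hit there. The following is a minimal base R sketch of one way around that, not part of the answer above: it strips punctuation first and counts with a zero-width lookahead so that overlapping matches are each counted. The names normalise_text, count_overlapping and phrases are hypothetical, and the sketch assumes the phrases contain no regex metacharacters.

```r
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
phrases <- c("need milk", "oven oven", "eat it")

# Hypothetical helper: lower-case, turn punctuation into spaces,
# then collapse runs of whitespace into a single space.
normalise_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Hypothetical helper: count matches of one phrase in every element of x.
# The lookahead (?=...) consumes no characters, so overlapping hits count.
# Assumes `phrase` contains no regex metacharacters.
count_overlapping <- function(phrase, x) {
  pattern <- paste0("(?=\\b", phrase, "\\b)")
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() reports -1 when there is no match, so count positions > 0
  vapply(hits, function(h) sum(h > 0), integer(1))
}

phrase_counts <- sapply(phrases, count_overlapping, x = normalise_text(text))

# check.names = FALSE keeps the spaces in the column names
result <- data.frame(text, phrase_counts, check.names = FALSE)
print(result)
```

Whether overlapping occurrences should count twice is a design choice; dropping the lookahead and matching the phrase directly would give non-overlapping counts instead.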

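Answer 1: (score: 1)

Here is another solution using the stringi package, which beats the other approaches at least in terms of speed (though not in simplicity). Of course, what "beats" means here depends on how you weigh speed against simplicity and staying within base R.

It should also be mentioned that the grepl solution does not return actual counts but a binary present/absent indicator, as noted in the comments above, so it is not directly comparable. Depending on your needs, however, that may be sufficient.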

    library(stringi)
    library(stringr)
    library(microbenchmark)

    c <- c("bread", "milk", "oven", "salt")
    text <- c("The BREAD is in the oven. Wonderful!!",
              "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
              well.", "Oven, oven, oven, why not just eat it raw.")

    # Counts per pattern via stringi, case-insensitive fixed matching
    stringi_approach <- function() {
      matches <- sapply(c, function(w) {stri_count_fixed(text, w, case_insensitive = TRUE)})
      rownames(matches) <- text
      matches  # return the matrix, not the value of the rownames assignment
    }

    # Presence/absence only (0/1), not counts
    grepl_approach <- function() {
      data.frame(text, +(sapply(c, grepl, tolower(text))))
    }

    # Counts per pattern via stringr on the lower-cased text
    stringr_approach <- function() {
      data.frame(text, sapply(c, str_count, string = tolower(text)))
    }

    microbenchmark(
      grepl_approach(),
      stringr_approach(),
      stringi_approach()
    )

    # Unit: microseconds
    #                expr     min      lq     mean   median       uq     max neval
    #    grepl_approach() 309.091 338.500 351.3017 347.5790 352.7105 565.679   100
    #  stringr_approach() 380.541 418.634 437.7599 429.2925 441.7275 814.767   100
    #  stringi_approach() 101.057 113.492 126.9763 129.4790 133.8215 217.903   100
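
To make the difference between the binary indicator and real counts concrete, here is a small base R sketch (an illustration only, not one of the benchmarked functions). The names patterns, present and counts are assumptions introduced for this example; the counting uses gregexpr with fixed matching rather than stringi or stringr.

```r
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")
patterns <- c("bread", "milk", "oven", "salt")
lower <- tolower(text)

# Presence/absence: grepl() returns TRUE/FALSE, unary + turns it into 1/0
present <- +(sapply(patterns, grepl, x = lower, fixed = TRUE))

# Actual counts: gregexpr() returns all match positions (-1 if none)
counts <- sapply(patterns, function(p) {
  hits <- gregexpr(p, lower, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
})

# Cells where the two results disagree, i.e. a pattern occurs more than once
which(present != counts, arr.ind = TRUE)
```

With this input the two matrices differ only in the "oven" column of the third sentence, where the word occurs three times.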

Answer 2: (score: 0)

You can use the corpus library:

    library(corpus)
    library(Matrix)

    text <- c("The BREAD is in the oven. Wonderful!!",
        "We don't only need Milk to bake a yummy bread, but a pinch of salt as 
        well.", "Oven, oven, oven, why not just eat it raw.")

    term_matrix(text, select = c("bread", "milk", "oven", "salt"))
    ## 3 x 4 sparse Matrix of class "dgCMatrix"
    ##      bread milk oven salt
    ## [1,]     1    .    1    .
    ## [2,]     1    1    .    1
    ## [3,]     .    .    3    .

    term_matrix(text, select = c("need milk", "oven oven", "eat it"), drop_punct = TRUE)
    ## 3 x 3 sparse Matrix of class "dgCMatrix"
    ##      need milk oven oven eat it
    ## [1,]         .         .      .
    ## [2,]         1         .      .
    ## [3,]         .         2      1

Alternatively, you can modify one of Manuel Bickel's answers by using text_count in place of str_count.