Question

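I have a vector of regular words ("activated") or wildcard words ("activat*"). I want to:

1) Count the number of times each word appears in a given text (i.e., if "activated" appears in the text, the frequency of "activated" would be 1).

2) Count the number of times each wildcard word appears in a text (i.e., if "activated" and "activation" both appear in the text, the frequency of "activat*" would be 2).

I'm able to achieve (1), but not (2). Can anyone help? Thanks.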

library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")

# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)

# Result:
#      docs    word.count    activation    activated    activat*
# 1   doc 1             5     1(20.00%)    1(20.00%)           0

Answer 1

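Could this be version-related? I ran nearly the same code (see below; note that the last term is "activat", without the asterisk) and got the result you expected: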

    > text <- "activation has begunm system activated"
    > text <- Corpus(VectorSource(text))
    > words <- c("activation", "activated", "activat")
    > apply_as_df(text, termco, match.list=words)
       docs word.count activation activated   activat
    1 doc 1          5  1(20.00%) 1(20.00%) 2(40.00%)

Here is the output I get when I run R.Version(). I'm running it in RStudio version 0.99.491 on Windows 10.

    > R.Version()

    $platform
    [1] "x86_64-w64-mingw32"

    $arch
    [1] "x86_64"

    $os
    [1] "mingw32"

    $system
    [1] "x86_64, mingw32"

    $status
    [1] ""

    $major
    [1] "3"

    $minor
    [1] "2.3"

    $year
    [1] "2015"

    $month
    [1] "12"

    $day
    [1] "10"

    $`svn rev`
    [1] "69752"

    $language
    [1] "R"

    $version.string
    [1] "R version 3.2.3 (2015-12-10)"

    $nickname
    [1] "Wooden Christmas-Tree"
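
If the difference really is version-related, the package versions may matter as much as the R version itself. The following is a minimal sketch for printing both so they can be compared between machines. The package names are taken from the question, and whether a version difference actually explains the result is an open assumption here.

```r
# Print the R version string, then the installed version of each package
# used in the question. system.file() returns "" when a package is not
# installed, so nothing is loaded or attached by this check.
cat(R.version.string, "\n")
for (pkg in c("tm", "qdap")) {
  if (nzchar(system.file(package = pkg))) {
    cat(pkg, as.character(packageVersion(pkg)), "\n")
  } else {
    cat(pkg, "is not installed\n")
  }
}
```

Calling sessionInfo() after loading the packages gives a fuller picture, including the versions of attached dependencies.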

Hope this helps.

Answer 2

Maybe consider a different approach using the stringi library?

text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")

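library(stringi)
# For each term, replace "*" with the regex class \p{L} (any single letter),
# then count how many times the resulting pattern occurs in the text.
counts <- unlist(lapply(words, function(word) {
  newWord <- stri_replace_all_fixed(word, "*", "\\p{L}")
  stri_count_regex(text, newWord)
}))

# Divide by the total number of words to get relative frequencies.
ratios <- counts / stri_count_words(text)
names(ratios) <- words
ratios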

The result is:

activation  activated   activat* 
       0.2        0.2        0.4 

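In the code I convert * to \p{L}, which matches any single letter in a regular expression. After that, I count the matches of the resulting regular expression.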
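
Because \p{L} matches exactly one letter, a term such as "activat*" would not count the bare word "activat", and a plain term such as "activated" would also be counted inside a longer word such as "reactivated". The following is a sketch of a stricter variant in base R, not the answerer's method: it lets "*" stand for zero or more letters and anchors each term at word boundaries. The helper names to_pattern and count_matches are hypothetical, and the sketch assumes the terms contain only letters and "*" (no other regex metacharacters).

```r
text  <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")

# Hypothetical helper: build a whole-word regex from a term.
# Assumption: terms contain only letters and "*".
to_pattern <- function(term) {
  # "*" becomes "zero or more letters"; \b anchors the term at word boundaries.
  paste0("\\b", gsub("*", "[[:alpha:]]*", term, fixed = TRUE), "\\b")
}

# Hypothetical helper: number of non-overlapping matches of pattern in x.
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

# Named integer vector of counts, one entry per term.
counts <- vapply(words, function(w) count_matches(to_pattern(w), text), integer(1))
counts
```

For the sample text this gives the same counts as the stringi version (1, 1 and 2); the two approaches only differ on inputs such as "reactivated" or a bare "activat".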

Counting the number of times a wildcard word appears in a text (in R)
