Question

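I am doing bigram tokenization of words in R. All I have is one complete paragraph, which I have split into individual sentences. Now I need to create a list of words for each sentence.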

Input:

   [1] "The dog chased a cat."
   [2] "The cat climbed a tree"

Output:

    [1] [1] "The" "dog"
        [2] "chased" "the"
        [3] "cat".....

    [2] [1] "The" "cat"
        [2] "climbed" "the"

I need R code for this. I have already tried the following:

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))

Answer 1

You need to use the tm package together with your tokenizer.

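library(tm)
library(RWeka)

text <- c("The dog chased a cat.", "The cat climbed a tree")

# VCorpus rather than Corpus: in recent tm versions Corpus(VectorSource(...))
# builds a SimpleCorpus, which ignores custom tokenizers.
mycorp <- VCorpus(VectorSource(text))
# Emit only two-word n-grams.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(mycorp, control = list(tokenize = BigramTokenizer))
findFreqTerms(tdm)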
[1] "a cat"       "a tree"      "cat climbed" "chased a"    "climbed a"   "dog chased"  "the cat"     "the dog"
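
Note that findFreqTerms lists the bigrams pooled over all sentences. If you want them grouped per sentence, as the question asks, a minimal base-R sketch could look like the following. It assumes overlapping bigrams and plain whitespace splitting after stripping punctuation; `sentence_bigrams` is a hypothetical helper name, not part of tm or RWeka.

```r
sentences <- c("The dog chased a cat.", "The cat climbed a tree")

# Hypothetical helper: returns the overlapping word pairs of one sentence.
sentence_bigrams <- function(sentence) {
  # Strip punctuation, then split on runs of whitespace.
  words <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A sentence with fewer than two words has no bigrams.
  if (length(words) < 2) return(character(0))
  # Pair each word with its successor.
  paste(head(words, -1), tail(words, -1))
}

# One character vector of bigrams per sentence.
bigrams_by_sentence <- lapply(sentences, sentence_bigrams)
print(bigrams_by_sentence)
```

Unlike the TermDocumentMatrix route, this keeps the original capitalization, because nothing here lowercases the text. Calling `lapply(sentences, BigramTokenizer)` with the RWeka tokenizer should give a similar per-sentence list, if you prefer to stay with that package.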
