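I am doing bigram tokenization of words in R. All I have is one complete paragraph, which I have already split into individual sentences. Now I need to create a list of word pairs for each sentence.
Input: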
[1] "The dog chased a cat."
[2] "The cat climbed a tree"
Output:
[1] [1] "The" "dog"
[2] "chased" "the"
[3] "cat".....
[2] [1] "The" "cat"
[2] "climbed" "the"
I need R code for this. So far I have tried the following:
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))
Answer 0: (score: 1)
You need to use the tm package together with your tokenizer.
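library(tm)
library(RWeka)
text <- c("The dog chased a cat.", "The cat climbed a tree")
# Use VCorpus: in recent tm versions Corpus() returns a SimpleCorpus, which ignores custom tokenizers
mycorp <- VCorpus(VectorSource(text))
# Tokenizer that emits only two-word n-grams (bigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))
# Build a term-document matrix whose terms are bigrams
tdm <- TermDocumentMatrix(mycorp, control=list(tokenize = BigramTokenizer))
# With the default thresholds this lists every bigram in the corpus
findFreqTerms(tdm)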
[1] "a cat" "a tree" "cat climbed" "chased a" "climbed a" "dog chased" "the cat" "the dog"
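The result above pools the bigrams of all sentences into one vector, whereas the question asks for a separate list per sentence. A minimal sketch of that in base R follows. It assumes overlapping bigrams (as in the output above), lower-cased words, and ASCII text; sentence_bigrams is a hypothetical helper name, not part of tm or RWeka.

```r
# Hypothetical helper (not from tm or RWeka): overlapping bigrams of one sentence.
sentence_bigrams <- function(sentence) {
  # Lower-case, then split on any run of characters that is not a letter, digit, or apostrophe
  words <- strsplit(tolower(sentence), "[^a-z0-9']+")[[1]]
  # Drop the empty string that appears when the sentence starts with punctuation
  words <- words[nzchar(words)]
  # A sentence with fewer than two words has no bigrams
  if (length(words) < 2) return(character(0))
  # Pair every word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

text <- c("The dog chased a cat.", "The cat climbed a tree")
# One character vector of bigrams per sentence, in sentence order
bigrams <- lapply(text, sentence_bigrams)
print(bigrams)
```

Here bigrams[[1]] holds the bigrams of the first sentence and bigrams[[2]] those of the second. If you want to keep the original capitalization, drop the tolower() call and widen the character class to include upper-case letters.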