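I can't get my n-gram tokenizer to work. The unigram tokenizer seems to work fine, but as soon as I apply the bigram tokenizer to the corpus, it gives me back the same list of words as the unigram tokenizer. The code is below.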
library(tm)     # Corpus(), DirSource(), TermDocumentMatrix()
library(RWeka)  # NGramTokenizer(), Weka_control()
## Loading the data may be part of the problem
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs_sample <- SampleData(blogs, 0.01)
writeLines(blogs_sample, "blogs_sample.txt")
news_sample <- SampleData(news, 0.01)
writeLines(news_sample, "news_sample.txt")
twitter_sample <- SampleData(twitter, 0.01)
writeLines(twitter_sample, "twitter_sample.txt")
corpus <- Corpus(DirSource("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample", encoding = "UTF-8"),
                 readerControl = list(language = "en_US"))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_1 <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
Any help would be appreciated!
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS:
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.7-1 NLP_0.1-10
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 digest_0.6.12 crayon_1.3.2 SnowballC_0.5.1 slam_0.1-40 bitops_1.0-6 R6_2.2.2
[8] magrittr_1.5 swirl_2.4.3 httr_1.2.1 stringi_1.1.5 testthat_1.0.2 tools_3.4.0 stringr_1.2.0
[15] RCurl_1.95-4.8 yaml_2.1.14 parallel_3.4.0 compiler_3.4.0
Answer 0 (score: 0)
That looks really complicated. How about a simpler approach?
require(readtext)
require(quanteda)
mycorpus <- corpus(readtext("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample/*.txt"))
mydfm <- dfm(mycorpus, ngrams = 1:2, remove_punct = TRUE)
head(mydfm)
I can't show the output because I don't have your data, but this should work fine.
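In more recent quanteda releases, n-grams are usually built in a separate step with tokens_ngrams() rather than through an ngrams argument to dfm(). Here is a minimal sketch of that route. The two toy documents in txt are made up for illustration and stand in for your sample files, and the object names are hypothetical.

```r
library(quanteda)

# Toy documents standing in for the sampled text files (illustrative only)
txt <- c(doc1 = "The quick brown fox.",
         doc2 = "The quick red fox jumps!")

# Tokenize first, dropping punctuation
toks <- tokens(txt, remove_punct = TRUE)

# Build unigrams and bigrams explicitly; bigram parts are joined with "_" by default
toks_12 <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix (features are lowercased by default)
mydfm <- dfm(toks_12)

# Bigram features are the ones containing the "_" concatenator
bigrams <- grep("_", featnames(mydfm), value = TRUE)
print(bigrams)
```

If bigrams comes back empty, the n-gram step was not applied, which is a quick way to check for the symptom described in the question.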