I'm trying to use quanteda on my corpus, but I get:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. The dataset can be downloaded here: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
library(tm)
library(quanteda)

tweets <- read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(tweets$Tweet))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
Answer 0 (score: 1)
The processing you're doing with tm prepares an object for tm, and quanteda doesn't know what to do with it... quanteda performs all of these steps itself, as you can see from the options in help("dfm").
If you try the following instead, you can proceed:
dfm(tweets$Tweet, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE,
    removePunct = TRUE, removeTwitter = TRUE, language = "english",
    ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a character vector ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 6,943 documents
##    ... indexing features: 15,164 feature types
##    ... removed 161 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 2175 feature variants
##    ... created a 6943 x 12828 sparse dfm
##    ... complete.
## Elapsed time: 0.756 seconds.
HTH
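The argument names above (`toLower`, `removePunct`, `ignoredFeatures`) belong to the quanteda release current at the time of this answer. In later quanteda versions (>= 1.0) the same pre-processing is expressed as an explicit `tokens()` pipeline feeding `dfm()`. A minimal sketch on made-up sample texts (the `toy_texts` vector is a hypothetical stand-in for `tweets$Tweet`):

```r
library(quanteda)

# Hypothetical stand-in for tweets$Tweet; the real data comes from the CSV above
toy_texts <- c("Self-driving cars are the future!",
               "I would never trust a self-driving car.")

# Equivalent steps in the newer tokens() pipeline
toks <- tokens(toy_texts, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")
toy_dfm <- dfm(toks)

topfeatures(toy_dfm)  # inspect the most frequent features
```

`topfeatures()` is a quick sanity check that lowercasing, stopword removal, and stemming all took effect before moving on to analysis.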
Answer 1 (score: 1)
There's no need to start from the tm package, or even to use read.csv() at all; that is what quanteda's companion package readtext is for. So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
## Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1 19 21 1 2 0.7579
## text2 18 20 2 2 0.8775
## text3 23 24 1 -1 0.6805
## text5 17 19 2 0 1.0000
## text4 18 19 1 -1 0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
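The Sentiment columns in the summary above come along for free: readtext() keeps every non-text CSV column as a document variable, which `docvars()` exposes. A sketch on a toy data.frame that mirrors the real file's Tweet/Sentiment columns (the sample rows are made up; `corpus()` accepts a data.frame with a `text_field` just as it accepts a readtext object):

```r
library(quanteda)

# Toy stand-in for the CSV: a text column plus a metadata column
dat <- data.frame(
  Tweet = c("Self-driving cars are the future", "Not sure I trust them"),
  Sentiment = c(2, -1),
  stringsAsFactors = FALSE
)

toyCorpus <- corpus(dat, text_field = "Tweet")

# The non-text columns ride along as document variables
docvars(toyCorpus)
```

Keeping metadata attached this way means you can later subset the corpus by sentiment without re-joining against the original CSV.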
From there, you can perform all of the pre-processing, including stemming and the choice of ngrams, directly in the dfm() call:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
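As with the pre-processing arguments, the `ngrams` argument above is from the dfm() interface of the quanteda version used in this answer; later releases form ngrams with `tokens_ngrams()` before building the dfm. A minimal sketch on a made-up sentence:

```r
library(quanteda)

toks <- tokens("self driving cars will change how we drive",
               remove_punct = TRUE)
toks <- tokens_remove(tokens_tolower(toks), stopwords("english"))

# Form bigrams from the filtered tokens, then build the dfm
bigrams <- tokens_ngrams(toks, n = 2)
dfm_bi <- dfm(bigrams)

featnames(dfm_bi)  # bigram features, joined with "_" by default
```

Note that removing stopwords before calling `tokens_ngrams()` lets bigrams span the gaps they leave, which is also what the one-step `dfm(..., ngrams = 2)` call does.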