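I have been wrestling with R, trying to classify tweets with a Naive Bayes classifier.
Data:
A training set with 2 columns: Tweet and class. There are 300 tweets in total: 150 are labelled "App" and 150 are labelled "Other".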
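Goal:
The test set has 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". I want to predict these labels. I can build a Naive Bayes model in Excel (blekh) that correctly predicts 19 of the 20.
I want to replicate this in R.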
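For reference, the kind of calculation I am trying to reproduce looks roughly like the sketch below. It assumes the spreadsheet model is a word-count (multinomial) Naive Bayes with add-one smoothing. The toy tweets and the helper names tokenize, train_nb and predict_nb are made up for illustration and are not part of my real script.

```r
# Toy data, made up for illustration (not the real tweets).
train <- data.frame(
  Tweet = c("mandrill email service smtp",
            "mandrill transactional email help",
            "mega mandrill boss fight",
            "mandrill monkey at the zoo"),
  class = c("App", "App", "Other", "Other"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, split on non-letters, keep words of 4+ characters.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nchar(words) >= 4]
}

# Hypothetical helper: per-class word counts and class priors.
train_nb <- function(tweets, classes) {
  tokens <- lapply(tweets, tokenize)
  vocab <- sort(unique(unlist(tokens)))
  # One column of word counts per class, one row per vocabulary word.
  counts <- vapply(split(tokens, classes), function(tok) {
    as.numeric(table(factor(unlist(tok), levels = vocab)))
  }, numeric(length(vocab)))
  rownames(counts) <- vocab
  list(vocab = vocab, counts = counts,
       prior = table(classes) / length(classes))
}

# Hypothetical helper: pick the class with the highest log posterior score.
predict_nb <- function(model, tweet, laplace = 1) {
  words <- tokenize(tweet)
  words <- words[words %in% model$vocab]  # ignore words never seen in training
  # Smoothed P(word | class): (count + laplace) / (class total + laplace * |vocab|)
  denom <- colSums(model$counts) + laplace * length(model$vocab)
  log_lik <- log(sweep(model$counts + laplace, 2, denom, "/"))
  scores <- log(as.numeric(model$prior)) +
    colSums(log_lik[words, , drop = FALSE])
  names(scores) <- colnames(model$counts)
  names(which.max(scores))
}

model <- train_nb(train$Tweet, train$class)
predict_nb(model, "can mandrill send transactional email")  # "App"
predict_nb(model, "mandrill monkey boss")                   # "Other"
```

Working in log space avoids multiplying many small probabilities together, and words that never occurred in training are simply skipped.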
Code snippet
library(tm)
library('e1071')
# Custom function: lower-case the text and replace selected punctuation with spaces
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x
}
# Process text - tolower(), remove punctuation etc.
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)
# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))
# Create term-document matrices, keeping only words of length 4 or more,
# then transpose so that rows are tweets and columns are terms
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))
# Build model with additive (Laplace) smoothing set to 1
model <- naiveBayes(as.matrix(tweets.train.matrix), as.factor(tweets.all$class), laplace = 1)
# Predict
results <- predict(object = model, newdata = as.matrix(tweets.test.matrix))
results
Data sample
Calling head(tweets.all) gives:
Tweet class
1 [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail: https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter #plone App
2 [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server: https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter #plone App
3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com with additional details? App
4 @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent. App
5 @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway App
6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com so we can get details? App
Calling head(tweets.test) gives:
Tweet
1 Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2 @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3 @veroapp Any chance you'll be adding Mandrill support to Vero?
4 @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5 would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6 From Coworker about using Mandrill: "I would entrust email handling to a Pokemon".
Output
This is what I get:
[1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
Levels: App Other
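This is garbage, i.e. nothing is being classified correctly: every tweet comes back as "Other". Any idea what I am doing wrong?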
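One thing that may be worth checking: tweets.train.matrix and tweets.test.matrix are built from two separate corpora, so their columns (terms) will generally differ. I have not confirmed whether this is related to the problem. The sketch below uses small made-up matrices rather than my real data, and align_columns is a hypothetical helper written only for this illustration.

```r
# Toy term-count matrices, made up for illustration.
# Rows are tweets, columns are terms.
train_terms <- matrix(c(1, 0, 1, 1, 0, 1), nrow = 2,
                      dimnames = list(NULL, c("email", "mandrill", "smtp")))
test_terms <- matrix(c(1, 1, 0, 1), nrow = 2,
                     dimnames = list(NULL, c("mandrill", "pokemon")))

# Terms that appear in only one of the two matrices.
only_in_train <- setdiff(colnames(train_terms), colnames(test_terms))
only_in_test <- setdiff(colnames(test_terms), colnames(train_terms))

# Hypothetical helper: rebuild a matrix on a reference set of terms.
# Terms missing from new_matrix are filled with 0; extra terms are dropped.
align_columns <- function(new_matrix, reference_terms) {
  aligned <- matrix(0, nrow = nrow(new_matrix), ncol = length(reference_terms),
                    dimnames = list(rownames(new_matrix), reference_terms))
  shared <- intersect(colnames(new_matrix), reference_terms)
  aligned[, shared] <- new_matrix[, shared, drop = FALSE]
  aligned
}

test_aligned <- align_columns(test_terms, colnames(train_terms))
print(only_in_train)
print(only_in_test)
print(test_aligned)
```

With the real data, the same comparison would be run on colnames(as.matrix(tweets.train.matrix)) and colnames(as.matrix(tweets.test.matrix)).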