Naive Bayes classification with R

Asked: 2014-05-03 15:49:19

Tags: r twitter classification bayesian naivebayes

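I have been wrestling with R, trying to classify tweets with a Naive Bayes classifier model.

Data

A training set with 2 columns: Tweet and class. There are 300 tweets in total: 150 labeled "App" and 150 labeled "Other".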

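Goal:

The test set has 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". I want to predict their classes. I can successfully build a Naive Bayes model in Excel (blekh) that correctly predicts 19 of the 20.

I want to replicate this in R.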
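For reference, here is a minimal base-R sketch of the kind of model described above. The spreadsheet itself is not shown in this post, so this assumes it is a multinomial Naive Bayes with add-one (Laplace) smoothing over word counts. The toy tweets and the helper names (tokenize, classify) are made up for illustration and are not the data from this question.

```r
# Toy training data, invented for illustration only
train <- data.frame(
  Tweet = c("mandrill email api down",
            "send email with mandrill smtp",
            "mandrill baboon at the zoo",
            "spark mandrill megaman music"),
  class = factor(c("App", "App", "Other", "Other")),
  stringsAsFactors = FALSE
)

# Lowercase, split on whitespace, keep tokens of 4+ characters
# (mirrors the intent of wordLengths = c(4, Inf))
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[[:space:]]+")[[1]]
  tokens[nchar(tokens) >= 4]
}

vocab <- unique(unlist(lapply(train$Tweet, tokenize)))

# Word counts per class: rows are vocabulary terms, columns are classes
counts <- sapply(levels(train$class), function(cl) {
  tokens <- unlist(lapply(train$Tweet[train$class == cl], tokenize))
  tabulate(match(tokens, vocab), nbins = length(vocab))
})
rownames(counts) <- vocab

# Add-one smoothing: (count + 1) / (total words in class + vocabulary size)
log_lik <- log(sweep(counts + 1, 2, colSums(counts) + length(vocab), "/"))
log_prior <- log(as.vector(table(train$class)) / nrow(train))
names(log_prior) <- levels(train$class)

# Score a new text; words outside the training vocabulary are ignored
classify <- function(text) {
  tokens <- tokenize(text)
  tokens <- tokens[tokens %in% vocab]
  scores <- log_prior + colSums(log_lik[tokens, , drop = FALSE])
  names(which.max(scores))
}

classify("mandrill email api")  # "App"
classify("baboon at the zoo")   # "Other"
```

Note that this variant only ever looks at words seen in training and smooths their counts, which is one way a hand-built model can differ from what a library call does by default.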

Code

library(tm)
library(e1071)

# Custom Function 
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x
}

# Process text - tolower(), remove punctuation etc. 
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)

# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))

# Create term-document matrices, keeping only words of 4 or more characters
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))

# Build model with additive smoothing as 1
model <- naiveBayes(as.matrix(tweets.train.matrix), as.factor(tweets.all$class), laplace = 1)

# Predict
results <- predict(object = model, newdata = as.matrix(tweets.test.matrix))
results
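Two things in the code above may be worth checking. Neither is confirmed in this thread, so treat them as guesses. First, the training and test matrices are built independently, so their columns (terms) will generally differ. Second, as far as the e1071 documentation describes, naiveBayes models numeric predictors as Gaussian and applies laplace only to categorical ones, so raw counts may not be smoothed at all. The sketch below shows one way to address both: reuse the training vocabulary via the dictionary control option in tm, and convert counts to "No"/"Yes" factors. The toy data and the to_presence helper are hypothetical.

```r
library(tm)
library(e1071)

# Toy data, invented for illustration only
train <- data.frame(
  Tweet = c("mandrill email service is great",
            "sending transactional email through mandrill smtp",
            "mandrill baboon spotted at the zoo",
            "spark mandrill is the best megaman song"),
  class = factor(c("App", "App", "Other", "Other")),
  stringsAsFactors = FALSE
)
test <- data.frame(Tweet = c("mandrill transactional email", "baboon song"),
                   stringsAsFactors = FALSE)

train_corpus <- VCorpus(VectorSource(train$Tweet))
test_corpus  <- VCorpus(VectorSource(test$Tweet))

train_dtm <- DocumentTermMatrix(train_corpus,
                                control = list(wordLengths = c(4, Inf)))

# Restrict the test matrix to the training vocabulary so both
# matrices end up with the same set of columns
test_dtm <- DocumentTermMatrix(test_corpus,
                               control = list(dictionary = Terms(train_dtm),
                                              wordLengths = c(4, Inf)))

# Turn counts into presence/absence factors with fixed levels, so that
# naiveBayes treats them as categorical and laplace takes effect
to_presence <- function(dtm) {
  df <- as.data.frame(as.matrix(dtm) > 0)
  df[] <- lapply(df, factor, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  df
}

model <- naiveBayes(to_presence(train_dtm), train$class, laplace = 1)
pred <- predict(model, to_presence(test_dtm))
pred
```

Presence/absence features make this a Bernoulli-style model rather than a count-based one, so its predictions will not necessarily match a spreadsheet that works on word counts.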

Data sample

Calling head(tweets.all) produces:

 Tweet class
 1                            [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail:  https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 2                     [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server:  https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com  with additional details?   App
 4                    @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent.   App
 5      @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway   App
 6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com  so we can get details?   App

Calling head(tweets.test) produces:

Tweet
1   Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2   @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3   @veroapp Any chance you'll be adding Mandrill support to Vero?
4   @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5   would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6   From Coworker about using Mandrill:  "I would entrust email handling to a Pokemon".

Output

This is what I get:

 [1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
 Levels: App Other
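To put a number on this output, it can be scored against the known test labels (first 10 "App", last 10 "Other"). A small base-R sketch, where the vector names predicted and expected are chosen here for illustration:

```r
# The prediction vector as printed above: all 20 tweets labeled "Other"
predicted <- factor(rep("Other", 20), levels = c("App", "Other"))

# Known test labels: first 10 are "App", last 10 are "Other"
expected <- factor(rep(c("App", "Other"), each = 10), levels = c("App", "Other"))

# Confusion table and overall accuracy
confusion <- table(expected = expected, predicted = predicted)
accuracy <- mean(predicted == expected)

print(confusion)
print(accuracy)  # 0.5
```

An accuracy of 0.5 on a balanced two-class test set is what always predicting a single class would give.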

This is rubbish, i.e. it is not classifying correctly. Any idea what I am doing wrong?

0 answers
