My code is as follows:
library('RMySQL')
library('DMwR')
library('tm')
library('Snowball')
library('SnowballC')

# Pull up to 500 outlier and 500 non-outlier tweets from MySQL
con <- dbConnect(MySQL(), user="root", password="stuff0645", dbname="TwitterCelebs", host="localhost")
rt_outlier <- dbGetQuery(con, "SELECT *,tweet_text from outlier_info,tweets where outlier_info.tweet_id=tweets.tweet_id limit 500")
rt_not_outlier <- dbGetQuery(con, "Select *, tweet_text from not_outlier_info,tweets where not_outlier_info.tweet_id=tweets.tweet_id limit 500")
dbDisconnect(con)

# Stack both sets (outliers first) and normalise the text encoding
all_tweets <- rbind(rt_outlier, rt_not_outlier)
all_tweets[,"tweet_text"] <- iconv(all_tweets[,"tweet_text"], to = "utf-8")

# Build and clean the corpus
corpus <- Corpus(VectorSource(all_tweets[,"tweet_text"]))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)

# TF-IDF weighted term-document matrix, transposed so that each row is a tweet
mydata.dtm <- TermDocumentMatrix(corpus, control=list(weighting=weightTfIdf, minWordLength=2, findFreqTerms=5))
dataframe <- as.data.frame(inspect(mydata.dtm))
d <- as.data.frame(t(dataframe))

# Class labels in rbind order: 0 for the outlier tweets, 1 for the non-outlier tweets
classData <- c(rep(0,500), rep(1,500))
classData <- as.factor(classData)

# Train a J48 tree with repeated cross-validation
library('caret')
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
set.seed(2)
mymodel <- train(d, classData, trControl=ctrl, method="J48", model=FALSE)
Basically, what happens here is that I keep getting the following error and warnings:
Error in train.default(d, classData, method = "J48", model = FALSE) :
final tuning parameters could not be determined
In addition: Warning messages:
1: In train.default(d, classData, method = "J48", model = FALSE) :
Models using Weka will not work with parallel processing with multicore/doMC
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
3: In train.default(d, classData, method = "J48", model = FALSE) :
missing values found in aggregated results
What am I doing wrong? Also note that I pass model = FALSE to train to save memory, since memory is a problem here.
Answer 0 (score: 1)
Did you see the message "Models using Weka will not work with parallel processing with multicore/doMC"?
Try running sequentially.
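One way to force a sequential run might look like the sketch below. It assumes a parallel backend such as doMC was registered earlier in the session, and it uses the built-in iris data as a stand-in for the question's d and classData so that the snippet is self-contained. RWeka and a working Java installation are assumed to be available for J48.

```r
library(caret)    # train(), trainControl()
library(foreach)  # registerDoSEQ(), getDoParWorkers()

# Replace any previously registered parallel backend (such as doMC)
# with foreach's built-in sequential backend.
registerDoSEQ()

# Additionally ask caret not to use a parallel backend during resampling.
ctrl <- trainControl(method = "repeatedcv", repeats = 3, allowParallel = FALSE)

set.seed(2)
# iris stands in for the question's d / classData (assumption for illustration).
mymodel <- train(iris[, 1:4], iris$Species, method = "J48", trControl = ctrl)
print(mymodel)
```

With the real data, the call would take d and classData in place of the iris columns. Whether this alone removes the "final tuning parameters could not be determined" error is not certain, but it does rule out the Weka/parallel conflict that the warning points at.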
Better yet, use C5.0 in parallel. J48 is not a great implementation of C4.5, especially when splitting on categorical predictors.
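A minimal sketch of the C5.0 route follows. The doParallel backend and the two-worker cluster are assumptions chosen for illustration (the warning in the question mentions multicore/doMC, which would work similarly on Unix-like systems), and iris again stands in for the question's d and classData. The C50 package is assumed to be installed.

```r
library(caret)       # train(), trainControl()
library(doParallel)  # registerDoParallel(); also attaches foreach and parallel

# Start a small local cluster; 2 workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)
registerDoParallel(cl)

ctrl <- trainControl(method = "repeatedcv", repeats = 3, allowParallel = TRUE)

set.seed(2)
# method = "C5.0" uses the C50 package, which does not depend on Weka/Java,
# so resampling can be spread across the workers.
c50_model <- train(iris[, 1:4], iris$Species, method = "C5.0", trControl = ctrl)

# Shut the workers down and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()

print(c50_model)
```

Because C5.0 does not go through the Java-based Weka interface, the parallel-processing warning shown in the question should not apply to it.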
Max