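I am new to R and am working on a machine learning problem. I understand that machine learning needs labeled data to make accurate predictions.
I am working with text data in which users have written free-text reviews of a particular mobile app. My first task is to extract the main keywords (features) from each review.
The text in the "review" column of my CSV file looks like this:
review1 - "the gps does not work",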
review2 - "tracking of phone is inconsistent",
review3 - "the battery is draining fast",
review4 - "the tracks disappear after some time",
review5 - "the app consumes the battery lot because of gps"
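Now I want to extract the features mentioned in each review, e.g. "gps", "tracking", "battery", "tracks", "battery gps", and add them to the CSV file as a label for each review, so that the CSV file gets one more column named "Feature". My CSV will then have two columns: a review column and a Feature column that highlights the features mentioned in the review. A snapshot of the data in the new CSV file:
[image: new csv file data]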
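To make the target layout concrete, here is a minimal base R sketch of one possible way to fill such a Feature column. It assumes a keyword list is already known; `keywords`, `tag_review` and the output file name are hypothetical names introduced for illustration only, and simple whole-word matching is just one option, not necessarily the right method for real data.

```r
# Reviews as they appear in the "review" column of the question.
reviews <- data.frame(
  review = c("the gps does not work",
             "tracking of phone is inconsistent",
             "the battery is draining fast",
             "the tracks disappear after some time",
             "the app consumes the battery lot because of gps"),
  stringsAsFactors = FALSE
)

# Assumption: a hand-curated keyword list (hypothetical, not from the question).
keywords <- c("gps", "tracking", "battery", "tracks")

# Return the keywords found in one review, in order of first appearance,
# joined by a single space. Matching is on whole lower-cased words.
tag_review <- function(text, vocab) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  paste(unique(words[words %in% vocab]), collapse = " ")
}

reviews$Feature <- vapply(reviews$review, tag_review, character(1),
                          vocab = keywords, USE.NAMES = FALSE)

# Write the two-column result; the file name is a placeholder.
out_file <- file.path(tempdir(), "reviews_tagged.csv")
write.csv(reviews, out_file, row.names = FALSE)
```

With the five sample reviews this gives "gps", "tracking", "battery", "tracks" and "battery gps". A review that contains none of the keywords gets an empty string, which would need separate handling.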
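I have written the sample code below, but since I need to process thousands of reviews, I need a way to get the Feature column into my CSV file, because it will serve as the label for feature prediction.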
#Feature Prediction
library(tm)
library(e1071)
texts <- c("the gps does not work",
"tracking of phone is inconsistent",
"the battery is draining fast",
"the tracks disappear after some time",
"the app consumes the battery a lot")
features <- c("gps", "tracking", "battery", "tracks","battery")
docs <- VCorpus(VectorSource(texts))
# Clean corpus
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
# Transform dtm to matrix to data frame - df is easier to work with
mat.df <- as.data.frame(as.matrix(dtm), stringsAsFactors = FALSE)
# Column bind category (known classification)
mat.df <- cbind(mat.df, features)
View(mat.df)
# Split data by row number into two roughly equal portions (train and test data)
set.seed(123)  # fix the random seed so the split is reproducible
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .50))
test <- setdiff(seq_len(nrow(mat.df)), train)
# Isolate classifier
cl <- mat.df[, "features"]
# Create model data and remove "features"
modeldata <- mat.df[,!colnames(mat.df) %in% "features"]
feature_pred <- naiveBayes(modeldata[train,], cl[train])
naiv_pred <- predict(feature_pred, modeldata[test,])
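# Use the same levels for both factors so the diagonal counts correct predictions
lvls <- sort(unique(as.character(cl)))
conf.mat <- table(Predictions = factor(naiv_pred, levels = lvls), Actual = factor(cl[test], levels = lvls))
conf.mat
(accuracy <- sum(diag(conf.mat)) / length(test) * 100)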
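
The labels in `features` above are typed in by hand, which does not scale to thousands of reviews. As a rough sketch of how candidate keywords might be surfaced for manual review, the base R snippet below counts term frequencies over the same sample texts. The `stop_words` vector is a small hand-written list assumed for this example only (it is not `stopwords("english")` from tm), and the most frequent terms are not guaranteed to be app features.

```r
texts <- c("the gps does not work",
           "tracking of phone is inconsistent",
           "the battery is draining fast",
           "the tracks disappear after some time",
           "the app consumes the battery a lot")

# Assumption: a tiny illustrative stop word list, written by hand.
stop_words <- c("the", "of", "is", "a", "does", "not",
                "after", "some", "lot")

# Split every review into lower-case words and drop empty strings and stop words.
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Count how often each remaining term occurs, most frequent first.
term_freq <- sort(table(tokens), decreasing = TRUE)
head(term_freq, 10)
```

Terms near the top of `term_freq` could then be checked by hand and, where they really name a feature, added to a keyword list used for labeling.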