我正在尝试使用Naive-bayes算法构建文本分类模型。
这是我的样本数据(标签和功能):
1|combusting [chemical]
1|industrial purposes
1|
2|salt for preserving,
2|other for foodstuffs
2|auxiliary
2|fluids for use with abrasives
3|vulcanisation
3|accelerators
3|anti-frothing solutions for batteries
4|anti-frothing solutions for accumulators
4|acetates
4|[chemicals]*
4|acetate of cellulose, unprocessed
以下是我的示例代码
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
val rawData = sc.textFile("~/data.csv")
val rawData1 = rawData.map(x => x.replaceAll(",",""))
val htf = new HashingTF(1000)
val parsedData = rawData1.map { line =>
val values = (line.split("|").toSeq)
val featureVector = htf.transform(values(1).split(" "))
val label = values(0).toDouble
LabeledPoint(label, featureVector)
}
val splits = parsedData.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 2.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
val testData1 = htf.transform("salt")
val predictionAndLabels1 = model.predict(testData1)
我的准确度大约为33%(非常低),测试数据预测错误的标签。我打印了包含标签和功能的parsedData,如下所示:
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
我无法找到失踪的东西;散列项频率函数似乎产生重复的数据项频率。请建议我提高模型性能,提前致谢
答案 0 :(得分:0)
在开始实施算法之前,你必须先问自己很多问题:
ctrl + alt + dlt
通常,使用Pipelines和CrossValidation与Naive Bayes一起使用进行多类分类,因此请查看here,而不是用手硬编码所有septs。