I am running a naive Bayes classification in Spark / Scala. It seems to work fine; the code is:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
// Cast the boolean racist label to String so StringIndexer can index it
val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))
// Encode the string label as a numeric column, indexracist
val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val indexed = indexer.fit(dfLemma2).transform(dfLemma2)
indexed.show()
// Hash the tokenized lemma column into term-frequency vectors
val hashingTF = new HashingTF()
  .setInputCol("lemma").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(indexed)
// Rescale the term frequencies by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "indexracist").take(3).foreach(println)
val changedTypedf = rescaledData.withColumn("indexracist", 'indexracist.cast("double"))
changedTypedf.show()
// val labeled = changedTypedf.map(row => LabeledPoint(row(0), row.getAs[Vector](4)))
// Convert to an RDD[LabeledPoint] for the RDD-based (mllib) NaiveBayes
val labeled = changedTypedf.select("indexracist", "features").rdd.map(row => LabeledPoint(
  row.getAs[Double]("indexracist"),
  org.apache.spark.mllib.linalg.Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))
))
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
// Split data into training (60%) and test (40%).
val Array(training, test) = labeled.randomSplit(Array(0.6, 0.4))
// Train a multinomial naive Bayes model with Laplace smoothing (lambda = 1.0)
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
predictionAndLabel.take(100)
Output:
res330: Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.0), (0.0,0.0), (0.0,0.0),
I assume this is an array of (prediction, label) pairs. What I would like to output is these pairs joined with the original text, which is a column called lemma in the training DataFrame, like this:
--------------------------------------------------
| Prediction | Label | lemma |
--------------------------------------------------
| 0.0 | 0.0 |[cakes, are, good] |
| 0.0 | 0.0 |[jim, says, hi] |
| 1.0 | 1.0 |[shut, the, dam, door]|
...
--------------------------------------------------
Any pointers would be appreciated, as my Spark / Scala is weak.
Edit: the text column is named 'lemma' in 'indexed':
+------+-------------------------------------------------------------------------------------------------------------------+
|racist|lemma |
+------+-------------------------------------------------------------------------------------------------------------------+
|true |[@cllrwood, abbo, @ukip, britainfirst] |
|false |[objectofthemonth, george, lansbury, bust, jussuf, abbo, amp, fascinating, insight, son, jerome] |
|false |[nowplay, one, night, stand, van, brave, @bbraveofficial, bbravesquad, abbo, safe] |
|false |[@mahesh, weet, son, satyamurthy, kante, abbo, chana, better, aaamovie] |
Answer 0 (score: 1)
You just need to transform the data and show it, as follows:
val predictions = model.transform(test)
predictions.show()
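Note that transform is a method of the DataFrame-based (ml) NaiveBayesModel; the model in the question was trained with the RDD-based mllib API, which only offers predict. If you stay with the RDD API, one way to keep the original text next to each prediction is to carry the lemma column alongside every LabeledPoint. A minimal sketch, assuming the changedTypedf DataFrame from the question (columns indexracist, features and lemma):
// Sketch only: keep lemma next to each LabeledPoint so the text survives the split
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val withText = changedTypedf.select("indexracist", "features", "lemma").rdd.map { row =>
  val point = LabeledPoint(
    row.getAs[Double]("indexracist"),
    Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features")))
  (point, row.getAs[Seq[String]]("lemma"))
}
val Array(trainPart, testPart) = withText.randomSplit(Array(0.6, 0.4))
val nbModel = NaiveBayes.train(trainPart.map(_._1), lambda = 1.0, modelType = "multinomial")
// (prediction, label, lemma) triples for the test set
val predictionLabelText = testPart.map { case (point, lemma) =>
  (nbModel.predict(point.features), point.label, lemma)
}
predictionLabelText.take(10).foreach(println)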
Answer 1 (score: 1)
Try using the ml package instead of the mllib package. For example, see https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object NaiveBayesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesExample")
      .getOrCreate()

    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Split the data into training and test sets (40% held out for testing)
    val Array(trainingData, testData) = data.randomSplit(Array(0.6, 0.4))

    // Train a NaiveBayes model.
    val model = new NaiveBayes()
      .fit(trainingData)

    // Select example rows to display.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute test error
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test set accuracy = " + accuracy)

    spark.stop()
  }
}
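To apply this example to the data in the question, a rough sketch (assuming the rescaledData DataFrame built there, with columns lemma, indexracist and features) is to point the estimator at those columns; since transform keeps the input columns, lemma can then be selected next to the prediction:
// Sketch adapting the ml-package example to the question's columns
import org.apache.spark.ml.classification.NaiveBayes
val Array(trainDF, testDF) = rescaledData.randomSplit(Array(0.6, 0.4))
val nb = new NaiveBayes()
  .setLabelCol("indexracist")
  .setFeaturesCol("features")
val nbModel = nb.fit(trainDF)
// transform preserves the input columns, so lemma is still available here
nbModel.transform(testDF)
  .select("prediction", "indexracist", "lemma")
  .show(false)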
Answer 2 (score: 1)
As other answers have said, it is recommended to use the ml package rather than the mllib package since Spark 2.0. Once the code is rewritten with the ml package, the answer to your question becomes very simple: just selecting the right columns gives you what you need.
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))
val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val hashingTF = new HashingTF()
  .setInputCol("lemma")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val naiveBayes = new NaiveBayes()
  .setLabelCol("indexracist")
  .setFeaturesCol("features")
  .setModelType("multinomial")
  .setSmoothing(1.0)
val pipeline = new Pipeline().setStages(Array(indexer, hashingTF, idf, naiveBayes))
val Array(training, test) = dfLemma2.randomSplit(Array(0.6, 0.4))
val model = pipeline.fit(training)
val predictionAndLabel = model.transform(test).select('Prediction, 'racist, 'indexracist, 'lemma)
predictionAndLabel.take(100)
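If you also want an accuracy figure for this pipeline, the multiclass evaluator shown in the previous answer can be pointed at the indexracist label (a sketch using the column names from the code above):
// Sketch: score the pipeline's predictions on the test split
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexracist")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(model.transform(test))
println(s"Test set accuracy = $accuracy")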
Hope this helps; if not, leave a comment on your question.
Answer 3 (score: 0)
When selecting the output columns to display, try including the "lemma" column as well, so that it is written out together with the label and features columns.
For more details, see How to create correct data frame for classification in Spark ML. That post is quite similar to your question; check whether it helps.
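For example, mirroring the select in the question but keeping the text column too (a sketch, assuming the rescaledData DataFrame from the question):
// Sketch: show lemma together with the label and features columns
rescaledData.select("features", "indexracist", "lemma").show(3, false)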
Answer 4 (score: -2)
We have to use a pipeline in order to get the training columns in addition to the prediction column.
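A minimal sketch of that idea, reusing the fitted PipelineModel and the test split from the pipeline answer above: the input columns, including lemma, pass through transform unchanged and can be selected next to the prediction.
// Sketch: a PipelineModel's transform keeps the input columns, so lemma is still there
model.transform(test).select("lemma", "indexracist", "prediction").show(false)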