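I have just started using Spark ML pipelines to implement a multiclass classifier with LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter).
I followed this example: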
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}
case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)
val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
LabeledDocument(0L, "a b c d e spark", 1.0),
LabeledDocument(1L, "b d", 0.0),
LabeledDocument(2L, "spark f g h", 1.0),
LabeledDocument(3L, "hadoop mapreduce", 0.0)))
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)
// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
Document(4L, "spark i j k"),
Document(5L, "l m n"),
Document(6L, "mapreduce spark"),
Document(7L, "apache hadoop")))
// Make predictions on test documents.
model.transform(test.toDF)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
sc.stop()
The problem is that the LogisticRegression class used by ML defaults to 2 classes (line 176): override val numClasses: Int = 2
Any idea how to solve this?
Thanks
Answer 0: (score: 1)
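As Odomontois already mentioned, if you want a basic NLP pipeline built with Spark ML Pipelines, you have only two options:
1. One-vs-Rest wrapped around the existing LogisticRegression: new OneVsRest().setClassifier(logisticRegression)
2. CountVectorizer features combined with the NaiveBayes classifier, which supports multiclass classification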
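The first option can be sketched roughly as follows. This is only an illustration, not the asker's actual setup: it assumes Spark 2.x or later with a local SparkSession (the question used SQLContext), and the three-class training rows, the app name, and the variable names are made up for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; adjust master/appName for a real cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("OneVsRestSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data with three classes (labels 0.0, 1.0, 2.0).
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "scala akka play", 2.0),
  (5L, "scala sbt", 2.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// OneVsRest fits one binary LogisticRegression per class and
// predicts the class whose binary model is most confident.
val ovr = new OneVsRest().setClassifier(lr)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, ovr))
val model = pipeline.fit(training)

// Collect predictions locally before stopping Spark.
val predictedLabels = model.transform(training)
  .select("prediction")
  .collect()
  .map(_.getDouble(0))
predictedLabels.foreach(println)

spark.stop()
```

Note that, as far as I know, the One-vs-Rest model does not produce a "probability" column, so the select("id", "text", "probability", "prediction") from the question would need to drop "probability" when this wrapper is used.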
Answer 1: (score: 0)
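But your sample data only has two classes, so why would "auto" mode do anything other than binomial? You can force multinomial classification, though: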
val family: Param[String]
Param for the name of family which is a description of the label distribution to be used in the model. Supported options:
"auto": Automatically select the family based on the number of classes: If numClasses == 1 || numClasses == 2, set to "binomial". Else, set to "multinomial"
"binomial": Binary logistic regression with pivoting.
"multinomial": Multinomial logistic (softmax) regression without pivoting.
Default is "auto".
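Setting the parameter described above might look like the sketch below. It assumes a Spark version in which spark.ml's LogisticRegression exposes setFamily (2.1 or later, to my knowledge) and a local SparkSession; the tiny three-class dataset and all variable names are hypothetical.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; adjust for a real cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("MultinomialLRSketch")
  .getOrCreate()

// Hypothetical data: three classes, three features.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(1.0, 0.0, 0.0)),
  (0.0, Vectors.dense(0.9, 0.1, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0, 0.0)),
  (1.0, Vectors.dense(0.1, 0.9, 0.0)),
  (2.0, Vectors.dense(0.0, 0.0, 1.0)),
  (2.0, Vectors.dense(0.0, 0.1, 0.9))
)).toDF("label", "features")

// Force the multinomial (softmax) family instead of relying on "auto".
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFamily("multinomial")

val lrModel = lr.fit(training)

// A multinomial model keeps one row of coefficients per class.
val numClasses = lrModel.numClasses
val coefficientShape =
  (lrModel.coefficientMatrix.numRows, lrModel.coefficientMatrix.numCols)
println(s"numClasses=$numClasses, coefficientMatrix=${coefficientShape._1}x${coefficientShape._2}")

spark.stop()
```

With more than two label values in the data, "auto" should already select "multinomial"; setting the family explicitly mainly documents the intent and also forces softmax when only two classes are present.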