I have vectors (LabeledPoint-s), each tagged with a group number. For every group, I need to train a separate logistic regression classifier:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((1, 1.5), (2, 4.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 2.0), (1, 1.0), (2, 3.5))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 3.0), (2, 7.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (1, 3.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.5), (2, 4.0)))))
  )

  def main(args: Array[String]): Unit = {
    // set up environment
    val conf = new SparkConf()
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val trainRDD = sc.parallelize(train)
    // Does not compile: run expects an RDD[LabeledPoint], but iter is an Iterable[LabeledPoint]
    val modelByGroup = trainRDD.groupByKey().map({case (group, iter) =>
      (group, new LogisticRegressionWithLBFGS().run(iter))})
  }
}
Update - demonstrating that nested RDD iteration does not work:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((1, 1.5), (2, 4.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 2.0), (1, 1.0), (2, 3.5))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 3.0), (2, 7.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (1, 3.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.5), (2, 4.0)))))
  )

  def main(args: Array[String]): Unit = {
    // set up environment
    val conf = new SparkConf()
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val trainRDD = sc.parallelize(train)
    val keys : RDD[Int] = trainRDD.map({case (key,_) => key}).distinct
    // keys is an RDD, so this loop body runs on the executors, not on the driver
    for (key <- keys) {
      // key is Int here!
      // Get train data for the current group (key):
      val groupTrain = trainRDD.filter({case (x, _) => x == key }).cache()

      /*
       * Which results in org.apache.spark.SparkException:
       * RDD transformations and actions can only be invoked by the driver,
       * not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid
       * because the values transformation and count action cannot be performed inside of the rdd1.map transformation.
       * For more information, see SPARK-5063. at org.apache.spark.rdd.RDD.sc(RDD.scala:87)
       */
    }
  }
}
Answer 0 (score: 3)
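If you are training a classifier on each group separately, you don't need mllib. Mllib is designed to work on distributed collections, and yours are not: after grouping, each group is a local collection sitting on one worker. You can simply run a local machine learning library such as weka on each group inside the map function.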
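As a rough illustration of the "local library per group" idea, here is a minimal sketch in plain Scala with no Spark dependency. Everything in it is an assumption made for illustration: `trainLocal` and `predict` are hypothetical helpers standing in for a real local library such as weka (they use plain batch gradient descent with no intercept), and the tiny data set is made up rather than taken from the question.

```scala
object LocalPerGroup {
  // One training example: (label in {0.0, 1.0}, dense feature array).
  type Example = (Double, Array[Double])

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(w: Array[Double], x: Array[Double]): Double =
    w.indices.map(i => w(i) * x(i)).sum

  // Hypothetical stand-in for a local library: batch gradient descent
  // on the logistic loss, returning the learned weight vector.
  def trainLocal(data: Seq[Example],
                 iterations: Int = 200,
                 stepSize: Double = 0.1): Array[Double] = {
    val dim = data.head._2.length
    var w = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      val grad = Array.fill(dim)(0.0)
      data.foreach { case (label, x) =>
        val err = sigmoid(dot(w, x)) - label
        for (i <- 0 until dim) grad(i) += err * x(i)
      }
      // Step against the averaged gradient.
      w = w.indices.map(i => w(i) - stepSize * grad(i) / data.size).toArray
    }
    w
  }

  // Classify as 1.0 when the predicted probability is at least 0.5.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (sigmoid(dot(w, x)) >= 0.5) 1.0 else 0.0

  // Illustrative data only: (group, (label, features)).
  val train: Seq[(Int, Example)] = Seq(
    (1, (0.0, Array(1.0, 0.0))),
    (1, (1.0, Array(0.0, 1.0))),
    (8, (1.0, Array(1.0, 0.0))),
    (8, (0.0, Array(0.0, 1.0)))
  )

  // One independent model per group; each group's rows are a plain local Seq.
  val models: Map[Int, Array[Double]] =
    train.groupBy(_._1).map { case (group, rows) =>
      group -> trainLocal(rows.map(_._2))
    }
}
```

Inside Spark, the same function could presumably be applied per group with something like `trainRDD.groupByKey().mapValues(pts => LocalPerGroup.trainLocal(pts.map(lp => (lp.label, lp.features.toArray)).toSeq))`, on the assumption that every group fits in the memory of a single executor.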
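Alternatively, collect the distinct keys to the driver and train one model per key there:
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}

// wholeRDD is assumed to be an RDD[(Int, LabeledPoint)], like trainRDD in the question
val keys = wholeRDD.map(_._1).distinct.collect
var models = List.empty[LogisticRegressionModel]
for (key <- keys) {
  // keys is a local Array here, so this filter is issued from the driver
  val valuesForKey = wholeRDD.filter(_._1 == key)
  // train model
  val model = new LogisticRegressionWithLBFGS().run(valuesForKey.map(_._2))
  models = model :: models
}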