Online (incremental) logistic regression in Spark

Date: 2018-04-25 16:09:25

Tags: apache-spark pyspark spark-streaming apache-spark-mllib apache-spark-ml

In Spark MLlib (the RDD-based API), StreamingLogisticRegressionWithSGD supports incremental training of a logistic regression model. However, this class is deprecated and offers little functionality (for example, there is no way to access the model coefficients or the output probabilities).
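For reference, the deprecated streaming API is used roughly as follows. This is a minimal sketch: the DStream names (`trainingStream`, `testStream`) and the feature dimension are placeholders, not part of the original post.

```scala
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Assumed to exist: DStreams of labeled points built from a StreamingContext
// val trainingStream: DStream[LabeledPoint] = ...
// val testStream: DStream[LabeledPoint] = ...

val numFeatures = 3 // placeholder: must match the actual feature dimension

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures)) // start from zero weights

// Update the model incrementally on each incoming micro-batch
model.trainOn(trainingStream)

// Score a labeled test stream, keeping the true label alongside the prediction
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
```

As the question notes, this only exposes predictions; the fitted weights and class probabilities are not accessible through this API.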

In Spark ML (the DataFrame-based API), I have only found the LogisticRegression class, whose fit method performs batch training only. It does not allow saving a model, reloading it, and training it incrementally.

Needless to say, some applications would benefit from incremental learning. Is there any solution available in Spark?

1 Answer:

Answer 0: (score: 0)

In Spark ML, calling LogisticRegression.fit() returns a LogisticRegressionModel. You can add the LogisticRegression estimator to a Pipeline, fit it, and save/load the resulting PipelineModel, refitting it when new data arrives.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))
val model = pipeline.fit(data)
model.write.overwrite().save("/tmp/saved_model")
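To address the coefficient-access concern from the question, the saved pipeline can be reloaded and its fitted logistic regression stage inspected. A sketch, assuming the single-stage pipeline saved above (the stage index 0 follows from that assumption):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// Reload the fitted pipeline saved earlier
val loaded = PipelineModel.load("/tmp/saved_model")

// The logistic regression is the first (and only) stage in this pipeline
val lrModel = loaded.stages(0).asInstanceOf[LogisticRegressionModel]

println(lrModel.coefficients) // fitted weights
println(lrModel.intercept)    // fitted intercept
```

Unlike the deprecated streaming class, LogisticRegressionModel also emits a `probability` column when applied via transform().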

If you want to train the model on streaming data, or apply it to streaming data, you can define a Structured Streaming DataFrame and pass it through the pipeline.

For example (taken from the Spark docs):

// Read all the csv files written atomically in a directory
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify the schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
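A previously fitted PipelineModel can then be applied to this streaming DataFrame. A hedged sketch: it assumes `csvDF` carries the feature columns the pipeline expects, and uses a console sink purely for illustration.

```scala
// model: the PipelineModel fitted (or reloaded) earlier
val predictions = model.transform(csvDF)

// Write scored rows to the console as each micro-batch arrives
val query = predictions.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```

Note that transform() only scores the stream; fitting still happens in batch, so "incremental" here means periodically refitting and reloading the saved pipeline rather than true online weight updates.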