Spark 2.x - Running Logistic Regression with word2vec or HashingTF

Date: 2018-03-09 20:41:52

Tags: apache-spark logistic-regression word2vec

I am running Logistic Regression using the following code from https://spark.apache.org/docs/2.2.0/ml-pipeline.html (Example: Pipeline).

Original code from the link:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.transform(test).show()
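
A minimal variant of the last line above (reusing the same model value) that also prints the predicted label and class probabilities for each test row; "probability" and "prediction" are the default output columns of LogisticRegression:

// Same transform as above, but selecting the prediction columns explicitly.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .show(truncate = false)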

Two things happened - could someone help explain them?

  1. If I reduce setNumFeatures to 10, i.e. setNumFeatures(10), the algorithm predicts id 5 in the test set as 1. I think this may be because of hash collisions (a small check of the hash buckets is sketched at the end of this question).
  2. When I changed my code to use Word2Vec instead of HashingTF:

    import org.apache.spark.ml.feature.Word2Vec

    // Same pipeline as above, with Word2Vec instead of HashingTF producing the features.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

    val word2Vec = new Word2Vec().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
      .setVectorSize(1000).setMinCount(0)

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    val pipeline = new Pipeline().setStages(Array(tokenizer, word2Vec, lr))

    val model = pipeline.fit(training)

    model.transform(test).show()

    Even with vectorSize 1000 this still predicts id 5 as 1. I also noticed that the "features" column for id=5 is all zeros. When I change the test data to the following, it predicts correctly (a sketch for inspecting the feature vectors follows this code block):

    val test = spark.createDataFrame(Seq(
        (4L, "spark i j k"),
        (5L, "l d"),
        (6L, "spark hadoop spark"),
        (7L, "apache hadoop")
      )).toDF("id", "text")
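
    To look at those feature vectors directly, a minimal sketch (reusing the tokenizer, word2Vec, training and test values defined above; featurePipeline and featureModel are illustrative names) is to fit only the feature stages and count the non-zero entries per test row:

    // Fit tokenizer + word2Vec without the classifier so the averaged word
    // vectors can be inspected; a row whose words never appeared in the
    // training text comes out as the all-zero vector.
    val featurePipeline = new Pipeline().setStages(Array(tokenizer, word2Vec))
    val featureModel = featurePipeline.fit(training)

    featureModel.transform(test).select("id", "text", "features").collect().foreach {
      case Row(id: Long, text: String, features: Vector) =>
        println(s"id=$id  text='$text'  nonZeroFeatures=${features.numNonzeros}")
    }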

    Questions:

    1. What is the best way to run LogisticRegression when my test data may contain words that do not appear in the training data?
    2. In such cases, would HashingTF be a better choice than Word2Vec?
    3. What is the logic behind choosing setNumFeatures and setVectorSize?
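
    For reference, a small check of the hash buckets (singleWords, wordsDF and tinyTF are just illustrative names): HashingTF hashes every term, seen or unseen, into one of numFeatures buckets, so with setNumFeatures(10) it is easy to see which words land on the same index and therefore collide:

    // Hash each word on its own into 10 buckets; identical indices in the
    // resulting sparse vectors mean those words collide under setNumFeatures(10).
    val singleWords = Seq("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",
      "l", "m", "n", "spark", "hadoop", "mapreduce", "apache")
    val wordsDF = spark.createDataFrame(singleWords.map(w => Tuple1(Seq(w)))).toDF("words")

    val tinyTF = new HashingTF().setNumFeatures(10).setInputCol("words").setOutputCol("bucket")
    tinyTF.transform(wordsDF).show(false)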

0 Answers:

No answers.