Spark HashingTF Results Explained

Date: 2016-12-14 22:13:07

Tags: apache-spark apache-spark-mllib apache-spark-ml

I tried the standard Spark HashingTF example on Databricks.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
display(featurizedData)

I am not satisfied with my understanding of the results below (please see the image) when numFeatures is 20:

[0,20,[0,5,9,17],[1,1,1,2]]
[0,20,[2,7,9,13,15],[1,1,3,1,1]]
[0,20,[4,6,13,15,18],[1,1,1,1,1]]

If [0,5,9,17] are hash values and [1,1,1,2] are frequencies, then 17 has frequency 2, 9 has 3 (when it should have 2), and 13 and 15 have 1 when they should have 2.

Perhaps I am missing something. I could not find any documentation with a detailed explanation.

2 Answers:

Answer 0 (score: 1):

Your guess is correct:

  • 20 is the vector size
  • the first list is the list of indices
  • the second list is the list of values

The leading 0 is just an artifact of the internal representation.

There is nothing more to it than that.
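To make the sparse format concrete, a row such as (20,[0,5,9,17],[1,1,1,2]) can be expanded by hand. The sketch below is plain Scala with no Spark dependency; the toDense helper is hypothetical, written only to show how indices and values pair up:

```scala
// Expand a Spark-style sparse vector (size, indices, values) into a dense array.
// `toDense` is a hypothetical helper for illustration, not part of Spark's API.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)                        // start with all zeros
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// The first row from the question: size 20, ones at indices 0, 5, 9 and a 2 at 17.
val dense = toDense(20, Array(0, 5, 9, 17), Array(1.0, 1.0, 1.0, 2.0))
println(dense.mkString(","))
```

Every index not listed in the indices array stays 0, which is why only the non-zero slots are stored.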

Answer 1 (score: 1):

As mcelikkaya notes, the output frequencies are not what you would expect. This is due to hash collisions when the number of features is small, 20 in this case. I added some words to the input data (for illustration purposes) and increased numFeatures to 20,000, and the correct frequencies are then produced:

+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|label|sentence                                                 |words                                                                    |rawFeatures                                                                           |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|0    |Hi hi hi hi I i i i i heard heard heard about Spark Spark|[hi, hi, hi, hi, i, i, i, i, i, heard, heard, heard, about, spark, spark]|(20000,[3105,9357,11777,11960,15329],[2.0,3.0,1.0,4.0,5.0])                           |
|0    |I i wish Java could use case classes spark               |[i, i, wish, java, could, use, case, classes, spark]                     |(20000,[495,3105,3967,4489,15329,16213,16342,19809],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0])|
|1    |Logistic regression models are neat                      |[logistic, regression, models, are, neat]                                |(20000,[286,1193,9604,13138,18695],[1.0,1.0,1.0,1.0,1.0])                             |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
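The collision effect can be reproduced without Spark at all. The sketch below uses plain Scala with Java's String.hashCode modulo the table size as a stand-in hash (Spark's actual hash function differs; recent versions use MurmurHash3), but the bucketing mechanics are the same: with only 20 buckets, distinct words land in the same slot and their counts merge, while with 20,000 buckets each word gets its own slot.

```scala
// Toy model of HashingTF bucketing: hash each term, take it modulo numFeatures.
// String.hashCode is a stand-in for Spark's actual hash function; the collision
// behavior for small tables is the same regardless of which hash is used.
def bucket(term: String, numFeatures: Int): Int = {
  val raw = term.hashCode % numFeatures
  if (raw < 0) raw + numFeatures else raw    // keep the index non-negative
}

val words = Seq("hi", "i", "heard", "about", "spark")
val small = words.map(bucket(_, 20))      // several words share a bucket
val large = words.map(bucket(_, 20000))   // each word gets its own bucket
println(s"20 buckets:    $small")
println(s"20000 buckets: $large")
```

This is why the question's raw output shows a frequency of 2 or 3 for a word that appears only once: two different words hashed to the same index, and their counts were added together.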