How to compute TF-IDF on a grouped Spark DataFrame in Scala?

Asked: 2017-08-14 14:04:38

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-mllib

I am using the Spark ML API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) to compute TF-IDF on a DataFrame. What I have not been able to do is run it on grouped data: group the DataFrame with groupBy, compute TF-IDF separately for each group, and get a single DataFrame back as the result.

Sample input

id | category       | texts
 0 | smallLetters   | Array("a", "b", "c")
 1 | smallLetters   | Array("a", "b", "b", "c", "a")
 2 | capitalLetters | Array("A", "B", "C")
 3 | capitalLetters | Array("A", "B", "B", "c", "A")

Sample output, grouped by the "category" column

id | category       | texts                           | vector
0  | smallLetters   | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
1  | smallLetters   | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])
2  | capitalLetters | Array("A", "B", "C")            | (3,[3,4,5],[1.0,1.0,1.0])
3  | capitalLetters | Array("A", "B", "B", "c", "A")  | (5,[3,4,2],[2.0,2.0,1.0])

Following the example on the Spark website, what I currently have looks like this:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
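Note that CountVectorizer only produces raw term counts (the TF part); to get TF-IDF these counts still have to be rescaled with the IDF estimator from the same package. A minimal sketch continuing from the cvModel above (the output column name "tfidf" is my own choice):

```scala
import org.apache.spark.ml.feature.IDF

// "features" holds the raw term counts produced by CountVectorizer.
val featurized = cvModel.transform(df)

// IDF is an estimator: fit it on the counted corpus, then transform
// to rescale each count by its inverse document frequency.
val idfModel = new IDF()
  .setInputCol("features")
  .setOutputCol("tfidf")
  .fit(featurized)

idfModel.transform(featurized).select("id", "tfidf").show(false)
```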

The problem I now face is how to use the code above to compute TF-IDF after a groupBy on category.

Edit: I want the corpus to be defined by the grouping. That is, smallLetters is one corpus and capitalLetters is another, so for the TF-IDF computation the smallLetters corpus contains 2 documents and the capitalLetters corpus contains 2 documents.
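One way to treat each category as its own corpus: Spark has no built-in per-group fit, so the sketch below (an assumption on my part, not the asker's code; the column names "texts"/"vector" and the helper name are mine) collects the distinct category values to the driver, fits a separate CountVectorizer + IDF for each group, and unions the transformed groups back into one DataFrame:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Fit one TF-IDF model per category and union the results.
// Assumes the number of distinct categories is small enough to loop over.
def tfidfPerGroup(df: DataFrame): DataFrame = {
  val categories =
    df.select("category").distinct.collect.map(_.getString(0))

  val perGroup = categories.map { cat =>
    val group = df.filter(col("category") === cat)

    // Vocabulary and document frequencies come only from this group,
    // so each category really is its own corpus.
    val cvModel = new CountVectorizer()
      .setInputCol("texts")
      .setOutputCol("rawFeatures")
      .fit(group)

    val counted = cvModel.transform(group)

    val idfModel = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("vector")
      .fit(counted)

    idfModel.transform(counted).drop("rawFeatures")
  }

  perGroup.reduce(_ union _)
}
```

One caveat with this approach: each group gets its own vocabulary, so index i in a smallLetters vector and index i in a capitalLetters vector refer to different terms (unlike the sample output above, which assumed one shared index space across groups). If comparable indices are needed, a CountVectorizerModel with an a-priori vocabulary covering all groups could be shared instead.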

0 Answers:

No answers yet.