I am using the Spark ML API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) to compute TF-IDF on a DataFrame. What I cannot figure out is how to do this on grouped data: group the DataFrame with groupBy, compute TF-IDF within each group, and get a single DataFrame back as the result.
Example input:
id | category | texts
0 | smallLetters | Array("a", "b", "c")
1 | smallLetters | Array("a", "b", "b", "c", "a")
2 | capitalLetters | Array("A", "B", "C")
3 | capitalLetters | Array("A", "B", "B", "c", "A")
Example output, grouped by the column "category":
id | category | texts | vector
0 | smallLetters | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | smallLetters | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
2 | capitalLetters | Array("A", "B", "C") | (3,[3,4,5],[1.0,1.0,1.0])
3 | capitalLetters | Array("A", "B", "B", "c", "A") | (5,[3,4,2],[2.0,2.0,1.0])
Following the example on the Spark website, I currently have something like this:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
The problem I am now facing is how to compute the TF-IDF with the code above after a groupBy on the category column.
Edit: I want the corpus to be defined by the grouped data. That is, smallLetters is one corpus and capitalLetters is another, so for the TF-IDF computation the smallLetters corpus contains 2 documents and the capitalLetters corpus contains 2 documents.
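To make the intended per-group semantics concrete, here is a minimal plain-Scala sketch (no Spark; `Doc`, `vocabByCategory`, and `vectors` are names I made up for illustration) of what "one vocabulary per category" means: each category is treated as its own corpus with its own vocabulary, and each document's terms are counted against the vocabulary of its own category only.

```scala
// Minimal sketch (no Spark): one vocabulary per category, term counts per document.
case class Doc(id: Int, category: String, texts: Seq[String])

val docs = Seq(
  Doc(0, "smallLetters", Seq("a", "b", "c")),
  Doc(1, "smallLetters", Seq("a", "b", "b", "c", "a")),
  Doc(2, "capitalLetters", Seq("A", "B", "C")),
  Doc(3, "capitalLetters", Seq("A", "B", "B", "c", "A"))
)

// Each category is its own corpus, so it gets its own (sorted) vocabulary.
val vocabByCategory: Map[String, Seq[String]] =
  docs.groupBy(_.category).map { case (cat, ds) =>
    cat -> ds.flatMap(_.texts).distinct.sorted
  }

// Count each document's terms against its own category's vocabulary.
val vectors: Map[Int, Seq[Double]] = docs.map { d =>
  val counts = d.texts.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }
  d.id -> vocabByCategory(d.category).map(t => counts.getOrElse(t, 0.0))
}.toMap

println(vectors)
```

A Spark version would follow the same shape: fit one CountVectorizerModel (and IDF model) per category and union the transformed per-group DataFrames back together.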