How to vectorize tweets with Spark's MLlib?

Time: 2018-07-29 15:12:46

Tags: apache-spark vector twitter k-means apache-spark-mllib

I want to turn tweets into vectors for machine learning, so that I can group them by content using Spark's K-Means clustering. For example, all tweets related to Amazon would end up in one cluster.

I have tried splitting the tweets into words and creating vectors with HashingTF, but that was not very successful.
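A simplified sketch of that attempt (tweets is a placeholder DataFrame with a text column):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Lowercase each tweet and split it on whitespace
val tokenizer = new Tokenizer()
 .setInputCol("text")
 .setOutputCol("words")

// Hash each word into a fixed-size term-frequency vector
val hashingTF = new HashingTF()
 .setInputCol("words")
 .setOutputCol("features")
 .setNumFeatures(1000)

val featurized = hashingTF.transform(tokenizer.transform(tweets))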

Are there other ways to vectorize tweets?

1 Answer:

Answer 0 (score: 1):

You can try this pipeline:

First, tokenize the input tweet (located in the text column). Basically, this creates a new column, rawWords, containing the list of words taken from the original text. To get these words, it matches runs of alphanumeric characters rather than splitting on gaps (.setPattern("\w+") combined with .setGaps(false)).

import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
 .setInputCol("text")
 .setOutputCol("rawWords")
 .setPattern("\\w+")
 .setGaps(false)

Second, you may consider removing stop words to drop the less significant words in the text, such as a, the, of, etc.

import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
 .setInputCol("rawWords")
 .setOutputCol("words")

Now it's time to vectorize the words column. In this example I'm using the CountVectorizer, which is quite basic. There are many others, such as the TF-IDF vectorizer. You can find more information here.

I've configured the CountVectorizer so that it builds a vocabulary of 10,000 words, where each word must appear in at least 5 documents (minDF) and at least once within a document to be counted (minTF):

import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
 .setInputCol("words")
 .setOutputCol("features")
 .setVocabSize(10000)
 .setMinDF(5.0)
 .setMinTF(1.0)
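If you would rather try a TF-IDF representation, one option (a sketch, not the only way) is to replace countVectorizer with a HashingTF stage followed by an IDF stage; the column names here mirror the ones above:

import org.apache.spark.ml.feature.{HashingTF, IDF}

// Term frequencies via feature hashing
val hashingTF = new HashingTF()
 .setInputCol("words")
 .setOutputCol("rawFeatures")
 .setNumFeatures(10000)

// Down-weight terms that appear in many documents
val idf = new IDF()
 .setInputCol("rawFeatures")
 .setOutputCol("features")

Both stages would then go into the pipeline below in place of countVectorizer.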

import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
 .setStages(Array(
   tokenizer,
   stopWordsRemover,
   countVectorizer))

transformPipeline.fit(training).transform(test)

Finally, just create the pipeline, fit it on your dataset, and transform the data with the model the pipeline produces.
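From there you can run K-Means on the features column, which is what the question is after. A sketch, assuming the transformed training data is kept in a DataFrame named vectorized and that k = 10 is just a starting value to tune:

import org.apache.spark.ml.clustering.KMeans

val vectorized = transformPipeline.fit(training).transform(training)

// Cluster the tweet vectors into k groups
val kmeans = new KMeans()
 .setK(10)
 .setSeed(1L)
 .setFeaturesCol("features")
 .setPredictionCol("cluster")

val model = kmeans.fit(vectorized)
val clustered = model.transform(vectorized)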


Hope it helps.