I want to turn tweets into vectors for machine learning, so that I can classify them by content using Spark's K-Means clustering. For example, all tweets related to Amazon would end up in one category.
I have already tried splitting the tweets into words and creating vectors with HashingTF, but that wasn't very successful.
Are there other ways to vectorize tweets?
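For reference, roughly what my HashingTF attempt looked like (a sketch based on my description above; the column names and feature count are illustrative):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split tweets on whitespace, then hash the words into a fixed-size count vector
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)  // the default hash space; collisions may blur clusters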
Answer 0 (score: 1)
You can try this pipeline:
First, tokenize the input tweet (located in the column text). Basically, this creates a new column rawWords containing the list of words taken from the original text. To extract these words, it matches alphanumeric tokens rather than splitting on gaps (.setPattern("\w+") with .setGaps(false)).
import org.apache.spark.ml.feature.RegexTokenizer

// Extract word tokens from the raw tweet text
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("rawWords")
  .setPattern("\\w+")  // match word characters instead of splitting on gaps
  .setGaps(false)
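If it helps, a quick check of what the tokenizer produces (a sketch assuming a SparkSession named spark; note that RegexTokenizer lowercases its input by default):

import spark.implicits._

val sample = Seq("Amazon Prime delivery was fast!").toDF("text")
tokenizer.transform(sample).select("rawWords").show(false)
// +------------------------------------+
// |rawWords                            |
// +------------------------------------+
// |[amazon, prime, delivery, was, fast]|
// +------------------------------------+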
Secondly, you may consider removing stop words to filter out less significant words in the text, such as a, the, of, etc.
import org.apache.spark.ml.feature.StopWordsRemover

// Drop common English stop words from the token list
val stopWordsRemover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")
Now it's time to vectorize the words column. In this example I'm using the CountVectorizer, which is quite basic. There are many others, such as a TF-IDF vectorizer (sketched after the CountVectorizer configuration below); you can find more information in the Spark ML feature documentation.
I've configured the CountVectorizer below so that it builds a vocabulary of at most 10,000 words, where a word must appear in at least 5 documents (minDF) and at least once within a document (minTF) to be counted:

import org.apache.spark.ml.feature.CountVectorizer

// Convert each word list into a sparse vector of term counts
val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(10000)  // keep at most 10,000 distinct terms
  .setMinDF(5.0)        // a term must appear in at least 5 documents
  .setMinTF(1.0)        // within a document, count terms appearing at least once
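As for the TF-IDF alternative mentioned above, a minimal sketch (not part of the original setup): configure the CountVectorizer with .setOutputCol("rawFeatures") instead, and add an IDF stage that rescales the raw counts into the final features column:

import org.apache.spark.ml.feature.IDF

// Down-weight terms that appear in many documents
// (assumes countVectorizer was configured with .setOutputCol("rawFeatures"))
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

Since IDF is an estimator, it can simply be appended as a fourth stage of the pipeline below.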
import org.apache.spark.ml.Pipeline

// Chain the three stages into a single transformation pipeline
val transformPipeline = new Pipeline()
  .setStages(Array(
    tokenizer,
    stopWordsRemover,
    countVectorizer))

transformPipeline.fit(training).transform(test)
Finally, just create the pipeline, fit it by passing in your dataset, and transform your data with the model the pipeline produces, as in the last line above.
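To close the loop with the K-Means goal from the question, a minimal sketch of clustering the vectorized tweets (k = 10 and the cluster column name are arbitrary example choices):

import org.apache.spark.ml.clustering.KMeans

// Cluster the tweet vectors; each row gets a cluster id in the "cluster" column
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setPredictionCol("cluster")
  .setK(10)
  .setSeed(1L)

val vectorized = transformPipeline.fit(training).transform(training)
val clustered = kmeans.fit(vectorized).transform(vectorized)
clustered.select("text", "cluster").show(false)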
Hope it helps.