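I am trying to implement word vectorization using Spark's MLlib. I am following the example given here.
I have a bunch of sentences that I want to use as input to train the model, but I am not sure whether the model expects sentences or just all the words as a single sequence of strings.
My input looks like this: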
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ...
But when I try to train my word2vec model on this input, it does not work.
scala> val word2vec = new Word2Vec()
word2vec: org.apache.spark.mllib.feature.Word2Vec = org.apache.spark.mllib.feature.Word2Vec@51567040
scala> val model = word2vec.fit(v)
java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. You may need to check the setting of minCount, which could be large enough to remove all your words in sentences.
Does Word2Vec not take sentences as input?
Answer 0 (score: 4)
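Your input is correct. However, Word2Vec automatically removes from the vocabulary any word that does not occur at least a minimum number of times (counted across all sentences combined). By default this minimum is 5. In your case, it is most likely that no word in your data occurs 5 or more times, which leaves the vocabulary empty and triggers the error above.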
To change the minimum required number of word occurrences, use setMinCount(), for example with a minimum count of 2:
val word2vec = new Word2Vec().setMinCount(2)
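To see why the vocabulary can end up empty, here is a rough plain-Scala sketch of the minimum-count filter described above. It does not use Spark at all: the `vocabulary` helper and the sample sentences are made up for illustration, and it only approximates the idea of counting each word across all sentences and keeping those that reach the threshold.

```scala
// Hypothetical helper, not part of Spark: approximates the minCount filter
// by counting every word across all sentences and dropping the rare ones.
def vocabulary(sentences: Seq[Seq[String]], minCount: Int): Map[String, Int] =
  sentences.flatten                 // pool the words of all sentences
    .groupBy(identity)              // group identical words together
    .map { case (word, occurrences) => word -> occurrences.size }
    .filter { case (_, count) => count >= minCount }

// Made-up sample data; each inner Seq is one tokenized sentence.
val sentences = Seq(
  Seq("big", "baller", "shoe"),
  Seq("big", "shoe", "sale"),
  Seq("shoe", "store")
)

// With a threshold of 5 no word in this tiny corpus survives.
val strictVocab = vocabulary(sentences, 5)

// With a threshold of 2 only the repeated words remain.
val looseVocab = vocabulary(sentences, 2)
```

With a small corpus, a threshold of 5 can easily filter out every word, which would match the "vocabulary size should be > 0" error in the question.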