Question

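I am trying to train a word2vec model using Spark's implementation. I am following the tutorial in the Spark documentation, but I keep getting the error "data should be an RDD of list of string".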

My data has already been processed to remove capital letters, stop words, and punctuation. Here is a sample:

[Row(removed=['manakamana', 'doesnt', 'answer', 'questions', 'yet', 'makes', 'point', 'nepal', 'like', 'rest', 'planet', 'picturesque', 'far', 'peaceable', 'kingdom']),
 Row(removed=['wilfully', 'offensive', 'powered', 'chestthumping', 'machismo', 'good', 'clean', 'fun']),
 Row(removed=['difficult', 'imagine', 'material', 'wrong', 'spade', 'lost', 'found']),
 Row(removed=['despite', 'gusto', 'star', 'brings', 'role', 'hard', 'ride', 'shotgun', 'hectors', 'voyage', 'discovery'])]

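What confuses me is that the documentation says word2vec accepts a pyspark.sql.dataframe as input, which is what my dataset is. On the other hand, I am also being told that it should be an RDD. I also tried the following code: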

removed2 = removed.rdd

But I still get the same error ("expected zero arguments for construction of ClassDict"). I even tried setting it to

removed3 = removed.rdd.map(list)

which gives

[[['manakamana',
   'doesnt',
   'answer',
   'questions',
   'yet',
   'makes',
   'point',
   'nepal',
   'like',
   'rest',
   'planet',
   'picturesque',
   'far',
   'peaceable',
   'kingdom']],
 [['wilfully',
   'offensive',
   'powered',
   'chestthumping',
   'machismo',
   'good',
   'clean',
   'fun']],
 [['difficult', 'imagine', 'material', 'wrong', 'spade', 'lost', 'found']]

But now I get the error "java.util.ArrayList cannot be cast to java.lang.String".
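A note on the nesting in the output above: it seems to come from how `map(list)` treats each row. The sketch below uses plain Python with a `namedtuple` standing in for `pyspark.sql.Row` (an assumption made so that it runs without Spark; the sample words are taken from the data above). It only illustrates the shape of the records and is not a verified fix.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption for illustration). A real Row is
# also a tuple subclass whose fields can be read by name or by position.
Row = namedtuple("Row", ["removed"])

rows = [
    Row(removed=["wilfully", "offensive", "powered"]),
    Row(removed=["difficult", "imagine", "material"]),
]

# What map(list) does to each row: list(row) collects the row's field values,
# so a one-field row becomes a list holding a single list of strings.
nested = [list(row) for row in rows]

# Reading the field itself gives one flat list of strings per record, which is
# the "list of string" shape named in the error message.
flat = [row.removed for row in rows]

print(nested[0])  # [['wilfully', 'offensive', 'powered']]
print(flat[0])    # ['wilfully', 'offensive', 'powered']
```

On the actual data, the analogous step would presumably be `removed.rdd.map(lambda row: row.removed)`, although that has not been run against this dataset.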

There also seems to be another problem: the GitHub example https://github.com/apache/spark/blob/master/examples/src/main/python/ml/word2vec_example.py no longer works (it also fails with the "RDD of list of string" error).

I am very confused about the input format that word2vec accepts.

Error: data should be an RDD of list of string, but my input data seems correct (training word2vec in PySpark)

0 answers: