Question

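I am using pyspark 1.6.3 through Zeppelin with python 3.5.

I am trying to implement Latent Dirichlet Allocation using the pyspark CountVectorizer and LDA functions. First, the problem: here is the code I am using. Let df be a Spark dataframe with tokenized text in a column named 'tokenized'.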

from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA

vectors = 'vectors'
cv = CountVectorizer(inputCol = 'tokenized', outputCol = vectors)
model = cv.fit(df)
df = model.transform(df)

corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
ldaModel = LDA.train(corpus, k=25)

This code is taken more or less from the pyspark API docs. On the call to LDA, I get the following error:

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

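The internet tells me that this is caused by a type mismatch.

So let's look at the types for LDA and for CountVectorizer. From the Spark docs, here is another example of sparse vectors going into LDA: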

>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)

I implemented this myself, and this is what the rdd looks like:

>> testrdd.take(2)

[[1, DenseVector([0.0, 1.0])], [2, SparseVector(2, {0: 1.0})]]

On the other hand, if I go back to my original code and look at corpus, the rdd built from the CountVectorizer output, I see (edited to remove extraneous bits):

>> corpus.take(3)

[[0, Row(vectors=SparseVector(130593, {0: 30.0, 1: 13.0, ...
 [1, Row(vectors=SparseVector(130593, {0: 52.0, 1: 44.0, ...
 [2, Row(vectors=SparseVector(130593, {0: 14.0, 1: 6.0, ...
]

So the example I used (from the docs!) does not produce a tuple of (index, SparseVector), but (index, Row(SparseVector))... or something?

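Questions:

Is the Row wrapper around the SparseVector what is causing this error?
If so, how do I get rid of the Row object? Row is a property of the df, but I used df.rdd to convert to an rdd; what else do I need to do?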

Answer 1

That is likely the problem. Just extract the vectors from the Row object.

corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]['vectors']]).cache()
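
To see what that one-liner changes without starting a cluster, here is a minimal sketch in plain Python. FakeRow and to_lda_pair are hypothetical names introduced only for illustration: FakeRow is a namedtuple standing in for pyspark.sql.Row (which is also a tuple subclass), and the strings stand in for the SparseVectors. The sketch uses positional access (row[0]), which should behave the same as x[0]['vectors'] when only one column is selected.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
# Like a real Row, it is a tuple subclass, so positional access row[0] works.
FakeRow = namedtuple("FakeRow", ["vectors"])


def to_lda_pair(row_and_index):
    """Turn one (Row, index) element from zipWithIndex() into [index, vector]."""
    row, index = row_and_index
    return [index, row[0]]  # row[0] is the single selected column


# Shape of what zipWithIndex() yields for a one-column selection.
# The strings are placeholders for the SparseVectors.
zipped = [(FakeRow(vectors="vec0"), 0), (FakeRow(vectors="vec1"), 1)]

# The question's lambda keeps the Row wrapper in the second position.
wrapped = [[x[1], x[0]] for x in zipped]

# Unwrapping leaves only the vector next to its index.
unwrapped = [to_lda_pair(x) for x in zipped]

print(wrapped[0])
print(unwrapped[0])
```

With Spark, the same function could presumably be passed straight to map, as in df.select(vectors).rdd.zipWithIndex().map(to_lda_pair).cache().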

How to pass SparseVectors to `mllib` in pyspark
