Summing DenseVectors after groupByKey() works in the PySpark shell but not with spark-submit

Date: 2016-06-28 08:50:24

Tags: python apache-spark pyspark

Here is some sample code showing what I am trying to do:

First, I build sentence feature vectors using Word2Vec:

from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = sqlContext.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
result = model.transform(documentDF)

Then I convert the output to an RDD and attach a sample id (here, 1 for every row) as the key:

# Each element of the RDD is a Row, so x[0] extracts the DenseVector
result_rdd = result.select("result").rdd
rdd_with_sample_ids_attached = result_rdd.map(lambda x: (1, x[0]))
rdd_with_sample_ids_attached.collect()

Output: [(1, DenseVector([0.0472, -0.0078, 0.0377])), (1, DenseVector([-0.0253, -0.0171, 0.0664])), (1, DenseVector([0.0101, 0.0324, 0.0158]))]

Now I do a groupByKey() and compute the sum of the DenseVectors in each group, like this:

rdd_sum = rdd_with_sample_ids_attached.groupByKey().map(lambda x: (x[0], sum(x[1])))
rdd_sum.collect()

Output: [(1, DenseVector([0.0319, 0.0075, 0.1198]))]
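
(For reference, a minimal alternative sketch, assuming the same rdd_with_sample_ids_attached as above: the per-key sum can also be expressed with reduceByKey, which adds the vectors pairwise instead of collecting each group first.)

from operator import add

# Sketch only: DenseVector supports element-wise "+", so operator.add can serve
# as the combine function and the vectors are summed pairwise per key.
rdd_sum_alt = rdd_with_sample_ids_attached.reduceByKey(add)
rdd_sum_alt.collect()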

As shown, this code works perfectly in the pyspark shell. However, when I run the same code with spark-submit, I get the following error:

File "/mnt1/yarn/usercache/hadoop/appcache/application_1465567204576_0170/container_1465567204576_0170_01_000002/pyspark.zip/pyspark/sql/functions.py", line 39, in _
   jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

I tried repartitioning the RDD into a single partition and got the same error. Any help?

1 Answer:

Answer 0 (score: 1)

Figured it out! The problem was that my script contained the following import:

from pyspark.sql.functions import *

This imports a sum() function that shadows Python's built-in sum(). When I removed the import, the code worked correctly. While the built-in Python sum() can add DenseVectors, the sum() imported from pyspark.sql.functions cannot.
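
A minimal sketch of the safer import pattern (assuming the same rdd_with_sample_ids_attached as in the question): keep pyspark.sql.functions behind an alias so its sum() can never shadow the built-in one.

import pyspark.sql.functions as F  # aliased instead of "from pyspark.sql.functions import *"

# The built-in sum() folds each group with "+", which DenseVector supports,
# so the vectors are added element-wise per key.
rdd_sum = rdd_with_sample_ids_attached.groupByKey().map(lambda x: (x[0], sum(x[1])))

# F.sum is still available for DataFrame aggregations through the alias, but it
# builds a JVM-backed Column expression; calling it on a plain Python iterable
# inside an executor-side lambda is consistent with the
# "'NoneType' object has no attribute '_jvm'" traceback above.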