PySpark 2.2.0: 'numpy.ndarray' object has no attribute 'indices'

Asked: 2019-03-07 22:03:34

Tags: python pyspark

Task

I am using the Python API for Spark (PySpark) to compute the size of the __indices__ of a __SparseVector__.

Script

def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols = dataframe.drop("documento").columns, outputCol = "variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
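The error that follows can be reproduced without Spark at all: a plain numpy array has no `indices` attribute, unlike a `SparseVector`, so the expression inside the map fails as soon as the value it receives is an ordinary array. A minimal sketch:

```python
import numpy as np

# Unlike pyspark.ml.linalg.SparseVector, a plain numpy array has no
# `indices` attribute, so this raises the same AttributeError that
# the question reports.
arr = np.array([11.0, 2.0, 0.0, 0.0])
try:
    arr.indices.size
except AttributeError as exc:
    print(exc)  # 'numpy.ndarray' object has no attribute 'indices'
```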

Problem

When I execute the __.count()__ operation on the __count_variables__ dataframe, I get the error:

AttributeError: 'numpy.ndarray' object has no attribute 'indices'

The main part to consider is:

data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])

I believe this block is related to the error, but I cannot understand why the exception mentions __numpy.ndarray__ if I am doing the computation by mapping a __lambda expression__ whose argument is the __SparseVector__ created by the __assembler__.

Any suggestions? Does anyone know what I am doing wrong?

1 Answer:

Answer 0 (score: 1)

There are two problems here. The first one is in the __indices.size__ call: __indices__ and __size__ are two different attributes of the __SparseVector__ class. __size__ is the complete size of the vector, while __indices__ holds the positions of its non-zero values, so the count you want is the number of entries in __indices__. So, assuming that all your vectors are instances of the __SparseVector__ class:

from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0,1,2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])

df.show()

+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|           (4,[],[])|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+

The solution is the __len__ function:

df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices)))\
       .toDF(["documento", "variables", "frecuencia"])

df.show()

+---------+--------------------+----------+
|documento|           variables|frecuencia|
+---------+--------------------+----------+
|        0|(4,[0,1],[11.0,2.0])|         2|
|        1|           (4,[],[])|         0|
|        3|(4,[0,1,2],[2.0,2...|         3|
+---------+--------------------+----------+

And here comes the second problem: __VectorAssembler__ does not always generate SparseVectors. Depending on which is more efficient, it can generate SparseVectors or DenseVectors, based on the number of zeros the original vector has. For example, take the following DataFrame:

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1., 1., 1., 1.])),
                            (3, Vectors.sparse(4, [0,1,2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])

df.show()

+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|   [1.0,1.0,1.0,1.0]|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+

Document 1 is a __DenseVector__, and the previous solution does not work because DenseVectors have no __indices__ attribute, so you have to use a more general representation of vectors in order to handle a DataFrame that contains both sparse and dense vectors.
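One general way to handle the mixed sparse/dense case (a sketch, not necessarily the answer's original code) is to route every vector through a plain numpy array: both __SparseVector__ and __DenseVector__ expose a toArray() method, so counting non-zero entries of that array works for either kind. The helper name count_nonzero below is my own, not part of PySpark.

```python
import numpy as np

# Hypothetical helper: count the non-zero entries of any
# pyspark.ml.linalg vector (SparseVector or DenseVector) by
# converting it to a plain numpy array first.
def count_nonzero(vector):
    return int(np.count_nonzero(vector.toArray()))

# Used inside the map from the answer, this would become:
# df = df.rdd.map(lambda x: (x[0], x[1], count_nonzero(x[1])))\
#        .toDF(["documento", "variables", "frecuencia"])
```

PySpark's vector classes also provide a numNonzeros() method, which gives the same count without the explicit conversion.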