I am calculating the size of the indices of a __SparseVector__ using the Python API for Spark (PySpark).
from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols = dataframe.drop("documento").columns, outputCol = "variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
When I execute the __.count()__ action on the __count_variables__ dataframe, this error appears:

AttributeError: 'numpy.ndarray' object has no attribute 'indices'
The main part to consider is:

data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
I believe that block is related to the error, but I cannot understand why the exception talks about __numpy.ndarray__ when I am mapping a __lambda expression__ whose argument is a __SparseVector__ (created by the __assembler__).
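For what it's worth, one way to check which type actually reaches the lambda would be a quick inspection like this (just a debugging sketch, reusing the data_transformed_rdd defined above):

# Collect the distinct Python type names of the assembled column to see whether
# the rows really carry SparseVector objects or plain numpy arrays.
print(data_transformed_rdd.map(lambda row: type(row[1]).__name__).distinct().collect())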
Any suggestions? Does anyone know what I am doing wrong?
Answer 0: (score: 1)
There are two problems here. The first one is in the indices.size call: indices and size are two different attributes of the SparseVector class. size is the complete vector size, while indices holds the positions of the non-zero values, so the number of non-zero entries comes from indices rather than from size. Assuming that all your vectors are instances of the SparseVector class, the solution is the len function.
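On a single vector, the difference between the two attributes looks like this (a small illustration, assuming only pyspark.ml.linalg):

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(4, [0, 1], [11.0, 2.0])
print(v.size)          # 4  -> full length of the vector
print(v.indices)       # [0 1]  -> positions of the non-zero values
print(len(v.indices))  # 2  -> number of non-zero entries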
For example, take the following dataframe:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| (4,[],[])|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
Counting the non-zero entries with len:

df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices)))\
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| (4,[],[])| 0|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
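Plugged back into the function from the question, the fix would look roughly like this (a sketch that keeps the original column names and assumes every assembled vector is a SparseVector):

from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols = dataframe.drop("documento").columns, outputCol = "variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables") \
        .orderBy(data_transformed.documento.asc()).rdd
    # len(row[1].indices) counts the non-zero entries of each SparseVector
    return data_transformed_rdd \
        .map(lambda row: [row[0], len(row[1].indices)]) \
        .toDF(["id", "frequency"])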
Here comes the second problem: VectorAssembler does not always generate SparseVectors. Depending on which is more efficient, it can produce either SparseVectors or DenseVectors (based on the number of zeros the original vector has). For example, suppose the following dataframe:
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1., 1., 1., 1.])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| [1.0,1.0,1.0,1.0]|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+

Document 1 is now a DenseVector, and the previous solution does not work because DenseVectors do not have an indices attribute, so you have to use a more general representation of the vectors in order to handle a DataFrame that contains both sparse and dense vectors.
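One general option is to go through numpy, which treats both vector types the same way; a minimal sketch (it assumes numpy is available on the workers, and both vector classes also provide a numNonzeros() method that returns the same count):

import numpy as np

# toArray() works for SparseVector and DenseVector alike; count the
# non-zero entries of the resulting numpy array (int() keeps Spark's
# schema inference happy).
df = df.rdd.map(lambda x: (x[0], x[1], int(np.count_nonzero(x[1].toArray())))) \
       .toDF(["documento", "variables", "frecuencia"])
df.show()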