Question

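I want to use a pyspark accumulator to add up a matrix of values inferred from an RDD. I found the documentation a bit unclear. Here is some background, in case it is relevant.
My rddData contains lists of indices, and for each of them a count has to be added to the matrix. For example, this list maps to the indices: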
[1,3,4] -> (1,1), (1,3), (1,4), (3,3), (3,4), (4,4)
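
Reading each entry of that mapping as a (row, column) pair, it looks like every unordered pair of indices, repeats included. A minimal sketch under that assumption (index_pairs is a hypothetical helper name, not part of the original code):

```python
from itertools import combinations_with_replacement


def index_pairs(indices):
    """Return every (i, j) pair with i taken no later than j in the list."""
    return list(combinations_with_replacement(indices, 2))


pairs = index_pairs([1, 3, 4])
print(pairs)
```

Note that a plain double loop over the list would also visit the mirrored pairs such as (3, 1), so it touches more cells than this mapping lists.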

Now, this is my accumulator:

from pyspark.accumulators import AccumulatorParam
class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        import numpy as np
        aaZeros = np.zeros(mInitial.shape)
        return aaZeros

    def addInPlace(self, mAdd, lIndex):
        mAdd[lIndex[0], lIndex[1]] += 1
        return mAdd

And this is my mapper function:

def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])

Then I run the data through it:

oAccumilatorMatrix = oSc.accumulator(aaZeros, MatrixAccumulatorParam())

rddData.map(populate_sparse).collect()

Now, when I look at my data:

sum(sum(oAccumilatorMatrix.value))
#= 0.0

That should not be zero. What am I missing?

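Edit: I first tried this with a sparse matrix and got the traceback below, saying that sparse matrices are not supported. I have changed the question to use a dense numpy matrix: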

...

    raise IndexError("Indexing with sparse matrices is not supported"
IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.

Answer 1

Aha! I think I've got it. At the end of the day, the accumulator still has to add its own partial results to itself. So change addInPlace to:

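def addInPlace(self, mAdd, lIndex):
    if isinstance(lIndex, list):
        # An index pair from populate_sparse: bump a single cell.
        mAdd[lIndex[0], lIndex[1]] += 1
    else:
        # Another partial matrix: merge it element-wise.
        mAdd += lIndex
    return mAdd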

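So now it increments the indexed cell when it is given a list, and once the populate_sparse loops are done it adds the partial matrices to itself to build my final matrix.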
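
To see why both branches are needed, here is a rough Spark-free sketch of the two kinds of calls addInPlace receives: per-item updates inside a task, and the merge of each task's partial matrix into the driver's copy. Everything here is an assumption for illustration: ListMatrixParam and simulate are hypothetical names, nested lists stand in for the numpy matrix, and index pairs are tuples so they can be told apart from a list-based matrix.

```python
class ListMatrixParam:
    """Hypothetical stand-in for an AccumulatorParam, using nested lists."""

    def zero(self, initial):
        # A fresh all-zero matrix with the same shape as `initial`.
        return [[0] * len(row) for row in initial]

    def addInPlace(self, acc, term):
        if isinstance(term, tuple):
            # Per-item update: an (i, j) index pair bumps one cell.
            i, j = term
            acc[i][j] += 1
        else:
            # Merge step: `term` is another partial matrix.
            for r, row in enumerate(term):
                for c, value in enumerate(row):
                    acc[r][c] += value
        return acc


def simulate(partitions, size):
    """Mimic tasks that each fill a local matrix, then merge into the driver's."""
    param = ListMatrixParam()
    driver = [[0] * size for _ in range(size)]
    for partition in partitions:
        local = param.zero(driver)  # each task starts from the zero value
        for indices in partition:
            for i in indices:
                for j in indices:
                    local = param.addInPlace(local, (i, j))
        driver = param.addInPlace(driver, local)  # matrix-with-matrix merge
    return driver


result = simulate([[[1, 3, 4]], [[0, 1]]], size=5)
total = sum(sum(row) for row in result)
print(total)
```

Without the second branch, the merge call would try to treat a whole matrix as an index pair, which is the case the original addInPlace did not handle.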
