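I represent vectors with many repeated elements using an index vector and a value vector: the index vector holds the positions at which the value changes, and the value vector holds the values at those positions.
For example, [1, 1, 1, 7, 7, 4, 4]
is represented by the index vector [0, 3, 5]
and the value vector [1, 7, 4].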
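To make the representation concrete, here is a minimal sketch of converting between a dense array and the (indices, values) pair. The helper names `encode` and `decode` are hypothetical and only for illustration. The sketch assumes a non-empty input, and `decode` takes the dense length as an argument because the representation as described does not store it.

```python
import numpy as np

def encode(dense):
    """Return (indices, values) marking where a non-empty dense vector changes value."""
    dense = np.asarray(dense)
    # A run starts at position 0 and wherever an element differs from its predecessor.
    starts = np.r_[True, dense[1:] != dense[:-1]]
    indices = np.flatnonzero(starts)
    return indices, dense[indices]

def decode(indices, values, length):
    """Expand (indices, values) back into a dense vector of the given length."""
    # Each run lasts until the next change index, or until the end of the vector.
    run_lengths = np.diff(np.r_[indices, length])
    return np.repeat(values, run_lengths)

indices, values = encode([1, 1, 1, 7, 7, 4, 4])
print(indices.tolist(), values.tolist())    # [0, 3, 5] [1, 7, 4]
print(decode(indices, values, 7).tolist())  # [1, 1, 1, 7, 7, 4, 4]
```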
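I have an algorithm for adding two such vectors, but it feels like it should be possible to do this more cleanly and faster. Is there a better way to do this in Python/NumPy?
The current algorithm merges the differences (step changes) of the two arrays and takes a cumsum at the end. However, this requires a lot of cleanup to ensure that there are no duplicate index entries and no consecutive equal values.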
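As a small trace of the idea just described, the sketch below merges the step changes of two short vectors, takes the cumulative sum, and then applies the two cleanup passes. The input data and variable names are made up for illustration. The inputs are chosen so that both problems show up: the merged indices contain duplicates, and the sum has the same value at two neighbouring change points.

```python
import numpy as np

# Two made-up vectors in (indices, values) form.
x_idx, x_val = np.array([0, 3, 5]), np.array([1, 7, 4])
y_idx, y_val = np.array([0, 3]), np.array([10, 4])

# Merge the change points; a stable sort keeps ties in a deterministic order.
merged_idx = np.r_[x_idx, y_idx]
order = np.argsort(merged_idx, kind='mergesort')

# Step changes of each vector (the first value is a step up from zero).
steps = np.r_[x_val[0], np.diff(x_val), y_val[0], np.diff(y_val)]

merged_idx = merged_idx[order]
running = np.cumsum(steps[order])
print(merged_idx.tolist())  # [0, 0, 3, 3, 5]  (duplicate indices)
print(running.tolist())     # [1, 11, 17, 11, 8]

# Cleanup 1: for duplicate indices keep only the last entry (the full sum).
keep = np.r_[merged_idx[1:] != merged_idx[:-1], True]
idx, val = merged_idx[keep], running[keep]
print(idx.tolist(), val.tolist())  # [0, 3, 5] [11, 11, 8]

# Cleanup 2: drop entries whose value equals the previous one.
keep = np.r_[True, val[1:] != val[:-1]]
idx, val = idx[keep], val[keep]
print(idx.tolist(), val.tolist())  # [0, 5] [11, 8]
```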
import numpy as np


class SparseVector:
    def __init__(self, indices, values, sanitize=False):
        self._indices = np.asanyarray(indices)
        self._values = np.asanyarray(values)
        if sanitize:
            self._sanitize()

    def __add__(self, other):
        # Merge both index arrays; the stable sort keeps ties in a fixed order
        all_indices = np.r_[self._indices, other._indices]
        args = np.argsort(all_indices, kind='mergesort')
        # Step changes of each operand (the first value is a step up from zero)
        diffs = np.r_[self._values[0], np.diff(self._values),
                      other._values[0], np.diff(other._values)]
        diffs = diffs[args]
        # The running total of the merged steps is the sum at each index
        return self.__class__(all_indices[args], np.cumsum(diffs), True)

    def _sanitize(self):
        # Remove duplicate indices, keeping the last entry of each group
        index_diffs = np.ediff1d(self._indices, to_end=1)
        changes = index_diffs != 0
        self._indices = self._indices[changes]
        self._values = self._values[changes]
        # Remove consecutive duplicate values, keeping the first entry of each run
        value_diffs = np.ediff1d(self._values, to_begin=1)
        changes = value_diffs != 0
        self._indices = self._indices[changes]
        self._values = self._values[changes]

    def __eq__(self, other):
        # np.array_equal also copes with operands of different lengths
        return (np.array_equal(self._indices, other._indices)
                and np.array_equal(self._values, other._values))


a = SparseVector([0, 1, 3, 5, 7, 10], [0, 10, 5, 8, 10, 11])
b = SparseVector([0, 2, 3, 6, 9, 10], [3, 2, 9, 7, 11, 10])
assert a + b == SparseVector([0, 1, 2, 3, 5, 6, 7, 9],
                             [3, 13, 12, 14, 17, 15, 17, 21])
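The expected result in the assertion can be cross-checked by expanding everything to dense arrays and adding those directly. This is only a verification sketch: `to_dense` is a hypothetical helper, and the dense length of 11 is an assumption (one past the largest index, 10), since the representation does not record a length.

```python
import numpy as np

def to_dense(indices, values, length):
    """Expand a change-point representation into a dense array (hypothetical helper)."""
    # Each value is repeated until the next change index or the end of the vector.
    run_lengths = np.diff(np.r_[indices, length])
    return np.repeat(values, run_lengths)

length = 11  # assumed dense length: one past the largest index used

a = to_dense([0, 1, 3, 5, 7, 10], [0, 10, 5, 8, 10, 11], length)
b = to_dense([0, 2, 3, 6, 9, 10], [3, 2, 9, 7, 11, 10], length)
expected = to_dense([0, 1, 2, 3, 5, 6, 7, 9],
                    [3, 13, 12, 14, 17, 15, 17, 21], length)

print((a + b).tolist())  # [3, 13, 12, 14, 14, 17, 15, 17, 17, 21, 21]
assert np.array_equal(a + b, expected)
```

Note that index 10 does not appear in the expected result because the sum there (21) equals the value already in effect from index 9.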