Question

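I have a numpy array that I need to consolidate by combining rows with duplicate entries (based on the first column), while keeping any positive values from the other columns. My array looks like this: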

array([[117,   0,   1,   0,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [117,   0,   0,   0,   0,   1],
       [120,   0,   1,   0,   0,   0],
       [189,   0,   0,   0,   1,   0],
       [117,   1,   0,   0,   0,   0],
       [120,   0,   0,   1,   0,   0]])

I'm trying to get the output to look like this:

array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

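I've been able to filter out the duplicates by using unique on column 0, but I can't seem to keep the values from the other columns. I'd appreciate any input!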

Answer 1

A pure NumPy solution could work like this (I've named your starting array a):

>>> b = a[np.argsort(a[:, 0])]
>>> grps, idx = np.unique(b[:, 0], return_index=True)
>>> counts = np.add.reduceat(b[:, 1:], idx)
>>> np.column_stack((grps, counts))
array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

This returns the rows in sorted order (by label).
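One caveat with summing: if the same label sets the same flag in more than one row, np.add.reduceat gives a count (2, 3, ...) rather than a 0/1 flag. If you want the flags to stay binary, a minimal sketch is to swap in np.maximum.reduceat. The function name consolidate_sorted and the small input below are made up for illustration, and the sketch assumes the first column holds the labels.

```python
import numpy as np

def consolidate_sorted(a):
    """Merge rows sharing a first-column label, keeping the largest value per column."""
    # Stable sort so rows with the same label become contiguous.
    b = a[np.argsort(a[:, 0], kind="stable")]
    # idx marks where each label's block starts in the sorted array.
    labels, idx = np.unique(b[:, 0], return_index=True)
    # maximum.reduceat takes the column-wise maximum over each block.
    merged = np.maximum.reduceat(b[:, 1:], idx, axis=0)
    return np.column_stack((labels, merged))

# Hypothetical input in which label 117 sets the same flag twice.
a = np.array([[117, 1, 0],
              [120, 0, 1],
              [117, 1, 0]])
result = consolidate_sorted(a)
```

With np.add.reduceat the first merged row here would carry a 2 in its second column; with np.maximum.reduceat it stays at 1.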

A solution in pandas is possible in fewer lines (and may use less additional memory than the NumPy method):

>>> df = pd.DataFrame(a)
>>> df.groupby(0, sort=False, as_index=False).sum().values
array([[117,   1,   1,   0,   0,   1],
       [163,   1,   0,   0,   0,   0],
       [120,   0,   1,   1,   0,   0],
       [189,   0,   0,   0,   1,   0]])

The sort=False parameter means the rows are returned in the order in which each unique label was first encountered.
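To see the effect of sort on the row order, here is a small self-contained sketch using the array from the question. The variable names first_seen and by_label are only illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[117, 0, 1, 0, 0, 0],
              [163, 1, 0, 0, 0, 0],
              [117, 0, 0, 0, 0, 1],
              [120, 0, 1, 0, 0, 0],
              [189, 0, 0, 0, 1, 0],
              [117, 1, 0, 0, 0, 0],
              [120, 0, 0, 1, 0, 0]])
df = pd.DataFrame(a)

# sort=False: groups appear in the order their label first shows up.
first_seen = df.groupby(0, sort=False, as_index=False).sum().to_numpy()

# sort=True (the default): groups are ordered by label.
by_label = df.groupby(0, sort=True, as_index=False).sum().to_numpy()
```

Here first_seen[:, 0] runs 117, 163, 120, 189, while by_label[:, 0] runs 117, 120, 163, 189.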

Answer 2

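If you don't mind the rows being reordered arbitrarily, hashing them into a dictionary will do. (Dicts only guarantee insertion order from Python 3.7 onward, so on older versions the output order is arbitrary.)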

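def consolidate(rows):
    # Map each first-column key to its merged row.
    unique = {}
    for row in rows:
        key = row[0]
        if key not in unique:
            # Copy so the caller's rows are not modified in place.
            unique[key] = list(row)
        else:
            # OR the remaining columns so any set flag is kept.
            for i in range(1, len(row)):
                unique[key][i] |= row[i]
    return list(unique.values())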

This results in:

[[120, 0, 1, 1, 0, 0],
 [163, 1, 0, 0, 0, 0],
 [117, 1, 1, 0, 0, 1],
 [189, 0, 0, 0, 1, 0]]

If you do want to preserve the row sequence, a little more work is needed:

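def consolidate(rows):
    unique = {}
    for sequence, row in enumerate(rows):
        key = row[0]
        # Prefix a copy of the row with its position in the input.
        # list(row) also avoids elementwise addition if row is a NumPy array.
        tagged = [sequence] + list(row)
        if key not in unique:
            unique[key] = tagged
        else:
            # Columns 0 and 1 hold the sequence number and the key.
            for i in range(2, len(tagged)):
                unique[key][i] |= tagged[i]
    # Sort by first-seen position, then drop the sequence number.
    return [row[1:] for row in sorted(unique.values())]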

Which now results in:

[[117, 1, 1, 0, 0, 1],
 [163, 1, 0, 0, 0, 0],
 [120, 0, 1, 1, 0, 0],
 [189, 0, 0, 0, 1, 0]]
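For larger arrays, the same first-seen ordering can probably be had without a Python-level loop. The following is only a sketch under some assumptions: the input is a 2-D integer NumPy array, the first column is the key, and the other columns are non-negative flags (the result starts from zeros). The function name consolidate_first_seen is made up for illustration.

```python
import numpy as np

def consolidate_first_seen(a):
    """Merge rows by first-column key, returning groups in first-seen order."""
    # first_idx: position of each key's first occurrence in a.
    # inverse: for every row of a, the index of its key in keys.
    keys, first_idx, inverse = np.unique(
        a[:, 0], return_index=True, return_inverse=True)
    # Start from zeros; this assumes the flag columns are non-negative.
    merged = np.zeros((len(keys), a.shape[1] - 1), dtype=a.dtype)
    # Unbuffered elementwise maximum, so repeated keys accumulate correctly.
    np.maximum.at(merged, inverse, a[:, 1:])
    # Reorder the groups by where each key first appeared.
    order = np.argsort(first_idx)
    return np.column_stack((keys[order], merged[order]))

a = np.array([[117, 0, 1, 0, 0, 0],
              [163, 1, 0, 0, 0, 0],
              [117, 0, 0, 0, 0, 1],
              [120, 0, 1, 0, 0, 0],
              [189, 0, 0, 0, 1, 0],
              [117, 1, 0, 0, 0, 0],
              [120, 0, 0, 1, 0, 0]])
result = consolidate_first_seen(a)
```

On the array from the question, result has the rows for 117, 163, 120 and 189 in that order, matching the output above.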
