Question

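I'm looking for ways to speed up (or replace) my algorithm for grouping data.

I have a list of numpy arrays. I want to generate a new numpy array such that each element of this array is the same for every index where the original arrays are the same as well (and different where this is not the case).

This sounds kind of awkward, so here is an example: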

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *

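Note that the elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays are also the same at those indices (namely 10 and 21). The same goes for the elements at indices 3 and 5 (value 3).

The algorithm has to handle an arbitrary number of (equally sized) input arrays, and also return, for each resulting number, the values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)

Here is my current algorithm:

import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings

Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.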

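I'm not sure whether my algorithm can be sped up, but I'm also not sure it is the best algorithm to begin with. Is there a better way of doing this?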

Answer 1

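Finally cracked it with a vectorized solution! It was an interesting problem. We have to tag each pair of values taken from the corresponding elements of the arrays in the list, and then tag each such pair based on its uniqueness among the other pairs. So we can use np.unique, abusing all of its optional arguments, and finally do some additional work to keep the order for the final output. The implementation is basically done in three stages -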

# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)

# Do the heavy work with np.unique to give us :
#   1. Starting indices of unique elems,
#   2. Array that has unique IDs for each element in idx, and
#   3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True,
                                        return_inverse=True,return_counts=True)

# Best part happens here : Use mask to ignore the repeated elems and re-tag 
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
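The question also asks for the values each group ID stands for, which the snippet above does not return. Below is a minimal sketch of one way to recover them from the same np.unique outputs. The function name groupify_with_meanings and its return shape are illustrative assumptions, not part of the original answer, and the sketch inherits the assumption that every input is a non-negative integer array.

```python
import numpy as np

def groupify_with_meanings(values):
    """Hypothetical helper: group IDs in order of first appearance, plus their meanings."""
    arr = np.vstack(values)
    # Encode each column as a single integer (assumes non-negative integers).
    idx = np.ravel_multi_index(arr, arr.max(axis=1) + 1)
    _, first_idx, inverse = np.unique(idx, return_index=True, return_inverse=True)

    # order[k] is the sorted-unique ID of the k-th group to appear in the input.
    order = np.argsort(first_idx)
    # rank[u] is the appearance rank of sorted-unique ID u (inverse permutation of order).
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    labels = rank[inverse]

    # Read each group's original values from the column where it first occurs.
    meanings = {g: tuple(arr[:, first_idx[u]].tolist()) for g, u in enumerate(order)}
    return labels, meanings

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
labels, meanings = groupify_with_meanings(values)
print(labels)       # [0 1 2 3 0 3 4]
print(meanings[3])  # (11, 22)
```

The rank array plays the same role as the masked argsort in the snippet above: it renumbers the sorted unique IDs by where each one first appears.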

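Runtime test

Let's compare the proposed vectorized approach against the original code. Since the proposed code only gives us the group IDs, for a fair benchmark we simply trim from the original code the parts that are not needed to produce them. So, here are the function definitions -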

def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _,unq_start_idx,unqID,count = np.unique(idx,return_index=True,
                                        return_inverse=True,return_counts=True)
    mask = ~np.in1d(unqID,np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]

Runtime results on a list with large arrays -

In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]

In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True

In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop

In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
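One caveat worth keeping in mind: np.ravel_multi_index expects non-negative integers, and the product of the per-array ranges has to fit in the index type, so the approach above may not apply to floats, negative values, or very many arrays with large values. As a hedged alternative (a different technique, not the one timed above, and not benchmarked here), np.unique can deduplicate whole rows directly with axis=0. The name groupify_rows is an assumption made for this sketch.

```python
import numpy as np

def groupify_rows(values):
    """Hypothetical variant: label indices by unique rows instead of linear indices."""
    # Shape (n, k): row i holds the i-th element of every input array.
    cols = np.column_stack(values)
    uniq, first_idx, inverse = np.unique(
        cols, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.ravel(inverse)  # guard against NumPy versions that return a 2-D inverse

    # Renumber the sorted unique rows by order of first appearance.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniq[order]

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
labels, group_values = groupify_rows(values)
print(labels)           # [0 1 2 3 0 3 4]
print(group_values[3])  # [11 22]
```

Note that np.column_stack converts all inputs to a common dtype, so this sketch assumes the arrays can share one dtype without losing information.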

Answer 2

This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:

import numpy as np

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])

# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]

# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)

# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False

# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)

# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1

print(labelled)
# outputs [ 0.  1.  2.  3.  0.  3.  4.]

With two arrays it is about 1.5x faster than your version, but I suspect the speedup should be better with more arrays.
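The broadcasting approach builds one n-by-n boolean mask per input array, so memory grows quadratically with the array length (for n = 10,000 each mask holds 10^8 booleans, roughly 100 MB). The sketch below is one possible way to wrap the same idea in a function while keeping only a single mask alive and returning integer labels. The name groupify_pairwise and the -1 sentinel are choices made for this illustration.

```python
import numpy as np

def groupify_pairwise(values):
    """Hypothetical wrapper around the pairwise-comparison idea with integer labels."""
    n = len(values[0])
    # same[i, j] ends up True when indices i and j agree in every input array.
    same = np.ones((n, n), dtype=bool)
    for val in values:
        val = np.asarray(val)
        same &= val[:, np.newaxis] == val[np.newaxis, :]

    labels = np.full(n, -1, dtype=np.int64)  # -1 marks "not labelled yet"
    next_label = 0
    for k in range(n):
        if labels[k] == -1:
            # The diagonal is True, so this also labels element k itself.
            labels[same[k]] = next_label
            next_label += 1
    return labels

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
print(groupify_pairwise(values))  # [0 1 2 3 0 3 4]
```

The quadratic mask is still there, so this only makes sense for moderately sized inputs.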

Answer 3

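The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array set operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner: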

import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)

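The above works for arbitrary type combinations. Alternatively, if values is a list of many arrays of the same dtype, the following will be even more efficient: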

unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)

Answer 4

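If I understand correctly, you are trying to hash values according to columns. It is better to convert each column into a single arbitrary value and then compute the hashes from those.

So what you actually want is to hash on list(np.array(values).T).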
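To make the "hash on the columns" idea concrete, here is a minimal sketch that uses a plain Python dictionary keyed by column tuples. It is not vectorized and is offered only as an illustration; the name groupify_hash and the returned meanings mapping are assumptions made for this example.

```python
import numpy as np

def groupify_hash(values):
    """Hypothetical helper: label each index by hashing its column of values."""
    group_of = {}  # maps a column tuple to its group ID
    labels = np.empty(len(values[0]), dtype=np.int64)
    columns = zip(*(np.asarray(v).tolist() for v in values))
    for i, column in enumerate(columns):
        # A new column gets the next free ID; a known column reuses its ID.
        labels[i] = group_of.setdefault(column, len(group_of))
    meanings = {g: column for column, g in group_of.items()}
    return labels, meanings

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
labels, meanings = groupify_hash(values)
print(labels)       # [0 1 2 3 0 3 4]
print(meanings[3])  # (11, 22)
```

Each index is visited once, so the work is linear in the array length, at the cost of a Python-level loop.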

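This functionality is already built into Pandas, so you don't need to write it. The only catch is that it takes a flat list of values, with no nested lists inside it. In this case, you can just convert each inner array to a string with list(map(str, np.array(values).T)) and factorize that!

>>> import pandas as pd
>>> pd.factorize(list(map(str, np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
 array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))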

I have converted your list of arrays into an array, and then into strings...
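If the string conversion feels like a detour, one possible alternative inside Pandas is to group on the columns directly. The sketch below assumes a reasonably recent pandas where GroupBy.ngroup and Series.to_numpy are available; the column names a0, a1, ... and the function name groupify_pandas are made up for this illustration.

```python
import numpy as np
import pandas as pd

def groupify_pandas(values):
    """Hypothetical helper: number groups of identical rows in order of first appearance."""
    # One DataFrame column per input array; the column names are arbitrary.
    df = pd.DataFrame({f"a{i}": np.asarray(v) for i, v in enumerate(values)})
    # sort=False keeps groups in the order they are first seen.
    labels = df.groupby(list(df.columns), sort=False).ngroup().to_numpy()
    # One row per group, also in order of first appearance.
    uniques = df.drop_duplicates().reset_index(drop=True)
    return labels, uniques

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
labels, uniques = groupify_pandas(values)
print(labels)  # [0 1 2 3 0 3 4]
```

Here row g of uniques holds the original values that group g stands for, which keeps them as numbers instead of strings.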

Fast algorithm to find indices where multiple arrays have the same value
