Question

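I have several arrays that hold the parameters of a time series. The parameters include speed, time of occurrence, day of occurrence, month of occurrence, elapsed time, and so on.

I am trying to find the indices that group a specified data parameter by frequency of occurrence, from most frequent to least frequent.

As a simple example, consider the following: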

import numpy as np

speed = np.array([4, 6, 8, 3, 6, 9, 7, 6, 4, 3])*100
elap_hr = sorted(np.random.randint(low=1, high=40, size=10))
## ... other time parameter arrays

print(speed)
# [400 600 800 300 600 900 700 600 400 300]

print(elap_hr)
# [ 1  2  6  7 13 19 21 28 33 38]

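So speed = 400 (2 occurrences) corresponds to elapsed hours = 1, 33, and speed = 600 (3 occurrences) corresponds to elapsed hours = 2, 13, 28.

For this example, suppose I want to group speed by frequency of occurrence. Once I have the indices that group speed from highest to lowest frequency, I can apply the same indices to the other data parameter arrays (such as elap_hr).

I first sort and argsort speed, then find the unique elements of the sorted speed. Combining these, I find the indices of the sorted speed that correspond to each unique value, grouped into one subarray per unique value.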

def get_sorted_data(data, sort_type='default'):
    if sort_type == 'default':
        res = sorted(data)
    elif sort_type == 'argsort':
        res = np.argsort(data)
    elif sort_type == 'by size':
        res = sorted(data, key=len)
    return res

def sort_data_by_frequency(data):
    uniq_data = np.unique(data)
    sorted_data = get_sorted_data(data)
    res = [np.where(sorted_data == uniq_data[i])[0] for i in range(len(uniq_data))]
    res = get_sorted_data(res, 'by size')[::-1]
    return res

sorted_speed = get_sorted_data(speed)
argsorted_speed = get_sorted_data(speed, 'argsort')
freqsorted_speed = sort_data_by_frequency(speed)

print(sorted_speed)
# [300, 300, 400, 400, 600, 600, 600, 700, 800, 900]
print(argsorted_speed)
# [3 9 0 8 1 4 7 6 2 5]
print(freqsorted_speed)
# [array([4, 5, 6]), array([2, 3]), array([0, 1]), array([9]), array([8]), array([7])]

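In freqsorted_speed, the first subarray [4, 5, 6] holds the indices of the elements [600, 600, 600] in the sorted array.

So far, so good. However, I want the indices to apply to all of the data parameter arrays, so I need to map the argsorted indices back to the original array indices.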

def get_dictionary_mapping(keys, values):
    ## since all indices are unique, there is no worry about identical keys
    return dict(zip(keys, values))

idx_orig = np.array([i for i in range(len(argsorted_speed))], dtype=int)
index_to_index_map = get_dictionary_mapping(idx_orig, argsorted_speed)

print(index_to_index_map)
# {0: 3, 1: 9, 2: 0, 3: 8, 4: 1, 5: 4, 6: 7, 7: 6, 8: 2, 9: 5}

print(speed[idx_orig])
# [400 600 800 300 600 900 700 600 400 300]

print(speed[argsorted_speed])
# [300 300 400 400 600 600 600 700 800 900]

print([index_to_index_map[idx_orig[i]] for i in range(len(idx_orig))])
# [3, 9, 0, 8, 1, 4, 7, 6, 2, 5]

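I have all the pieces I need to do what I want, but I am not sure how to put them together. Any suggestions would be appreciated.

Edit

As the final result, I would like the original indices of speed grouped by frequency, as follows: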

res = [[1, 4, 7], [3, 9], [0, 8], ...]
## corresponds to 3 600's, 2 300's, 2 400's, etc.
## for values of equal frequency, the secondary grouping is from min-to-max

That way, I can select values by the n-th most frequent value or by the frequency itself.
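To make the intended use concrete, here is a minimal sketch of that selection. It assumes only the three groups written out above; the name res_head and the variables n and pairs are made up for illustration, and the elided groups are left out.

```python
import numpy as np

speed = np.array([4, 6, 8, 3, 6, 9, 7, 6, 4, 3]) * 100

# Only the groups spelled out in the desired result; the rest are elided there.
# res_head is a hypothetical name used for this illustration.
res_head = [[1, 4, 7], [3, 9], [0, 8]]

# Select by rank: n = 0 is the most frequent value.
n = 0
most_frequent = speed[res_head[n]]
print(most_frequent)  # [600 600 600]

# Select by the frequency itself: every group that occurs exactly twice.
pairs = [g for g in res_head if len(g) == 2]
print([speed[g].tolist() for g in pairs])  # [[300, 300], [400, 400]]
```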

Answer 1

You can get the desired result as follows:

>>> idx = np.argsort(speed, kind='stable')  # stable sort keeps equal speeds in original order
>>> res = sorted(np.split(idx, np.flatnonzero(np.diff(speed[idx])) + 1), key=len, reverse=True)
>>> res
[array([1, 4, 7]), array([3, 9]), array([0, 8]), array([6]), array([2]), array([5])]
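The one-liner above can be unpacked into separate steps. The following is a sketch rather than a definitive version: the intermediate names (sorted_speed, boundaries, groups, positions) are introduced here for illustration, elap_hr uses the sample values printed in the question, and kind='stable' is an added assumption so that the order inside each group does not depend on the sort algorithm.

```python
import numpy as np

speed = np.array([4, 6, 8, 3, 6, 9, 7, 6, 4, 3]) * 100
# Sample elapsed hours taken from the printed output in the question.
elap_hr = np.array([1, 2, 6, 7, 13, 19, 21, 28, 33, 38])

# Original indices ordered by speed; a stable sort keeps ties in input order.
idx = np.argsort(speed, kind='stable')
sorted_speed = speed[idx]

# Positions in the sorted array where a new distinct value begins.
boundaries = np.flatnonzero(np.diff(sorted_speed)) + 1

# Each chunk of idx holds the original indices of one distinct speed.
groups = np.split(idx, boundaries)

# Python's sort is stable even with reverse=True, so groups of equal
# size stay ordered by speed from min to max.
res = sorted(groups, key=len, reverse=True)

# The same indices apply to any other parameter array.
for g in res:
    print(speed[g[0]], g, elap_hr[g])

# The question's freqsorted_speed holds positions in the sorted array;
# indexing idx with such positions gives the original indices.
positions = np.array([4, 5, 6])
print(idx[positions])  # [1 4 7]
```

Because idx already contains original indices, no dictionary is needed to map sorted positions back: indexing idx with a group of sorted positions does the same job.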

If all of the indices are in subarrays, how can I use a dictionary to map array indices to the corresponding argsorted indices?
