Converting an integer array to a dictionary of indices

Asked: 2017-03-11 20:50:52

Tags: python arrays numpy

I have a (large) integer array like

materials = [0, 0, 47, 0, 2, 2, 47]  # ...

with only a few unique entries, and I would like to convert it into a dictionary of indices, i.e.,

d = {
    0: [0, 1, 3],
    2: [4, 5],
    47: [2, 6],
    }

What is the most efficient way to do this? (NumPy welcome.)

8 answers:

Answer 0 (score: 4)

An alternative solution using the enumerate() and dict.setdefault() functions:

materials = [0, 0, 47, 0, 2, 2, 47]
d = {}
for k,m in enumerate(materials):
    d.setdefault(m, []).append(k)

print(d)

Output:

{0: [0, 1, 3], 2: [4, 5], 47: [2, 6]}

Answer 1 (score: 3)

There is no need for numpy; these are standard Python structures, and a dict comprehension solves your problem nicely:

materials = [0, 0, 47, 0, 2, 2, 47]

d = {v : [i for i,x in enumerate(materials) if x==v] for v in set(materials)}

print(d)

Result:

{0: [0, 1, 3], 2: [4, 5], 47: [2, 6]}

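[i for i,x in enumerate(materials) if x==v] finds all indices of an element in the list (index only finds the first one).

In the first version of my answer I iterated over the list itself, but that is wasteful: when a value occurs many times, its key is overwritten repeatedly, and since the inner list comprehension has O(n) complexity, the overall complexity is not great.

While I was writing this final remark, someone suggested iterating over the unique elements instead, which is a good idea, so convert the input list to a set.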
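The difference between the two versions can be sketched as follows. This is only an illustration of the reasoning above; the helper name scan is made up here and is not part of the answer.

```python
materials = [0, 0, 47, 0, 2, 2, 47]


def scan(values, key):
    """Return every index at which `key` occurs (one full O(n) pass)."""
    return [i for i, x in enumerate(values) if x == key]


# Iterating over the list itself: one scan per element (7 here),
# and keys that occur several times are simply overwritten.
per_element = {v: scan(materials, v) for v in materials}

# Iterating over the unique values: one scan per distinct value (3 here).
per_unique = {v: scan(materials, v) for v in set(materials)}

print(per_element == per_unique)              # True
print(len(materials), len(set(materials)))    # 7 3
```

Both dictionaries are identical; only the number of passes over the list changes, from len(materials) to len(set(materials)).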

Answer 2 (score: 3)

You may find collections.defaultdict useful here: the first time an element is encountered, it creates a new list for you.

from collections import defaultdict

indices = defaultdict(list)

for i, elem in enumerate(materials):
    indices[elem].append(i)

Answer 3 (score: 3)

Here is a somewhat clumsy NumPy solution:

import numpy as np

a = np.random.randint(0, 1000, 1000000)
index = np.argsort(a, kind='mergesort')
as_  = a[index]
jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
result = {k: v for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}
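To see what each step does, here is the same pipeline applied to the small array from the question. The intermediate values in the comments were worked out by hand for this input; the conversion to plain int and list at the end is an addition made only to keep the printed output readable.

```python
import numpy as np

a = np.array([0, 0, 47, 0, 2, 2, 47])

# Stable sort: equal values keep their original relative order.
index = np.argsort(a, kind='mergesort')   # [0 1 3 4 5 2 6]
as_ = a[index]                            # [ 0  0  0  2  2 47 47]

# Positions in the sorted array where a new value starts.
jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]   # [0 3 5]

# Cut the sorted index array at those positions, one piece per value.
groups = np.split(index, jumps[1:])
result = {int(k): v.tolist() for k, v in zip(as_[jumps], groups)}
print(result)   # {0: [0, 1, 3], 2: [4, 5], 47: [2, 6]}
```

Because mergesort is stable, the indices inside each group come out in ascending order without any extra sorting.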

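Benchmark

numpy wins for not-too-large n; because it uses an O(n log n) sorting algorithm, the margin is small. (pp2 is a variant that replaces the slow but stable mergesort with quicksort, at the cost of having to sort the individual index lists afterwards; pp3 replaces the full sort with argpartition, which gains some speed if the number of unique elements is small compared to the number of elements.)

10 distinct integer values in the original array: [benchmark plot]

100 distinct integer values in the original array: [benchmark plot]

Benchmark code for reference: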

import numpy as np
from collections import defaultdict
import perfplot


def pp(a):
    index = np.argsort(a, kind='mergesort')
    as_ = a[index]
    jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
    pp_out = {k: v for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}
    return pp_out


def pp2(a):
    index = np.argsort(a)
    as_ = a[index]
    jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
    pp_out = {k: np.sort(v)
              for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}
    return pp_out


def Denziloe_JFFabre(a):
    df_out = {v: [i for i, x in enumerate(a) if x == v] for v in set(a)}
    return df_out


def FCouzo(a):
    fc_out = defaultdict(list)
    for i, elem in enumerate(a):
        fc_out[elem].append(i)
    return fc_out


def KKSingh(a):
    kks_out = defaultdict(list)
    list(map(lambda x: kks_out[x[0]].append(x[1]), zip(a, range(len(a)))))
    return kks_out


def TMcDonaldJensen(a):
    mdj_out = defaultdict(list)
    for i, elem in enumerate(a):
        mdj_out[elem].append(i)
    return mdj_out


def RomanPerekhrest(a):
    rp_out = {}
    for k, m in enumerate(a):
        rp_out.setdefault(m, []).append(k)
    return rp_out


def SchloemerHist(a):
    np.histogram(a, bins=np.arange(min(a), max(a)+2))
    return


def SchloemerWhere(a):
    out = {v: np.where(v == a)[0] for v in set(a)}
    return out


perfplot.show(
        setup=lambda n: np.random.randint(0, 10, n),
        kernels=[
            pp, pp2, Denziloe_JFFabre, FCouzo, KKSingh,
            TMcDonaldJensen, RomanPerekhrest, SchloemerHist, SchloemerWhere
            ],
        n_range=[2**k for k in range(19)],
        xlabel='len(a)',
        logx=True,
        logy=True,
        )
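The benchmark only times the kernels; it does not compare their results. A minimal consistency check between the sorting-based kernel and the defaultdict kernel might look like the sketch below. The function names and the normalize helper are hypothetical and exist only for this illustration, and the random input is an arbitrary choice.

```python
import numpy as np
from collections import defaultdict


def by_sorting(a):
    # Same steps as pp above: stable argsort, then split at value changes.
    index = np.argsort(a, kind='mergesort')
    as_ = a[index]
    jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
    return {k: v for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}


def by_defaultdict(a):
    # Same steps as FCouzo above: append each index to its value's list.
    out = defaultdict(list)
    for i, elem in enumerate(a):
        out[elem].append(i)
    return out


def normalize(d):
    """Hypothetical helper: plain ints and lists, so results compare with ==."""
    return {int(k): [int(i) for i in v] for k, v in d.items()}


rng = np.random.default_rng(0)
a = rng.integers(0, 10, 1000)
same = normalize(by_sorting(a)) == normalize(by_defaultdict(a))
print(same)   # True
```

The comparison goes through normalize because one kernel returns NumPy arrays and the other Python lists, which cannot be compared directly with ==.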

Answer 4 (score: 1)

A comprehension does this nicely:

d = {key:[i for i, v in enumerate(materials) if v == key] for key in set(materials)}

Answer 5 (score: 1)

I use defaultdict, which is more efficient (O(n) time) than Jean's answer (O(n^2)):

from collections import defaultdict
materials = [0, 0, 47, 0, 2, 2, 47]
d = defaultdict(list)
for i, elem in enumerate(materials):
    d[elem].append(i)

d is now equal to:

defaultdict(<type 'list'>, {0: [0, 1, 3], 2: [4, 5], 47: [2, 6]})
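The output above comes from Python 2. As a small sketch of what to expect on Python 3.7 or later, where the type prints as <class 'list'> and keys keep their insertion order, the snippet below also converts the result to a plain dict. The variable name plain is just illustrative.

```python
from collections import defaultdict

materials = [0, 0, 47, 0, 2, 2, 47]
d = defaultdict(list)
for i, elem in enumerate(materials):
    d[elem].append(i)

print(d)       # defaultdict(<class 'list'>, {0: [0, 1, 3], 47: [2, 6], 2: [4, 5]})

# A plain dict copy does not create entries on lookup of a missing key.
plain = dict(d)
print(plain)   # {0: [0, 1, 3], 47: [2, 6], 2: [4, 5]}

_ = d[99]                  # merely reading a missing key inserts an empty list
print(99 in d)             # True
print(99 in plain)         # False
```

Converting to dict once the index is built avoids accidentally adding empty entries when looking up values that never occurred.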

Answer 6 (score: 1)

Another one-liner, this time with numpy.where:

out = {v: np.where(v == a)[0] for v in np.unique(a)}

For some applications, a boolean array may be sufficient:

out = {v: v == a for v in np.unique(a)}
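A short usage sketch on the array from the question, assuming a is a NumPy array (the comparison v == a relies on elementwise broadcasting and would not work on a plain list). The keys are converted to int here only for readability.

```python
import numpy as np

a = np.array([0, 0, 47, 0, 2, 2, 47])

index_lists = {int(v): np.where(a == v)[0] for v in np.unique(a)}
masks = {int(v): a == v for v in np.unique(a)}

print(index_lists[47].tolist())   # [2, 6]
print(masks[47].tolist())         # [False, False, True, False, False, False, True]

# A mask can select or count elements directly, without index lists.
print(a[masks[2]].tolist())       # [2, 2]
print(int(masks[0].sum()))        # 3
```

Each mask has the same length as a, so this variant trades memory for skipping the np.where call.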

Note that for large arrays numpy.unique is faster than set(); if there are only a few unique entries, then [...] will be greater than [...].

In any case, for most array sizes the above is the fastest method:

10 distinct integers: [benchmark plot]

100 distinct integers: [benchmark plot]

Code:

import numpy as np
from collections import defaultdict
import perfplot


def pp(a):
    index = np.argsort(a, kind='mergesort')
    as_ = a[index]
    jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
    pp_out = {k: v for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}
    return pp_out


def pp2(a):
    index = np.argsort(a)
    as_ = a[index]
    jumps = np.r_[0, 1 + np.where(np.diff(as_) != 0)[0]]
    pp_out = {k: np.sort(v)
              for k, v in zip(as_[jumps], np.split(index, jumps[1:]))}
    return pp_out


def Denziloe_JFFabre(a):
    df_out = {v: [i for i, x in enumerate(a) if x == v] for v in np.unique(a)}
    return df_out


def FCouzo(a):
    fc_out = defaultdict(list)
    for i, elem in enumerate(a):
        fc_out[elem].append(i)
    return fc_out


def KKSingh(a):
    kks_out = defaultdict(list)
    list(map(lambda x: kks_out[x[0]].append(x[1]), zip(a, range(len(a)))))
    return kks_out


def TMcDonaldJensen(a):
    mdj_out = defaultdict(list)
    for i, elem in enumerate(a):
        mdj_out[elem].append(i)
    return mdj_out


def RomanPerekhrest(a):
    rp_out = {}
    for k, m in enumerate(a):
        rp_out.setdefault(m, []).append(k)
    return rp_out


def SchloemerHist(a):
    np.histogram(a, bins=np.arange(min(a), max(a)+2))
    return


def SchloemerWhere(a):
    out = {v: np.where(v == a)[0] for v in np.unique(a)}
    return out


def SchloemerBooleanOnly(a):
    out = {v: v == a for v in np.unique(a)}
    return out


perfplot.show(
        setup=lambda n: np.random.randint(0, 100, n),
        kernels=[
            pp, pp2, Denziloe_JFFabre, FCouzo, KKSingh,
            TMcDonaldJensen, RomanPerekhrest, SchloemerHist, SchloemerWhere,
            SchloemerBooleanOnly
            ],
        n_range=[2**k for k in range(17)],
        xlabel='len(a)',
        logx=True,
        logy=True,
        )

Answer 7 (score: 0)

Just for fun, here is a solution using numpy.histogram:

np.histogram(a, bins=np.arange(min(a), max(a)+2))

I thought it might perform well, but Paul's solution is still better:

[benchmark plot]
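For reference, np.histogram returns per-bin counts together with the bin edges, so the call above yields how often each value occurs rather than where it occurs. A minimal sketch on the array from the question, where the occurrences dictionary is an illustrative addition and not part of the answer:

```python
import numpy as np

a = np.array([0, 0, 47, 0, 2, 2, 47])

# One unit-width bin per integer from min(a) to max(a).
counts, edges = np.histogram(a, bins=np.arange(a.min(), a.max() + 2))

# Keep only the values that actually occur.
occurrences = {int(v): int(c) for v, c in zip(edges[:-1], counts) if c > 0}
print(occurrences)   # {0: 3, 2: 2, 47: 2}
```

Turning these counts into index lists would still need one of the other approaches in this thread.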