Creating a dictionary of indices from a list of integers

Asked: 2018-01-30 18:41:33

Tags: python numpy

I have a (long) array a containing a handful of different integers. I would now like to create a dictionary where the keys are the integers and the values are arrays of the indices at which the corresponding integer occurs, i.e.

import numpy
a = numpy.array([1, 1, 5, 5, 1])
u = numpy.unique(a)
d = {val: numpy.where(a == val)[0] for val in u}
print(d)

This yields

{1: array([0, 1, 4]), 5: array([2, 3])}
This works fine, but first calling unique and then several wheres seems rather wasteful.

np.digitize seems less than ideal because you have to specify the bins beforehand.

Any ideas on how to improve on the above?

4 answers:

Answer 0 (score: 5)

Approach #1

One approach based on sorting would be -

import numpy as np

def group_into_dict(a):
    # Get argsort indices
    sidx = a.argsort()

    # Use argsort indices to sort input array
    sorted_a = a[sidx]

    # Get indices that define the grouping boundaries based on identical elems
    cut_idx = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1], True])

    # Form the final dict by slicing the argsort indices for the values,
    # with the group starts as the keys
    return {sorted_a[i]: sidx[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])}

Sample run -

In [55]: a
Out[55]: array([1, 1, 5, 5, 1])

In [56]: group_into_dict(a)
Out[56]: {1: array([0, 1, 4]), 5: array([2, 3])}

Timings on an array with 1000000 elements and varying proportions of unique numbers, comparing the proposed approach against the original one -

# 1/100 unique numbers
In [75]: a = np.random.randint(0,10000,(1000000))

In [76]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 6.62 s per loop

In [77]: %timeit group_into_dict(a)
10 loops, best of 3: 121 ms per loop

# 1/1000 unique numbers
In [78]: a = np.random.randint(0,1000,(1000000))

In [79]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 720 ms per loop

In [80]: %timeit group_into_dict(a)
10 loops, best of 3: 92.1 ms per loop

# 1/10000 unique numbers
In [81]: a = np.random.randint(0,100,(1000000))

In [82]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 120 ms per loop

In [83]: %timeit group_into_dict(a)
10 loops, best of 3: 75 ms per loop

# 1/50000 unique numbers
In [84]: a = np.random.randint(0,20,(1000000))

In [85]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 60.8 ms per loop

In [86]: %timeit group_into_dict(a)
10 loops, best of 3: 60.3 ms per loop

So, if you are dealing with only 20 or fewer unique numbers, stick with the original approach; otherwise the sorting-based one seems to work well.

Approach #2

A pandas-based approach suited for a very small number of unique values -

In [142]: a
Out[142]: array([1, 1, 5, 5, 1])

In [143]: import pandas as pd

In [144]: {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
Out[144]: {1: array([0, 1, 4]), 5: array([2, 3])}

Timings on an array with 1000000 elements and 20 unique elements -

In [146]: a = np.random.randint(0,20,(1000000))

In [147]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 35.6 ms per loop

# Original solution
In [148]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 58 ms per loop

And with even fewer unique elements -

In [149]: a = np.random.randint(0,10,(1000000))

In [150]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 25.3 ms per loop

In [151]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 44.9 ms per loop

In [152]: a = np.random.randint(0,5,(1000000))

In [153]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
100 loops, best of 3: 17.9 ms per loop

In [154]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 34.4 ms per loop

How does pandas help here with fewer unique elements?

With the sorting-based approach #1, for the case of 20 unique elements, getting the argsort indices is the bottleneck -

In [164]: a = np.random.randint(0,20,(1000000))

In [165]: %timeit a.argsort()
10 loops, best of 3: 51 ms per loop

Now, the pandas-based function gives us the unique elements (be they negative numbers or anything else), against which we simply compare the elements of the input array, without any need to sort. Let's see the improvement on that side:

In [166]: %timeit pd.Series(a).unique()
100 loops, best of 3: 3.17 ms per loop

Of course, it then still needs to get the np.flatnonzero indices, but even so it ends up relatively more efficient.
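The distinction this answer relies on — pd.Series.unique returns the uniques in order of first appearance without sorting, whereas np.unique sorts — can be seen directly on a tiny array (illustration mine, not part of the original answer):

```python
import numpy as np
import pandas as pd

a = np.array([5, 5, 1, 1, 5])
print(np.unique(a).tolist())           # sorted: [1, 5]
print(pd.Series(a).unique().tolist())  # order of first appearance: [5, 1]
```

Skipping the sort is exactly where the 51 ms vs 3.17 ms gap above comes from.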

Answer 1 (score: 3)

With ns, nd = number of samples, number of distinct values, your solution is O(ns*nd), which is inefficient.

A straightforward O(ns) approach using defaultdict:

from collections import defaultdict

d = defaultdict(list)
for i, x in enumerate(a):
    d[x].append(i)

It is unfortunately slow because of the Python loop, but faster than your code whenever nd/ns > 1%.
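To get back the same dict-of-arrays output as the question, the lists can be converted at the end; a minimal sketch (the wrapper name is mine):

```python
import numpy as np
from collections import defaultdict

def group_with_defaultdict(a):
    # Single pure-Python pass; indices are appended in order of appearance.
    d = defaultdict(list)
    for i, x in enumerate(a):
        d[x].append(i)
    # Convert the lists to arrays to match the original output format.
    return {k: np.asarray(v) for k, v in d.items()}

d = group_with_defaultdict(np.array([1, 1, 5, 5, 1]))
# d[1] → array([0, 1, 4]), d[5] → array([2, 3])
```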

Another linear O(ns) solution is possible if nd/ns << 1 (here optimized with numba):

import numba
import numpy as np

@numba.njit
def filldict_(a):
    v = a.max() + 1
    # First pass: count occurrences of each value.
    cnts = np.zeros(v, np.int64)
    for x in a:
        cnts[x] += 1
    g = cnts.max()
    # Second pass: fill each row of res with the indices of its value.
    res = np.empty((v, g), np.int64)
    cnts[:] = 0
    i = 0
    for x in a:
        res[x, cnts[x]] = i
        cnts[x] += 1
        i += 1
    return res, cnts, v

def filldict(a):
    res, cnts, v = filldict_(a)
    return {x: res[x, :cnts[x]] for x in range(v) if cnts[x] > 0}

It is faster for random numbers with small keys. A run:

In [51]: a=numpy.random.randint(0,100,1000000)

In [52]: %timeit d=group_into_dict(a) #Divakar
134 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [53]: %timeit c=filldict(a)
11.2 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If the keys are big integers, a lookup-table mechanism can be added with almost no overhead.

Answer 2 (score: 3)

If there are only a few distinct values, it may be worthwhile to use argpartition instead of argsort. The downside is that this requires a compression step:

import numpy as np

def f_pp_1(a):
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    idx = np.argsort(a[unq])
    unq = unq[idx]
    ws[a[unq]] = np.arange(len(unq))
    compressed = ws[a]
    counts = np.cumsum(np.bincount(compressed))
    return dict(zip(a[unq], np.split(np.argpartition(a, counts[:-1]), counts[:-1])))

If the uniques are small, we can save the compression step:

def f_pp_2(a):
    bc = np.bincount(a)
    keys, = np.where(bc)
    counts = np.cumsum(bc[keys])
    return dict(zip(keys, np.split(np.argpartition(a, counts[:-1]), counts[:-1])))
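f_pp_2 can be sanity-checked on the question's sample array; note that argpartition does not guarantee the order of indices within a group, so only the index sets should be compared (this check is mine, not the answerer's):

```python
import numpy as np

def f_pp_2(a):
    bc = np.bincount(a)
    keys, = np.where(bc)
    counts = np.cumsum(bc[keys])
    return dict(zip(keys, np.split(np.argpartition(a, counts[:-1]), counts[:-1])))

sol = f_pp_2(np.array([1, 1, 5, 5, 1]))
# sorted(sol[1]) → [0, 1, 4], sorted(sol[5]) → [2, 3]
```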

data = np.random.randint(0, 10, (5,))[np.random.randint(0, 5, (10000000,))]


sol = f_pp_1(data)
for k, v in sol.items():
    assert np.all(k == data[v])

If we can avoid the call to unique, the where-based approach is competitive for a small number of distinct values:

def f_OP_plus(a):
    ws = np.empty(a.max()+1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    idx = np.argsort(a[unq])
    unq = unq[idx]
    return {val: np.where(a==val)[0] for val in unq}

Here are timings (best of 3, 10 per block) using the same test arrays as @Divakar (randint(0, nd, (ns,)); nd, ns = number of distincts, number of samples):

nd, ns: 5, 1000000
OP                   39.88609421 ms
OP_plus              13.04150990 ms
Divakar_1            44.14700069 ms
Divakar_2            21.64940450 ms
pp_1                 33.15216140 ms
pp_2                 22.43267260 ms
nd, ns: 10, 1000000
OP                   52.33878891 ms
OP_plus              17.14743648 ms
Divakar_1            57.76002519 ms
Divakar_2            30.70066951 ms
pp_1                 45.33982391 ms
pp_2                 34.71166079 ms
nd, ns: 20, 1000000
OP                   67.47841339 ms
OP_plus              26.41335099 ms
Divakar_1            71.37646740 ms
Divakar_2            43.09316459 ms
pp_1                 57.16468811 ms
pp_2                 45.55416510 ms
nd, ns: 50, 1000000
OP                   98.91191521 ms
OP_plus              51.15756912 ms
Divakar_1            72.72288438 ms
Divakar_2            70.31920571 ms
pp_1                 63.78925461 ms
pp_2                 53.00321991 ms
nd, ns: 100, 1000000
OP                  148.17743159 ms
OP_plus              92.62091429 ms
Divakar_1            85.02774101 ms
Divakar_2           116.78823209 ms
pp_1                 77.01576019 ms
pp_2                 66.70976470 ms

If, instead of using the first nd integers for the uniques, we draw them randomly from between 0 and 10000:

nd, ns: 5, 1000000
OP                   40.11689581 ms
OP_plus              12.99256920 ms
Divakar_1            42.13181480 ms
Divakar_2            21.55767360 ms
pp_1                 33.21835019 ms
pp_2                 23.46851982 ms
nd, ns: 10, 1000000
OP                   52.84317869 ms
OP_plus              17.96655210 ms
Divakar_1            57.74175161 ms
Divakar_2            32.31985010 ms
pp_1                 44.79893579 ms
pp_2                 33.42640731 ms
nd, ns: 20, 1000000
OP                   66.46886449 ms
OP_plus              25.78120639 ms
Divakar_1            66.58960858 ms
Divakar_2            42.47685110 ms
pp_1                 53.67698781 ms
pp_2                 44.53037870 ms
nd, ns: 50, 1000000
OP                   98.95576960 ms
OP_plus              50.79147881 ms
Divakar_1            72.44545210 ms
Divakar_2            70.91441818 ms
pp_1                 64.19071071 ms
pp_2                 53.36350428 ms
nd, ns: 100, 1000000
OP                  145.62422500 ms
OP_plus              90.82918381 ms
Divakar_1            76.92769479 ms
Divakar_2           115.24481240 ms
pp_1                 70.85122908 ms
pp_2                 58.85340699 ms

Answer 3 (score: 0)

Pandas solution 1: use groupby and its indices attribute

import pandas as pd

df = pd.DataFrame(a)
d = df.groupby(0).indices

a = np.random.randint(0,10000,(1000000))
%%timeit
df = pd.DataFrame(a)
d = df.groupby(0).indices
42.6 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = np.random.randint(0,100,(1000000))
%%timeit
df = pd.DataFrame(a)
d = df.groupby(0).indices
22.3 ms ± 5.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
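On the question's sample array, groupby(0).indices reproduces the desired dictionary directly; a quick check (mine):

```python
import numpy as np
import pandas as pd

a = np.array([1, 1, 5, 5, 1])
d = pd.DataFrame(a).groupby(0).indices
# d[1] → array([0, 1, 4]), d[5] → array([2, 3])
```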

Pandas solution 2: use groupby alone (if you already know the keys, or can get them quickly by some other means)

a = np.random.randint(0,100,(1000000))
%%timeit 
df = pd.DataFrame(a)
d = df.groupby(0)
206 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

groupby itself is very fast, but it does not give you the keys. If you already know the keys, you can treat the groupby object almost as a dictionary. Usage:

d.get_group(key).index  # index part is what you need!

Downside: d.get_group(key) itself takes a non-trivial amount of time.

%timeit d.get_group(10).index 
496 µs ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So whether to take this approach depends on your application and on whether you know the keys.

If all your values are positive, you can use np.nonzero(np.bincount(a))[0] to get the keys at reasonable speed (1.57 ms ± 78.2 µs for a = np.random.randint(0,1000,(1000000))).
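A quick sketch of that bincount trick on the question's sample array (example mine): bincount counts occurrences per value, and nonzero keeps only the values that actually occur.

```python
import numpy as np

a = np.array([1, 1, 5, 5, 1])
keys = np.nonzero(np.bincount(a))[0]
# keys → array([1, 5])
```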