I have a (long) array a of a handful of different integers, and I would now like to create a dictionary where the keys are the integers and the values are arrays of the indices at which the corresponding integer occurs in a. The following
a = numpy.array([1, 1, 5, 5, 1])
u = numpy.unique(a)
d = {val: numpy.where(a==val)[0] for val in u}
print(d)
prints
{1: array([0, 1, 4]), 5: array([2, 3])}
This works alright, but calling unique first, followed by a couple of wheres, seems rather wasteful.
np.digitize doesn't seem ideal either, since you have to specify the bins in advance.
Any ideas on how to improve the above?
Answer 0 (score: 5)
Approach #1
One approach, based on sorting, would be -
import numpy as np

def group_into_dict(a):
    # Get argsort indices
    sidx = a.argsort()
    # Use argsort indices to sort input array
    sorted_a = a[sidx]
    # Get indices that define the grouping boundaries based on identical elems
    cut_idx = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1], True])
    # Form the final dict by slicing the argsort indices for the values and
    # using the group starts as the keys
    return {sorted_a[i]: sidx[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])}
Sample run -
In [55]: a
Out[55]: array([1, 1, 5, 5, 1])
In [56]: group_into_dict(a)
Out[56]: {1: array([0, 1, 4]), 5: array([2, 3])}
Timings on an array with 1000000 elements and varying proportions of unique numbers, comparing the proposed approach against the original one -
# 1/100 unique numbers
In [75]: a = np.random.randint(0,10000,(1000000))
In [76]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 6.62 s per loop
In [77]: %timeit group_into_dict(a)
10 loops, best of 3: 121 ms per loop
# 1/1000 unique numbers
In [78]: a = np.random.randint(0,1000,(1000000))
In [79]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
1 loop, best of 3: 720 ms per loop
In [80]: %timeit group_into_dict(a)
10 loops, best of 3: 92.1 ms per loop
# 1/10000 unique numbers
In [81]: a = np.random.randint(0,100,(1000000))
In [82]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 120 ms per loop
In [83]: %timeit group_into_dict(a)
10 loops, best of 3: 75 ms per loop
# 1/50000 unique numbers
In [84]: a = np.random.randint(0,20,(1000000))
In [85]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 60.8 ms per loop
In [86]: %timeit group_into_dict(a)
10 loops, best of 3: 60.3 ms per loop
So, if you are dealing with only 20 or fewer unique numbers, stick to the original approach; otherwise the sorting-based one seems to work well.
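That observation could be wrapped in a small dispatcher; this is only a sketch (not part of the original answer), and note that np.unique itself sorts, so the check below is not free:

def group_indices(a, few_uniques_threshold=20):
    # The threshold of 20 is lifted straight from the timings above, not tuned.
    u = np.unique(a)
    if len(u) <= few_uniques_threshold:
        return {val: np.where(a == val)[0] for val in u}
    return group_into_dict(a)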
Approach #2
A Pandas-based one, suitable for very few unique numbers -
In [142]: a
Out[142]: array([1, 1, 5, 5, 1])
In [143]: import pandas as pd
In [144]: {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
Out[144]: {1: array([0, 1, 4]), 5: array([2, 3])}
Timings on an array with 1000000 elements and 20 unique elements -
In [146]: a = np.random.randint(0,20,(1000000))
In [147]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 35.6 ms per loop
# Original solution
In [148]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 58 ms per loop
And with even fewer unique elements -
In [149]: a = np.random.randint(0,10,(1000000))
In [150]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
10 loops, best of 3: 25.3 ms per loop
In [151]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 44.9 ms per loop
In [152]: a = np.random.randint(0,5,(1000000))
In [153]: %timeit {u:np.flatnonzero(a==u) for u in pd.Series(a).unique()}
100 loops, best of 3: 17.9 ms per loop
In [154]: %timeit {val: np.where(a==val)[0] for val in np.unique(a)}
10 loops, best of 3: 34.4 ms per loop
How is pandas helping here with fewer elements?
With the sorting-based approach #1, for the case of 20 unique elements, getting the argsort indices is the bottleneck -
In [164]: a = np.random.randint(0,20,(1000000))
In [165]: %timeit a.argsort()
10 loops, best of 3: 51 ms per loop
Now, the pandas-based function gives us the unique elements, be they negative numbers or anything else, which we simply compare against the elements of the input array without needing to sort. Let's see the improvement on that side:
In [166]: %timeit pd.Series(a).unique()
100 loops, best of 3: 3.17 ms per loop
Of course, it then needs the np.flatnonzero calls to get the indices, but that still keeps it comparatively more efficient.
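To make the difference concrete, here is a small sketch with a made-up example array: pd.Series.unique is hash-based and returns values in order of first appearance, whereas np.unique sorts -

import numpy as np
import pandas as pd

a = np.array([5, -3, 5, 1, -3])
print(np.unique(a))           # [-3  1  5] -> sorted, costs a sort
print(pd.Series(a).unique())  # [ 5 -3  1] -> order of first appearance, no sort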
Answer 1 (score: 3)
With ns, nd = number of samples, number of distinct values, your solution is O(ns*nd), so inefficient.
A simple O(ns) approach using defaultdict:
from collections import defaultdict

d = defaultdict(list)
for i, x in enumerate(a):
    d[x].append(i)
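For reference, a minimal run on the example array from the question; note the values come out as plain Python lists rather than numpy arrays:

from collections import defaultdict

a = [1, 1, 5, 5, 1]
d = defaultdict(list)
for i, x in enumerate(a):
    d[x].append(i)
print(dict(d))  # {1: [0, 1, 4], 5: [2, 3]}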
Unfortunately it is slow because of the Python loop, but it is still faster than your approach if nd/ns > 1%.
Another linear O(ns) solution is possible if nd/ns << 1, here optimized with numba:
import numba
import numpy as np

@numba.njit
def filldict_(a):
    # First pass: count occurrences of each value
    v = a.max() + 1
    cnts = np.zeros(v, np.int64)
    for x in a:
        cnts[x] += 1
    g = cnts.max()
    # Second pass: fill one row of indices per value
    res = np.empty((v, g), np.int64)
    cnts[:] = 0
    i = 0
    for x in a:
        res[x, cnts[x]] = i
        cnts[x] += 1
        i += 1
    return res, cnts, v

def filldict(a):
    res, cnts, v = filldict_(a)
    return {x: res[x, :cnts[x]] for x in range(v) if cnts[x] > 0}
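A quick sanity check on the example from the question (assuming numba is installed):

a = np.array([1, 1, 5, 5, 1])
print(filldict(a))  # {1: array([0, 1, 4]), 5: array([2, 3])}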
It is faster for random numbers with small keys. Run:
In [51]: a=numpy.random.randint(0,100,1000000)
In [52]: %timeit d=group_into_dict(a) #Divakar
134 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [53]: %timeit c=filldict(a)
11.2 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the keys are big integers, a lookup-table mechanism can be added with almost no overhead.
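A hedged sketch of that lookup-table idea, using np.unique(..., return_inverse=True) as the compression step (an assumption on my part; note it reintroduces a sort, so a hash-based mapping would be preferable for very long arrays):

def filldict_bigkeys(a):
    # Compress arbitrary (big) integer keys into the dense range 0..nd-1,
    # run the numba routine on the compact codes, then map the keys back.
    keys, compact = np.unique(a, return_inverse=True)
    res, cnts, v = filldict_(compact)
    return {keys[x]: res[x, :cnts[x]] for x in range(v) if cnts[x] > 0}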
Answer 2 (score: 3)
If there are only a handful of distinct values, using argpartition instead of argsort may be worthwhile. The downside is that this requires a compression step:
import numpy as np

def f_pp_1(a):
    # Find one representative index per distinct value: after ws[a] = rng,
    # ws[a] == rng holds exactly at the last occurrence of each value
    ws = np.empty(a.max() + 1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    # Sort the representatives by value
    idx = np.argsort(a[unq])
    unq = unq[idx]
    # Compress values to 0..nd-1, then cut the argpartitioned indices
    # at the cumulative counts
    ws[a[unq]] = np.arange(len(unq))
    compressed = ws[a]
    counts = np.cumsum(np.bincount(compressed))
    return dict(zip(a[unq], np.split(np.argpartition(a, counts[:-1]), counts[:-1])))
If the unique values themselves are small, we can skip the compression step:
def f_pp_2(a):
    bc = np.bincount(a)
    keys, = np.where(bc)
    counts = np.cumsum(bc[keys])
    return dict(zip(keys, np.split(np.argpartition(a, counts[:-1]), counts[:-1])))

data = np.random.randint(0, 10, (5,))[np.random.randint(0, 5, (10000000,))]
sol = f_pp_1(data)
for k, v in sol.items():
    assert np.all(k == data[v])
And if we can avoid np.unique, the where-based approach itself is competitive for few distincts:
def f_OP_plus(a):
    # Same unique-finding trick as above, replacing the np.unique call
    ws = np.empty(a.max() + 1, int)
    rng = np.arange(a.size)
    ws[a] = rng
    unq = rng[ws[a] == rng]
    idx = np.argsort(a[unq])
    unq = unq[idx]
    return {val: np.where(a == val)[0] for val in unq}
Here are timings (best of 3, 10 loops per block) using the same test arrays as @Divakar (randint(0, nd, (ns,)), with nd, ns = number of distincts, number of samples):
nd, ns: 5, 1000000
OP 39.88609421 ms
OP_plus 13.04150990 ms
Divakar_1 44.14700069 ms
Divakar_2 21.64940450 ms
pp_1 33.15216140 ms
pp_2 22.43267260 ms
nd, ns: 10, 1000000
OP 52.33878891 ms
OP_plus 17.14743648 ms
Divakar_1 57.76002519 ms
Divakar_2 30.70066951 ms
pp_1 45.33982391 ms
pp_2 34.71166079 ms
nd, ns: 20, 1000000
OP 67.47841339 ms
OP_plus 26.41335099 ms
Divakar_1 71.37646740 ms
Divakar_2 43.09316459 ms
pp_1 57.16468811 ms
pp_2 45.55416510 ms
nd, ns: 50, 1000000
OP 98.91191521 ms
OP_plus 51.15756912 ms
Divakar_1 72.72288438 ms
Divakar_2 70.31920571 ms
pp_1 63.78925461 ms
pp_2 53.00321991 ms
nd, ns: 100, 1000000
OP 148.17743159 ms
OP_plus 92.62091429 ms
Divakar_1 85.02774101 ms
Divakar_2 116.78823209 ms
pp_1 77.01576019 ms
pp_2 66.70976470 ms
And if, instead of using the first nd integers as the unique values, we draw them randomly between 0 and 10000:
nd, ns: 5, 1000000
OP 40.11689581 ms
OP_plus 12.99256920 ms
Divakar_1 42.13181480 ms
Divakar_2 21.55767360 ms
pp_1 33.21835019 ms
pp_2 23.46851982 ms
nd, ns: 10, 1000000
OP 52.84317869 ms
OP_plus 17.96655210 ms
Divakar_1 57.74175161 ms
Divakar_2 32.31985010 ms
pp_1 44.79893579 ms
pp_2 33.42640731 ms
nd, ns: 20, 1000000
OP 66.46886449 ms
OP_plus 25.78120639 ms
Divakar_1 66.58960858 ms
Divakar_2 42.47685110 ms
pp_1 53.67698781 ms
pp_2 44.53037870 ms
nd, ns: 50, 1000000
OP 98.95576960 ms
OP_plus 50.79147881 ms
Divakar_1 72.44545210 ms
Divakar_2 70.91441818 ms
pp_1 64.19071071 ms
pp_2 53.36350428 ms
nd, ns: 100, 1000000
OP 145.62422500 ms
OP_plus 90.82918381 ms
Divakar_1 76.92769479 ms
Divakar_2 115.24481240 ms
pp_1 70.85122908 ms
pp_2 58.85340699 ms
Answer 3 (score: 0)
You can use pandas groupby and its indices attribute:

import pandas as pd

df = pd.DataFrame(a)
d = df.groupby(0).indices
a = np.random.randint(0,10000,(1000000))
%%timeit
df = pd.DataFrame(a)
d = df.groupby(0).indices
42.6 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = np.random.randint(0,100,(1000000))
%%timeit
df = pd.DataFrame(a)
d = df.groupby(0).indices
22.3 ms ± 5.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using groupby alone (if you already know the keys, or can get them quickly some other way):

a = np.random.randint(0,100,(1000000))
%%timeit
df = pd.DataFrame(a)
d = df.groupby(0)
206 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
groupby itself is very fast, but it does not give you the keys. If you already know the keys, you can use the groupby object almost like a dictionary. Usage:
d.get_group(key).index # index part is what you need!
Downside: d.get_group(key) itself takes a non-trivial amount of time:
%timeit d.get_group(10).index
496 µs ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it depends on your application, and on whether you know the keys, whether this approach is worth taking.
If all your values are positive, you can use np.nonzero(np.bincount(a))[0] to get the keys at reasonable speed (1.57 ms ± 78.2 µs for a = np.random.randint(0,1000,(1000000))).
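Putting the two together, a minimal sketch (assuming non-negative integer values, as np.bincount requires):

import numpy as np
import pandas as pd

a = np.array([1, 1, 5, 5, 1])
keys = np.nonzero(np.bincount(a))[0]         # array([1, 5])
g = pd.DataFrame(a).groupby(0)
d = {k: g.get_group(k).index for k in keys}  # values are pandas Index objects of row positions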