In pandas 0.20 there is an interesting new API called IntervalIndex, which allows you to create an index of intervals.
Given some sample data:
data = [(893.1516130000001, 903.9187099999999),
(882.384516, 893.1516130000001),
(817.781935, 828.549032)]
You can create the index like this:
idx = pd.IntervalIndex.from_tuples(data)
print(idx)
IntervalIndex([(893.151613, 903.91871], (882.384516, 893.151613], (817.781935, 828.549032]]
closed='right',
dtype='interval[float64]')
An interesting property of Interval objects is that you can perform containment checks with in:
print(idx[-1])
Interval(817.78193499999998, 828.54903200000001, closed='right')
print(820 in y[-1])
True
print(1000 in y[-1])
False
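An Interval also exposes its bounds directly as attributes; a quick sketch:

```python
import pandas as pd

# A single right-closed interval, matching the last tuple above
iv = pd.Interval(817.781935, 828.549032, closed='right')
print(iv.left)     # 817.781935
print(iv.right)    # 828.549032
print(820 in iv)   # True
print(1000 in iv)  # False
```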
I wonder how to apply this operation to the entire index. For example, given some number 900, how can I retrieve a boolean mask of the intervals this number falls into?
I can come up with:
m = [900 in y for y in idx]
print(m)
[True, False, False]
Is there a better way?
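For reference, newer pandas versions (0.25 and later, i.e. after the 0.20 release this question targets) changed IntervalIndex.contains to work elementwise, returning exactly this boolean mask; a sketch assuming pandas >= 0.25:

```python
import pandas as pd

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
idx = pd.IntervalIndex.from_tuples(data)

# Elementwise containment check, one bool per interval (pandas >= 0.25)
mask = idx.contains(900)
print(mask.tolist())  # [True, False, False]
```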
Answer 0 (score: 15)
If you are interested in performance, an IntervalIndex is optimized for searching. Searching with .get_loc or .get_indexer uses an internally built IntervalTree (like a binary tree), which is constructed on first use.
In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)
In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# construct tree and search
In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# subsequently
In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# for a single indexer you can do even better (note that this is
# dipping into the impl a bit)
In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that .get_loc() returns an indexer (which is actually more useful than a boolean array, though the two are interconvertible).
In [38]: idx.map(lambda x: 900 in x)
...:
Out[38]:
Index([ True, False, False, True, False, False, True, False, False, True,
...
False, True, False, False, True, False, False, True, False, False], dtype='object', length=30000)
In [39]: idx.get_loc(900)
...:
Out[39]: array([29997, 9987, 10008, ..., 19992, 19989, 0])
The returned boolean array can be converted to an array of indexers:
In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
Out[5]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
which is what .get_loc() and .get_indexer() return:
In [6]: np.sort(idx.get_loc(900))
Out[6]: array([ 0, 3, 6, ..., 29991, 29994, 29997])
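Going the other way, from an indexer back to a boolean mask, is a one-liner with plain NumPy (the positions and length here are small hypothetical values, not the 30000-element index from the timings above):

```python
import numpy as np

n = 10                     # hypothetical index length
loc = np.array([0, 3, 6])  # hypothetical indexer, e.g. what .get_loc() returns
mask = np.zeros(n, dtype=bool)
mask[loc] = True           # indexer -> boolean mask
print(np.flatnonzero(mask).tolist())  # back to an indexer: [0, 3, 6]
```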
Answer 1 (score: 3)
You can use map:
idx.map(lambda x: 900 in x)
#Index([True, False, False], dtype='object')
Timings:
%timeit [900 in y for y in idx]
#100000 loops, best of 3: 3.76 µs per loop
%timeit idx.map(lambda x: 900 in x)
#10000 loops, best of 3: 48.7 µs per loop
%timeit map(lambda x: 900 in x, idx)
#100000 loops, best of 3: 4.95 µs per loop
Obviously the comprehension is the fastest, but the builtin map does not fall too far behind.
The results even out when we introduce more data, 10K times more data to be exact:
%timeit [900 in y for y in idx]
#10 loops, best of 3: 26.8 ms per loop
%timeit idx.map(lambda x: 900 in x)
#10 loops, best of 3: 30 ms per loop
%timeit map(lambda x: 900 in x, idx)
#10 loops, best of 3: 29.5 ms per loop
As we can see, the builtin map comes very close to .map(), so let's see what happens with 10 times even more data:
%timeit [900 in y for y in idx]
#1 loop, best of 3: 270 ms per loop
%timeit idx.map(lambda x: 900 in x)
#1 loop, best of 3: 299 ms per loop
%timeit map(lambda x: 900 in x, idx)
#1 loop, best of 3: 291 ms per loop
Conclusion:
The comprehension is the winner, though the margin narrows on large amounts of data.
Answer 2 (score: 3)
If you are looking for speed, you can use the left and right attributes of idx, i.e. get the lower and upper bound of each range, then check whether the number lies between the bounds, i.e.
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
or
(900 > idx.left) & (900 <= idx.right)
array([ True, False, False])
For small data:
%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
100000 loops, best of 3: 11.26 µs per loop
%%timeit
[900 in y for y in idx]
100000 loops, best of 3: 9.26 µs per loop
For large data:
idx = pd.IntervalIndex.from_tuples(data*10000)
%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
10 loops, best of 3: 29.2 ms per loop
%%timeit
[900 in y for y in idx]
10 loops, best of 3: 64.6 ms per loop
This method beats your solution for large data.
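The bound comparison can be wrapped in a small helper (interval_mask is a name chosen here for illustration); note the strict > on the left and <= on the right match the default closed='right' intervals:

```python
import pandas as pd

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
idx = pd.IntervalIndex.from_tuples(data)  # closed='right' by default

def interval_mask(idx, x):
    # Vectorized containment test for a right-closed IntervalIndex
    return (x > idx.left) & (x <= idx.right)

print(interval_mask(idx, 900).tolist())   # [True, False, False]
print(interval_mask(idx, 1000).tolist())  # [False, False, False]
```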
Answer 3 (score: 0)
Using NumPy:
import numpy as np
data = [(893.1516130000001, 903.9187099999999),
(882.384516, 893.1516130000001),
(817.781935, 828.549032)]
q = 900
# The next line broadcasts q and tells whether q lies within the intervals/ranges defined in data (using NumPy)
np.logical_xor(*(np.array(data) - q > 0).transpose())
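One caveat: because both subtractions are compared with a strict > 0, this XOR trick effectively treats each interval as closed on the left and open on the right, the opposite of the closed='right' IntervalIndex above; a sketch of the boundary behaviour:

```python
import numpy as np

data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]

def xor_mask(q):
    # True where exactly one of (left - q, right - q) is positive,
    # i.e. q lies strictly between the bounds (or equals the left bound)
    return np.logical_xor(*(np.array(data) - q > 0).transpose())

print(xor_mask(900).tolist())  # [True, False, False]
# At the shared endpoint 893.151613..., the XOR trick matches the interval
# that *starts* there ([left, right) semantics), unlike closed='right':
print(xor_mask(893.1516130000001).tolist())  # [True, False, False]
```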