Question

I have a large pandas Series with a float64 index.

e.g.

s = pandas.Series([1,2,3,4,5], index=[1.0,2.0,3.0,4.0,5.0])

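but with 100,000 rows.

I want to pull multiple slices back out into a single subset Series. At the moment I do this by building a list of slices and then concatenating them,

e.g.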

intervals = [(1,2), (4,8)]
s2 = pandas.concat([s.ix[start:end] for start, end in intervals])

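where intervals is a list of typically around 10-20 entries. However, this is SLOW. In fact, this one line accounts for 62% of my program's total execution time, which is about 30 seconds on a small portion of my data (about 1/2 of the full dataset).

Does anyone know a better way?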
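A note for anyone running this on a current pandas: the `.ix` indexer was deprecated in pandas 0.20 and removed in 1.0. A minimal sketch of the same slice-and-concatenate approach using `.loc`, assuming a sorted float index as in the toy example above (the five-row Series stands in for the real 100,000-row data):

```python
import pandas as pd

# Toy data from the question; the real Series is assumed to be much larger.
s = pd.Series([1, 2, 3, 4, 5], index=[1.0, 2.0, 3.0, 4.0, 5.0])
intervals = [(1, 2), (4, 8)]

# .loc slices by label and includes both endpoints, like the original .ix call
# did on a float index. An endpoint missing from the index (8 here) is fine
# as long as the index is sorted.
s2 = pd.concat([s.loc[start:end] for start, end in intervals])
print(s2)
```

This only changes the spelling of the indexer, not the approach, so it should not be expected to fix the speed problem.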

Answer 1

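If you want the values whose index falls inside any of the intervals in the list, you need some clever NumPy array broadcasting to check every index value against each interval (inclusive at both ends, i.e. >= low_end and <= high_end):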

In [158]:
import numpy as np
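def f(a1, a2):
    # a1: index values, shape (n,); a2: intervals, shape (k, 2)
    # (a1 - low) * (a1 - high) <= 0 exactly when low <= a1 <= high
    return (((a1 - a2[:,:,np.newaxis])).prod(1)<=0).any(0)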
In [159]:

f(s.index.values, np.array(intervals))
Out[159]:
array([ True,  True, False,  True,  True], dtype=bool)
In [160]:

%timeit s.ix[f(s.index.values, np.array(intervals))]
1000 loops, best of 3: 212 µs per loop
In [161]:

%timeit s[f(s.index.values, np.array(intervals))]
10000 loops, best of 3: 177 µs per loop
In [162]:

%timeit pandas.concat([s.ix[start: end] for start, end in intervals])
1000 loops, best of 3: 1.64 ms per loop

Result:

1    1
2    2
4    4
5    5
dtype: int64
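The product trick above is compact but a little hard to read. Here is a sketch of the same mask built from explicit comparisons, written so that it runs on current pandas (where `.ix` no longer exists). The function name `interval_mask` is my own, not part of pandas or NumPy, and intervals are assumed to be (low, high) pairs with low <= high:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=[1.0, 2.0, 3.0, 4.0, 5.0])
intervals = [(1, 2), (4, 8)]

def interval_mask(index_values, intervals):
    """Boolean array: True where a label lies inside any closed interval."""
    bounds = np.asarray(intervals, dtype=float)   # shape (k, 2)
    low = bounds[:, 0][:, np.newaxis]             # shape (k, 1)
    high = bounds[:, 1][:, np.newaxis]            # shape (k, 1)
    # Broadcasting (k, 1) against (n,) gives a (k, n) table of tests,
    # one row per interval and one column per index label.
    inside = (index_values >= low) & (index_values <= high)
    # A label is kept if it falls inside at least one interval.
    return inside.any(axis=0)

mask = interval_mask(s.index.values, intervals)
s2 = s[mask]
print(s2)
```

One difference from the concat version to keep in mind: a boolean mask returns each row at most once and in index order, whereas concatenating slices repeats rows when intervals overlap and follows the order of the interval list.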
