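All,
I have a sorted time series, my_ts, and I need to find all pairwise differences within the series that fall below some threshold, horizon (between every pair of elements, not just consecutive ones).
I wrote the following code to do this, but as you can see it relies on itertools, which feels unnecessary in a pandas environment.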
from itertools import combinations
import pandas as pd

my_ts = pd.Series(pd.date_range('1/1/2018', periods=6, freq='d'))

def count_gaps(ts, horizon):
    # returns counts of all pairwise gaps no longer than horizon
    diffs = ((t2-t1) for (t1, t2) in combinations(ts, 2) if t2-t1<=horizon)
    return pd.Series(diffs).value_counts()

count_gaps(my_ts, horizon=pd.to_timedelta(3, unit='d'))
Any suggestions for a more pandas-idiomatic (and hopefully faster) solution?
Answer 0 (score: 1)
I think you can do this:
s=pd.DataFrame(columns=my_ts,index=my_ts).apply(lambda x : x.name-x.index)
s.mask((s<pd.Timedelta('1 days'))|(s>pd.Timedelta('3 days'))).stack().value_counts()
Out[528]:
1 days    5
2 days    4
3 days    3
dtype: int64
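The snippet above hardcodes the 1-day and 3-day bounds. As a rough sketch of the same "build every pair, mask, count" idea with horizon as a parameter, a cross join can generate the pairs. This assumes pandas 1.2 or later (for how="cross"), which is newer than this answer; the function name count_gaps_cross and the column names start and end are made up for illustration.

```python
import pandas as pd

def count_gaps_cross(ts, horizon):
    # Hypothetical helper: pair every timestamp with every other one.
    left = ts.rename("start").to_frame()
    right = ts.rename("end").to_frame()
    pairs = left.merge(right, how="cross")  # n * n rows
    gaps = pairs["end"] - pairs["start"]
    # Keep each unordered pair once (positive gap) and apply the threshold.
    gaps = gaps[(gaps > pd.Timedelta(0)) & (gaps <= horizon)]
    return gaps.value_counts().sort_index()

my_ts = pd.Series(pd.date_range("2018-01-01", periods=6, freq="D"))
result = count_gaps_cross(my_ts, horizon=pd.Timedelta(days=3))
print(result)
```

Like the snippet above, this materializes all n * n pairs, so it is only reasonable for fairly short series. Unlike the itertools version in the question, it drops zero-length gaps between duplicate timestamps.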
Answer 1 (score: 0)
Here is a noticeably faster NumPy solution (roughly 100x in the quick benchmark below):
import numpy as np
import pandas as pd

def count_gaps1(ts, horizon):
    # n x n matrix of all pairwise differences via broadcasting
    diffs = ts.values[:, np.newaxis] - ts.values[np.newaxis, :]
    # keep strictly positive gaps up to horizon, then count each distinct gap
    idx, c = np.unique(diffs[(diffs > np.timedelta64(0)) & (diffs <= horizon.to_timedelta64())], return_counts=True)
    return pd.Series(c, idx)

my_ts = pd.Series(pd.date_range('1/1/2018', periods=6, freq='d'))
print(count_gaps1(my_ts, horizon=pd.to_timedelta(3, unit='d')))
Output:
1 days    5
2 days    4
3 days    3
dtype: int64
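The broadcasting approach builds a full n x n matrix, which can get large for long series. Since the question says the series is sorted, one possible alternative (not part of this answer, just a sketch) is to walk the lags with Series.diff and stop as soon as a lag yields no gap within horizon: in a sorted series, a larger lag can only produce larger gaps. The name count_gaps_lagged is made up for illustration.

```python
import pandas as pd

def count_gaps_lagged(ts, horizon):
    """Count gaps <= horizon in a sorted datetime Series, one lag at a time."""
    kept = []
    for lag in range(1, len(ts)):
        # ts.diff(lag) is ts[i] - ts[i - lag]; the first `lag` entries are NaT.
        gaps = ts.diff(lag).dropna()
        gaps = gaps[gaps <= horizon]
        if gaps.empty:
            # Sorted input: larger lags cannot give smaller gaps, so stop here.
            break
        kept.append(gaps)
    if not kept:
        return pd.Series(dtype="int64")
    return pd.concat(kept).value_counts().sort_index()

my_ts = pd.Series(pd.date_range("2018-01-01", periods=6, freq="D"))
result = count_gaps_lagged(my_ts, horizon=pd.Timedelta(days=3))
print(result)
```

This keeps only one length-n Series per lag in memory, and the number of lags visited depends on how many neighbours fall within horizon. Note that it keeps zero-length gaps between duplicate timestamps, as the itertools version in the question does, whereas the NumPy version above discards them.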
Benchmark with IPython %timeit:
import itertools
import numpy as np
import pandas as pd

def count_gaps_original(ts, horizon):
    diffs = ((t2-t1) for (t1, t2) in itertools.combinations(ts, 2) if t2-t1<=horizon)
    return pd.Series(diffs).value_counts()

def count_gaps_Wen(ts, horizon):
    # note: the 1-day and 3-day bounds are hardcoded, as in the answer above
    s = pd.DataFrame(columns=ts, index=ts).apply(lambda x : x.name-x.index)
    return s.mask((s < pd.Timedelta('1 days')) | (s > pd.Timedelta('3 days'))).stack().value_counts()

def count_gaps_jdehesa(ts, horizon):
    diffs = ts.values[:, np.newaxis] - ts.values[np.newaxis, :]
    idx, c = np.unique(diffs[(diffs > np.timedelta64(0)) & (diffs <= horizon.to_timedelta64())], return_counts=True)
    return pd.Series(c, idx)

my_ts = pd.Series(pd.date_range('1/1/2018', periods=100, freq='d'))

%timeit count_gaps_original(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 145 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_gaps_Wen(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 44.1 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_gaps_jdehesa(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 409 µs ± 9.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
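%timeit only works inside IPython. If you want to rerun the comparison from a plain Python script, something along these lines with the standard-library timeit module should be close, and it also checks that the two implementations agree before timing them. It only covers the itertools and NumPy versions; count_gaps_numpy is just my label for the NumPy function, and the repeat count of 5 is arbitrary.

```python
import timeit
from itertools import combinations

import numpy as np
import pandas as pd

def count_gaps_original(ts, horizon):
    diffs = ((t2 - t1) for (t1, t2) in combinations(ts, 2) if t2 - t1 <= horizon)
    return pd.Series(diffs).value_counts()

def count_gaps_numpy(ts, horizon):
    diffs = ts.values[:, np.newaxis] - ts.values[np.newaxis, :]
    mask = (diffs > np.timedelta64(0)) & (diffs <= horizon.to_timedelta64())
    idx, c = np.unique(diffs[mask], return_counts=True)
    return pd.Series(c, idx)

my_ts = pd.Series(pd.date_range("2018-01-01", periods=100, freq="D"))
horizon = pd.to_timedelta(3, unit="D")

# Sanity check: both versions should report the same gaps and counts.
a = count_gaps_original(my_ts, horizon).sort_index()
b = count_gaps_numpy(my_ts, horizon).sort_index()
same = list(a.index) == list(b.index) and a.tolist() == b.tolist()
print("results agree:", same)

# Average wall-clock time per call over a few runs.
runs = 5
for name, fn in [("itertools", count_gaps_original), ("numpy", count_gaps_numpy)]:
    seconds = timeit.timeit(lambda fn=fn: fn(my_ts, horizon), number=runs) / runs
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Absolute timings will differ from the numbers above depending on machine and library versions, so treat only the relative ordering as meaningful.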