Find all pairwise differences between elements of a pandas Series

Time: 2018-03-14 14:52:18

Tags: python pandas time-series


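I have a sorted time series my_ts, and I need to find the pairwise differences between all elements of the series (not just between consecutive elements), keeping only those at or below some threshold horizon.

I wrote the following code to do this, but as you can see it relies on itertools, which feels unnecessary in a pandas context.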

from itertools import combinations
import pandas as pd
my_ts = pd.Series(pd.date_range('1/1/2018', periods=6, freq='d'))

def count_gaps(ts, horizon):
    # returns counts of all gaps shorter than horizon
    diffs = ((t2-t1) for (t1, t2) in combinations(ts, 2) if t2-t1<=horizon)
    return pd.Series(diffs).value_counts()

count_gaps(my_ts, horizon=pd.to_timedelta(3, unit='d'))

Any suggestions for a more pandas-idiomatic (and hopefully faster) solution?

2 answers:

Answer 0 (score: 1)

I think you can do this:

s=pd.DataFrame(columns=my_ts,index=my_ts).apply(lambda x : x.name-x.index)

s.mask((s<pd.Timedelta('1 days'))|(s>pd.Timedelta('3 days'))).stack().value_counts()
Out[528]: 
1 days    5
2 days    4
3 days    3
dtype: int64

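Answer 1 (score: 0)

Here is a noticeably faster NumPy solution (in the quick benchmark below, roughly 100x faster than the other answer and roughly 350x faster than the original):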

import numpy as np
import pandas as pd

def count_gaps1(ts, horizon):
    diffs = ts.values[:, np.newaxis] - ts.values[np.newaxis, :]
    idx, c = np.unique(diffs[(diffs > np.timedelta64(0)) & (diffs <= horizon.to_timedelta64())], return_counts=True)
    return pd.Series(c, idx)

my_ts = pd.Series(pd.date_range('1/1/2018', periods=6, freq='d'))
print(count_gaps1(my_ts, horizon=pd.to_timedelta(3, unit='d')))

Output:

1 days    5
2 days    4
3 days    3
dtype: int64
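
To make the broadcasting step in the function above more concrete, here is a minimal sketch on a four-element series. The names values, diffs and days are illustrative only and do not appear in the answer; the point is that diffs[i, j] holds values[i] - values[j], so every ordered pair appears once with each sign, which is why the answer keeps only the strictly positive entries.

```python
import numpy as np
import pandas as pd

# Small sorted series: 2018-01-01 .. 2018-01-04
ts = pd.Series(pd.date_range('1/1/2018', periods=4, freq='D'))
values = ts.to_numpy()  # 1-D datetime64 array, shape (4,)

# Column vector minus row vector broadcasts to a (4, 4) timedelta matrix
diffs = values[:, np.newaxis] - values[np.newaxis, :]

# Convert to whole days purely for readability
days = (diffs / np.timedelta64(1, 'D')).astype(int)
print(days)
```

The matrix is antisymmetric with zeros on the diagonal, so filtering on diffs > 0 selects exactly one copy of each unordered pair with distinct timestamps. Note that the full matrix needs memory proportional to the square of the series length.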

IPython %timeit benchmark:

import itertools
import numpy as np
import pandas as pd

def count_gaps_original(ts, horizon):
    diffs = ((t2-t1) for (t1, t2) in itertools.combinations(ts, 2) if t2-t1<=horizon)
    return pd.Series(diffs).value_counts()

def count_gaps_Wen(ts, horizon):
    # use the ts and horizon arguments rather than the global my_ts and a hard-coded '3 days'
    s = pd.DataFrame(columns=ts,index=ts).apply(lambda x : x.name-x.index)
    return s.mask((s < pd.Timedelta('1 days')) | (s > horizon)).stack().value_counts()

def count_gaps_jdehesa(ts, horizon):
    diffs = ts.values[:, np.newaxis] - ts.values[np.newaxis, :]
    idx, c = np.unique(diffs[(diffs > np.timedelta64(0)) & (diffs <= horizon.to_timedelta64())], return_counts=True)
    return pd.Series(c, idx)

my_ts = pd.Series(pd.date_range('1/1/2018', periods=100, freq='d'))
%timeit count_gaps_original(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 145 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_gaps_Wen(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 44.1 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_gaps_jdehesa(my_ts, horizon=pd.to_timedelta(3, unit='d'))
>>> 409 µs ± 9.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
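
The approaches benchmarked above all examine every pair, so their cost grows quadratically with the length of the series. Because the question states that the series is sorted, one possible alternative (not taken from either answer, and not benchmarked here) is to use np.searchsorted to locate, for each element, the window of later elements that fall within horizon, and to build only those pairs. The function name count_gaps_sorted and its internal variable names are hypothetical. Like the NumPy answer, this sketch ignores zero-length gaps between duplicate timestamps.

```python
import numpy as np
import pandas as pd

def count_gaps_sorted(ts, horizon):
    # Assumes ts is sorted ascending, as stated in the question
    values = ts.to_numpy()
    h = horizon.to_timedelta64()
    # First index whose value is strictly greater than values[i]
    start = np.searchsorted(values, values, side='right')
    # One past the last index whose value is <= values[i] + horizon
    stop = np.searchsorted(values, values + h, side='right')
    counts = np.maximum(stop - start, 0)
    # Flat index arrays covering every (i, j) with start[i] <= j < stop[i]
    left = np.repeat(np.arange(len(values)), counts)
    group_offsets = np.repeat(np.cumsum(counts) - counts, counts)
    right = np.repeat(start, counts) + (np.arange(counts.sum()) - group_offsets)
    diffs = values[right] - values[left]
    return pd.Series(diffs).value_counts().sort_index()

my_ts = pd.Series(pd.date_range('1/1/2018', periods=6, freq='d'))
result = count_gaps_sorted(my_ts, horizon=pd.to_timedelta(3, unit='d'))
print(result)
```

The work and memory here scale with the number of pairs that actually fall within horizon plus the cost of the two binary searches, so any benefit should only show up when horizon is small relative to the span of the series.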