Cumulative apply within a window defined by other columns

Time: 2018-11-12 15:25:20

Tags: python pandas

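I am trying to apply a function cumulatively to values that fall within a window defined by the 'start' and 'finish' columns. 'start' and 'finish' define the interval during which a value is "active"; for each row, I want the sum of all values that are active at that row's start.

Here is a brute-force example of what I am after. Is there a more elegant, faster, or more memory-efficient way?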

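import pandas as pd

df = pd.DataFrame(data=[[1,3,100], [2,4,200], [3,6,300], [4,6,400], [5,6,500]],
    columns=['start', 'finish', 'val'])
df['dummy'] = 1  # constant key so the self-merge yields every pair of rows
df = df.merge(df, on=['dummy'], how='left')
# keep pairs where row y is active at row x's start: start_y <= start_x < finish_y
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()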

Initially, df is:

   start  finish  val
0      1       3  100
1      2       4  200
2      3       6  300
3      4       6  400
4      5       6  500

The result I am after is:

1     100
2     300
3     500
4     700
5    1200
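
For reference, the same brute-force pairing can be sketched without the dummy column, assuming pandas 1.2 or later, where `merge` accepts `how='cross'`. The names `pairs` and `active` are illustrative only; the filter is the one from the code above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
    columns=['start', 'finish', 'val'],
)

# Cross join: every row paired with every row (columns get _x / _y suffixes).
pairs = df.merge(df, how='cross')

# Keep pairs where row y is active at row x's start: start_y <= start_x < finish_y.
active = pairs[(pairs['start_y'] <= pairs['start_x'])
               & (pairs['finish_y'] > pairs['start_x'])]

val = active.groupby('start_x')['val_y'].sum()
print(val)
```

This still materializes all n * n pairs, so it has the same memory cost as the dummy-key merge.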

2 answers:

Answer 0: (score: 7)

numba

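import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
  mn = S.min()
  mx = F.max()
  out = np.zeros(mx)  # one slot per integer time point below the largest finish
  for s, f, v in zip(S, F, V):
    out[s:f] += v  # v is active on the half-open interval [s, f)
  return out[mn:]  # drop the slots before the earliest start

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])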
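
To check results without numba installed, the same accumulation can be sketched in plain NumPy. Note that the output is indexed by time point, not by row; it lines up with the rows in this example only because the start values are the consecutive integers 1 to 5. The helper name `active_totals` and the `per_row` lookup are illustrative assumptions, and integer 'start' and 'finish' values are assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
    columns=['start', 'finish', 'val'],
)

def active_totals(S, F, V):
    """Total active value at each integer time point in [S.min(), F.max())."""
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx - mn)          # slot k holds the total at time mn + k
    for s, f, v in zip(S, F, V):
        out[s - mn:f - mn] += v      # v is active on the half-open interval [s, f)
    return out

S, F, V = (df[c].to_numpy() for c in ['start', 'finish', 'val'])
totals = active_totals(S, F, V)

# Look up each row's own start time to get one value per row.
per_row = totals[S - S.min()]
print(per_row)
```

Shifting by `mn` inside the loop avoids allocating unused slots below the earliest start, which matters when the time values are large.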

np.bincount

s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))

array([ 100.,  300.,  500.,  700., 1200.])
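
To see what the one-liner feeds to `np.bincount`, the index and weight arrays can be built in separate steps. This is a sketch assuming integer 'start' and 'finish' values; the names `idx` and `weights` are illustrative, and the shift uses `s.min()` in place of the hard-coded 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
    columns=['start', 'finish', 'val'],
)
s, f, v = [df[col].to_numpy() for col in ['start', 'finish', 'val']]

# One entry per (row, time point) pair: every time point in [start, finish),
# shifted so that the earliest start maps to bin 0.
idx = np.concatenate([np.arange(a, b) for a, b in zip(s, f)]) - s.min()

# Repeat each value once for every time point its interval covers.
weights = np.repeat(v, f - s)

# bincount sums the weights that land in each bin.
totals = np.bincount(idx, weights=weights)
print(totals)
```

Both arrays have one element per covered time point, so memory grows with the total interval length rather than with the number of rows.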

Comprehension

This relies on the index being unique

pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).groupby(level=0).sum()  # same result as .sum(level=0), which pandas 2.0 removed

1     100
2     300
3     500
4     700
5    1200
dtype: int64

Does not rely on the index

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).groupby(level=0).sum()  # same result as .sum(level=0), which pandas 2.0 removed
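
The comprehension expands each row into one entry per time point it covers and then groups by time point. The same idea can be sketched with `DataFrame.explode`, assuming pandas 0.25 or later; the column name `t` and the variable names are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
    columns=['start', 'finish', 'val'],
)

# Attach to each row the list of time points in [start, finish).
covered = [list(range(s, f)) for s, f in zip(df['start'], df['finish'])]

# explode turns each list element into its own row, repeating 'val'.
expanded = df.assign(t=covered).explode('t')

# Sum the values that share a time point.
result = expanded.groupby('t')['val'].sum()
print(result)
```

This version does not depend on the index either, because the grouping key is the exploded column.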

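Answer 1: (score: 6)

Using numpy broadcasting. Unfortunately it is still an O(n * m) solution, but it should be faster than the groupby. So far, Pir's solution has the best performance in my tests.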

s1=df['start'].values
s2=df['finish'].values
# row j counts toward row i when start_j <= start_i < finish_j
np.sum(((s1<=s1[:,None])&(s2>s1[:,None]))*df.val.values,1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)

Some timings

#df=pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop

def merged(df):
    # assign returns a copy, so the caller's frame does not gain a 'dummy' column
    df = df.assign(dummy=1)
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    s1 = df['start'].values
    s2 = df['finish'].values
    # row j counts toward row i when start_j <= start_i < finish_j
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)