Comparing values in a Pandas dataframe column using an offset value from another column

Date: 2018-06-21 14:02:32

Tags: python performance pandas dataframe

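My dataframe is:

Time  InvInstance
   5            5
   8            4
   9            3
  19            2
  20            1
   3            3
   8            2
  13            1

The Time variable is sorted within each block, and the InvInstance variable gives the number of rows to the end of the current Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do this with a for loop like this:

import pandas as pd
import numpy as np

df = pd.read_csv("test.csv")

df["10mMark"] = 0
for i in range(1, len(df)):
    # rows remaining to the end of the block, for this row and the previous one
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    # time difference between each row and the last row of its block
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    # .loc avoids chained assignment (assumes the default RangeIndex from read_csv)
    df.loc[i, "10mMark"] = int((m < 10) and (mprev >= 10))

The desired output is:

Time  InvInstance  10mMark
   5            5        0
   8            4        0
   9            3        0
  19            2        1
  20            1        0
   3            3        0
   8            2        1
  13            1        0

To be more specific: the Time column contains 2 sorted blocks of times, and going row by row, the value of InvInstance tells us the distance (in rows) to the end of each block. The question is whether the time difference between a row and the end of its block is less than 10 minutes while it was 10 or more in the previous row. Is it possible to do this without a loop, so that it runs faster?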
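For readers without test.csv, the following is a minimal sketch that rebuilds the sample frame inline and applies the same crossover rule with a plain loop. The helper name mark_crossover and its limit parameter are hypothetical and do not appear in the question; the data values are copied from the table above.

```python
import pandas as pd

# Sample data copied from the question (stands in for test.csv).
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

def mark_crossover(frame, limit=10):
    """Hypothetical helper: loop-based reference for the 10mMark column."""
    time = frame["Time"].to_numpy()
    inv = frame["InvInstance"].to_numpy()
    marks = [0] * len(frame)          # row 0 has no previous row, so it stays 0
    for i in range(1, len(frame)):
        # time left until the last row of the block, for row i and row i-1
        m = time[i + inv[i] - 1] - time[i]
        mprev = time[i - 1 + inv[i - 1] - 1] - time[i - 1]
        marks[i] = int(m < limit and mprev >= limit)
    return marks

df["10mMark"] = mark_crossover(df)
print(df)
```

This only serves as a small reference against which faster variants can be checked; it is not meant to be quicker than the original loop on large frames.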

2 answers:

Answer 0 (score: 4)

I don't see (or know) how to shift a Series / array by a non-scalar (vector) step using internally vectorized Pandas / NumPy methods, but we can use Numba here:

import numpy as np
from numba import jit

@jit
def dyn_shift(s, step):
    # for each position i, pick the element step[i]-1 positions ahead of i
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i+step[i]-1]
    return res

# mask1: this row is less than 10 away from the end of its block
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
# mask2: the previous row was 10 or more away from the end of its block
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)

Result:

In [6]: df
Out[6]:
   Time  InvInstance  10mMark
0     5            5        0
1     8            4        0
2     9            3        0
3    19            2        1
4    20            1        0
5     3            3        0
6     8            2        1
7    13            1        0
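As an aside that is not part of this answer, the same per-row lookup can also be sketched with NumPy integer-array indexing, assuming every InvInstance value points to a row inside the frame. The names end_pos and m below are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

time = df["Time"].to_numpy()
inv = df["InvInstance"].to_numpy()

# Position of the last row of each row's block: i + InvInstance[i] - 1.
end_pos = np.arange(len(df)) + inv - 1

# Time remaining until the end of the block, gathered in one indexing step.
m = pd.Series(time[end_pos] - time, index=df.index)

# Crossover: below 10 on this row, 10 or more on the previous row.
# m.shift() is NaN on the first row, and NaN >= 10 evaluates to False.
df["10mMark"] = ((m < 10) & (m.shift() >= 10)).astype(int)
print(df)
```

This variant needs no extra dependency, but it was not timed in the thread, so no speed claim is made for it here.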

Timing for an 8,000-row DF:

In [13]: df = pd.concat([df] * 10**3, ignore_index=True)

In [14]: df.shape
Out[14]: (8000, 3)

In [15]: %%timeit
    ...: df["10mMark"] = 0
    ...: for i in range(1,len(df)):
    ...:     r = int(df.InvInstance.iloc[i])
    ...:     rprev = int(df.InvInstance.iloc[i-1])
    ...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    ...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    ...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
    ...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %%timeit
    ...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
    ...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
    ...: df['10mMark'] = np.where(mask1 & mask2,1,0)
    ...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Speed-up factor:

In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0

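Answer 1 (score: 3)

Actually, your m is the difference between the time of a row and the time at the end of its 'block', and mprev is the same thing for the previous row (so it is really just m shifted by one). My idea is to create a column containing the time at the end of the block: first identify each block, then merge the frame with the last time of each block obtained from a groupby on the block. Then compute the difference to create a column 'm', and use np.where and shift to finally fill the 10mMark column.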

# a column with an incremental value at each block end (NaN elsewhere)
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back fill so that all rows of a block share the same block value
df['block'] = df['block'].bfill()
# now merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# create column m as a simple difference
df['m'] = df['Time_last'] - df['Time']
# now you can use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns (the positional axis argument is not accepted by recent pandas)
df = df.drop(columns=['block', 'Time_last', 'm'])

The final result before the drop, to show what was created:

   Time  InvInstance  block  Time_last   m  10mMark
0     5            5    1.0         20  15        0
1     8            4    1.0         20  12        0
2     9            3    1.0         20  11        0
3    19            2    1.0         20   1        1
4    20            1    1.0         20   0        0
5     3            3    2.0         13  10        0
6     8            2    2.0         13   5        1
7    13            1    2.0         13   0        0

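where the 10mMark column has the expected result.

It is not as efficient as @MaxU's Numba solution, but with a df of 8000 rows, as he used, I get a speed-up factor of about 350.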
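The block-based idea can also be sketched without the merge, assuming groupby(...).transform('last') is an acceptable substitute for the merge step. This is a variation rather than the code that was timed in this answer, and the variable names block and time_last are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

# Label blocks: a new block starts on the row right after InvInstance == 1.
block = df["InvInstance"].eq(1).shift(fill_value=False).cumsum()

# Broadcast the last Time of each block back onto every row of that block.
time_last = df.groupby(block)["Time"].transform("last")

# Same difference and crossover test as in the answer above.
m = time_last - df["Time"]
df["10mMark"] = ((m < 10) & (m.shift() >= 10)).astype(int)
print(df)
```

Because the intermediate values live in standalone Series, no helper columns need to be dropped afterwards.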