Days since last occurrence in Pandas DataFrame?

Time: 2017-06-07 19:04:09

Tags: python performance date pandas numpy

Let's say I have a Pandas DataFrame df:

Date      Value
01/01/17  0
01/02/17  0
01/03/17  1
01/04/17  0
01/05/17  0
01/06/17  0
01/07/17  1
01/08/17  0
01/09/17  0

For each row, I would like to efficiently calculate the number of days since the last occurrence of Value=1, so that df becomes:

Date      Value    Last_Occurence
01/01/17  0        NaN
01/02/17  0        NaN
01/03/17  1        0
01/04/17  0        1
01/05/17  0        2
01/06/17  0        3
01/07/17  1        0
01/08/17  0        1
01/09/17  0        2

I could do a loop, but it seems very inefficient for extremely large datasets and probably isn't right anyway.

4 answers:

Answer 0 (score: 6)

Here's a NumPy approach -

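import numpy as np

def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    # Start from all ones so that a cumulative sum counts steps.
    out = np.ones(a.size, dtype=int)
    # Positions at which the trigger value occurs.
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        # No trigger anywhere: every entry is invalid.
        return np.full(a.size, invalid_specifier)
    else:
        # Offsets that make the running sum restart at start_val on each trigger.
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        # Entries before the first trigger have no previous occurrence.
        out[:idx[0]] = invalid_specifier
        return out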

A few sample runs on array data, showing usage for various combinations of trigger and start values:

In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])

In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
     ...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
     ...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
     ...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
     ...: 

In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]: 
array([[ 0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0],
       [-1,  0,  0,  0,  1,  2,  0,  1,  2,  0,  0,  0,  0,  0,  1],
       [-1,  1,  1,  1,  2,  3,  1,  2,  3,  1,  1,  1,  1,  1,  2],
       [ 0,  1,  2,  3,  0,  0,  1,  0,  0,  1,  2,  3,  4,  5,  0],
       [ 1,  2,  3,  4,  1,  1,  2,  1,  1,  2,  3,  4,  5,  6,  1]])

Using it to solve our case:

df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Sample output -

In [181]: df
Out[181]: 
       Date  Value  Last_Occurence
0  01/01/17      0              -1
1  01/02/17      0              -1
2  01/03/17      1               0
3  01/04/17      0               1
4  01/05/17      0               2
5  01/06/17      0               3
6  01/07/17      1               0
7  01/08/17      0               1
8  01/09/17      0               2
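The output above marks rows before the first occurrence with -1, whereas the question asks for NaN. As a cross-check, here is a minimal sketch of a different vectorized formulation (not the function from this answer) that uses np.maximum.accumulate to track the position of the most recent trigger and returns NaN where none exists yet. The helper name since_last_trigger is hypothetical, and the small DataFrame simply reproduces the question's Value column.

```python
import numpy as np
import pandas as pd

def since_last_trigger(values, trigger_val=1):
    # Hypothetical helper: rows elapsed since the most recent trigger_val.
    values = np.asarray(values)
    positions = np.arange(values.size)
    # Position of the latest trigger at or before each row (-1 if none so far).
    last_idx = np.maximum.accumulate(
        np.where(values == trigger_val, positions, -1)
    )
    out = (positions - last_idx).astype(float)
    # No earlier occurrence: report NaN instead of a sentinel integer.
    out[last_idx < 0] = np.nan
    return out

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})
df["Last_Occurence"] = since_last_trigger(df["Value"].to_numpy())
print(df)
```

Because NaN forces a float column, this trades the integer dtype of the -1 version for output that matches the question's expected table.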

Runtime test

Approaches -

# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                                    cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Timings -

In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])

In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop

In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop

In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])

In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop

In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop

Answer 1 (score: 2)

Let's try this using cumsum, cumcount, and groupby:

mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)

Output:

       Date  Value  Last_Occurance
0  01/01/17      0             NaN
1  01/02/17      0             NaN
2  01/03/17      1             0.0
3  01/04/17      0             1.0
4  01/05/17      0             2.0
5  01/06/17      0             3.0
6  01/07/17      1             0.0
7  01/08/17      0             1.0
8  01/09/17      0             2.0
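To see why the groupby one-liner works, it may help to look at the intermediate series one at a time. The sketch below rebuilds the question's Value column and splits the expression into named steps; the variable names is_hit, group_id, and counter are illustrative only, and the mask is written as group_id > 0, which should be equivalent to the replace(0, False) trick for this 0/1 data.

```python
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

# True on rows where the event occurs.
is_hit = df["Value"].astype(bool)

# Running count of events: every row after a hit shares that hit's group id.
group_id = is_hit.cumsum()

# Position of each row within its group, i.e. rows since the group started.
counter = df.groupby(group_id).cumcount()

# Group 0 holds the rows before the first hit, so blank those out as NaN.
result = counter.where(group_id > 0)

print(pd.DataFrame({"group_id": group_id, "counter": counter, "result": result}))
```

Each hit starts a new group, so the within-group counter is exactly the number of rows since the last hit.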

Answer 2 (score: 1)

You don't have to update the value of last at every step of the for loop. Initialize the variable outside the loop

last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last

and update it only when a 1 occurs in the Value column.

Note that no matter which method you choose, iterating over the entire table once is unavoidable.
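The single-pass loop described in this answer can be sketched in self-contained form as follows. This variant is an assumption on my part rather than the answer's exact code: it iterates over a plain NumPy array and assigns the column once at the end, which avoids a df.loc write per row. The sample data reproduces the question's Value column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

values = df["Value"].to_numpy()
result = np.full(len(values), np.nan)

last = np.nan  # row position of the most recent 1; NaN until one is seen
for i, v in enumerate(values):
    if v == 1:
        last = i
    # Subtracting NaN yields NaN, so rows before the first 1 stay NaN.
    result[i] = i - last

df["Last_Occurence"] = result
print(df)
```

Like the original, this counts rows rather than calendar days, so it matches the expected output only when the dates are consecutive.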

Answer 3 (score: 1)

You can use argmax:

df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]: 
0    0
1    0
2    0
3    1
4    2
5    3
6    0
7    1
8    2
dtype: int64

If the first two rows must be NaN, use:

df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
                   if 1 in df.iloc[x.name::-1].Value.values \
                   else np.nan,axis=1)
Out[86]: 
0    NaN
1    NaN
2    0.0
3    1.0
4    2.0
5    3.0
6    0.0
7    1.0
8    2.0
dtype: float64
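To illustrate what the lambda computes for a single row, the sketch below uses a plain Python list in place of the DataFrame column (an assumption made to keep it self-contained). Slicing with [row::-1] walks backwards from the current row to the start, and np.argmax returns the position of the first maximum, which is the distance back to the nearest 1.

```python
import numpy as np

values = [0, 0, 1, 0, 0, 0, 1, 0, 0]

row = 5
# Rows 5, 4, 3, 2, 1, 0 in that order.
reversed_prefix = values[row::-1]
# First index holding the maximum (1), i.e. how far back the last 1 is.
distance = int(np.argmax(reversed_prefix))
print(reversed_prefix, distance)

# Before any 1 has occurred the slice is all zeros, and argmax falls back
# to index 0, which is why the unguarded version reports 0 for early rows.
early = int(np.argmax(values[1::-1]))
print(early)
```

Since a new reversed slice is built for every row, the total work grows quadratically with the number of rows, so this approach suits small frames better than the large datasets mentioned in the question.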