Pandas DataFrame: add a column counting past similar events

Time: 2013-09-15 00:38:18

Tags: python numpy pandas

I have a puzzle. This would be easy in Excel, but in pandas, given a DataFrame df:

   |  EventID  |  PictureID  |  Date
0  |  1        |  A          |  2010-01-01
1  |  2        |  A          |  2010-02-01
2  |  3        |  A          |  2010-02-15
3  |  4        |  B          |  2010-01-01
4  |  5        |  C          |  2010-02-01
5  |  6        |  C          |  2010-02-15

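Is there a way to add a new column that counts how many times the same PictureID has had an event recorded in the previous six months? In other words, the number of rows in the DataFrame that have the same PictureID as the given row and a date within the six months before that row's date.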

df['PastSix'] = ???

So the output would look like:

   |  EventID  |  PictureID  |  Date        |  PastSix
0  |  1        |  A          |  2010-01-01  |  0
1  |  2        |  A          |  2010-02-01  |  1
2  |  3        |  A          |  2010-02-15  |  2
3  |  4        |  B          |  2010-01-01  |  0
4  |  5        |  C          |  2010-02-01  |  0
5  |  6        |  C          |  2010-02-15  |  1

1 answer:

Answer 0: (score: 2)

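I don't know how you define six months, so I count events in the previous 183 days instead. The basic idea is to build a running event count per PictureID and look it up at the window boundaries with the asof() method: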

import pandas as pd
import numpy as np
import io

txt = u"""EventID  |  PictureID  |  Date
0        |  A          |  2009-07-01
1        |  A          |  2010-01-01
2        |  A          |  2010-02-01
3        |  A          |  2010-02-15
4        |  B          |  2010-01-01
5        |  C          |  2010-02-01
6        |  C          |  2010-02-15
7        |  A          |  2010-08-01
"""

# A regex separator is only supported by the Python parser engine.
df = pd.read_csv(io.StringIO(txt), sep=r"\s*\|\s*", engine="python", parse_dates=["Date"])

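def f(df):
    # Running count of events in this group, indexed by date (dates must be sorted).
    count = pd.Series(np.arange(1, len(df)+1), index=df["Date"])
    # Window bounds: one day before each event, and 183 days before it.
    prev1day = count.index.shift(-1, freq="D")
    prev6month = count.index.shift(-183, freq="D")
    # asof() returns the running count at or before each bound; NaN means no earlier event.
    result = count.asof(prev1day).fillna(0).values - count.asof(prev6month).fillna(0).values
    return pd.Series(result.astype(int), df.index)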

# group_keys=False keeps the original row index so the result aligns with df.
df["PastSix"] = df.groupby("PictureID", group_keys=False).apply(f)
print(df)

Output:

   EventID PictureID                Date  PastSix
0        0         A 2009-07-01 00:00:00        0
1        1         A 2010-01-01 00:00:00        0
2        2         A 2010-02-01 00:00:00        1
3        3         A 2010-02-15 00:00:00        2
4        4         B 2010-01-01 00:00:00        0
5        5         C 2010-02-01 00:00:00        0
6        6         C 2010-02-15 00:00:00        1
7        7         A 2010-08-01 00:00:00        2
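
If "six months" is meant as calendar months rather than 183 days, a minimal sketch along the following lines might work. It is not part of the answer above: it swaps the asof() lookup for a direct date comparison using pd.DateOffset(months=6), and it assumes the window includes the date exactly six months earlier and excludes the row's own date. The name past_six is purely illustrative, and the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "EventID": [1, 2, 3, 4, 5, 6],
    "PictureID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Illustrative result holder, aligned with the original row index.
past_six = pd.Series(0, index=df.index)

for _, group in df.groupby("PictureID"):
    dates = group["Date"]
    for idx, d in dates.items():
        # Assumption: window is [d - 6 calendar months, d), i.e. the row itself is excluded.
        start = d - pd.DateOffset(months=6)
        past_six.loc[idx] = int(((dates >= start) & (dates < d)).sum())

df["PastSix"] = past_six
print(df)
```

This compares every pair of rows within a group, so it is quadratic in the group size. That is fine for small frames, but the asof() approach should scale better on large ones.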