Detecting daily recurrence in pandas

Time: 2015-06-10 08:46:24

Tags: python pandas

I have a pandas DataFrame like this:

from datetime import timedelta
import numpy as np
import pandas as pd

# pd.np was removed in pandas 2.0, so NumPy is imported directly.
df = pd.DataFrame({'Team': np.random.choice(['CHI', 'DAL'], 20),
                   'Date': pd.date_range('2014-11-01', '2014-11-20')})
df.drop(14, inplace=True)
df
    Date    Team
0   2014-11-01  DAL
1   2014-11-02  CHI
2   2014-11-03  CHI
3   2014-11-04  DAL
4   2014-11-05  CHI
5   2014-11-06  CHI
6   2014-11-07  DAL
7   2014-11-08  DAL
8   2014-11-09  DAL
9   2014-11-10  DAL
10  2014-11-11  CHI
11  2014-11-12  CHI
12  2014-11-13  CHI
13  2014-11-14  CHI
# Note that 2014-11-15 is missing here.
15  2014-11-16  CHI
16  2014-11-17  CHI
17  2014-11-18  CHI
18  2014-11-19  CHI
19  2014-11-20  DAL

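I want to find the number of consecutive days each team has played.

2 answers:

Answer 0 (score: 1)

The following should be better optimized. Basically, I group by team and apply a boolean test for whether the difference between consecutive dates equals a timedelta of one day.

Then, where this is True, I apply cumsum to it and add 1.

Then I fill in the NaN values: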

In [51]:
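# DataFrame.sort was removed; sort_values is its replacement.
df = df.sort_values('Date')
# True where a team's previous game was exactly one day earlier.
df['consec_days'] = df.groupby('Team')['Date'].diff() == pd.Timedelta(days=1)
# Running count of the True rows within each team, plus 1.
df.loc[df['consec_days'], 'n_days'] = df.loc[df['consec_days']].groupby('Team')['consec_days'].cumsum() + 1
# Rows that do not continue a streak count as day 1.
df['n_days'] = df['n_days'].fillna(1)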
df

Out[51]:
            Date Team consec_days  n_days
index                                    
0     2014-11-01  DAL       False       1
1     2014-11-02  CHI       False       1
2     2014-11-03  DAL       False       1
3     2014-11-04  CHI       False       1
4     2014-11-05  DAL       False       1
5     2014-11-06  DAL        True       2
6     2014-11-07  DAL        True       3
7     2014-11-08  DAL        True       4
8     2014-11-09  CHI       False       1
9     2014-11-10  DAL       False       1
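One caveat with the approach above: the cumsum runs over all True rows of a team, so it does not appear to restart after a gap (a second streak by the same team would continue from the first streak's count). A minimal sketch of a variant that restarts the count, assuming one row per team per date, follows. The sample data and the names `new_streak` and `streak_id` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: DAL plays Nov 5-8, skips Nov 9, then plays Nov 10-11.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2014-11-05', '2014-11-06', '2014-11-07',
                            '2014-11-08', '2014-11-10', '2014-11-11']),
    'Team': ['DAL'] * 6,
})

df = df.sort_values(['Team', 'Date'])
# True where a row does NOT continue the same team's game from the day before.
# The first row of each team has a NaT diff, which also compares as "not equal".
new_streak = df.groupby('Team')['Date'].diff() != pd.Timedelta(days=1)
# Every break starts a new streak; the running sum labels each row's streak.
streak_id = new_streak.cumsum()
# Position of each row within its streak, starting at 1.
df['n_days'] = df.groupby(streak_id).cumcount() + 1
print(df)
```

Here the two games after the gap get counts 1 and 2 instead of continuing from 4.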

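Answer 1 (score: 0)

To my knowledge, the only way to do this is to iterate. This is not as efficient as a vectorized function, but since you need to pass information between n rows, I don't think vectorization is possible.

So I came up with this algorithm: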

from datetime import datetime

n_days_in_row_played = 1
last_team = ""
# Sentinel only: the first row always resets on the team check, so it is never subtracted.
last_date = datetime(1, 1, 1)
n_days = []
for i, (date, team) in df[['Date', 'Team']].iterrows():
    # A different team or a gap of more than one day breaks the streak.
    if team != last_team or (date - last_date).days > 1:
        last_team = team
        n_days_in_row_played = 1
    else:
        n_days_in_row_played += 1
    n_days.append(n_days_in_row_played)
    last_date = date
df['n_days'] = n_days
df
    Date    Team    n_days
0   2014-11-01  DAL     1
1   2014-11-02  CHI     1
2   2014-11-03  CHI     2
3   2014-11-04  DAL     1
4   2014-11-05  CHI     1
5   2014-11-06  CHI     2
6   2014-11-07  DAL     1
7   2014-11-08  DAL     2
8   2014-11-09  DAL     3
9   2014-11-10  DAL     4
10  2014-11-11  CHI     1
11  2014-11-12  CHI     2
12  2014-11-13  CHI     3
13  2014-11-14  CHI     4
# Skipped day resets the count.
15  2014-11-16  CHI     1
16  2014-11-17  CHI     2
17  2014-11-18  CHI     3
18  2014-11-19  CHI     4
19  2014-11-20  DAL     1

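We keep track of the last team and the date of the last game.

For each new row, we check whether the team has changed or more than one day has passed. If so, the streak is broken and we reset the counter.

Otherwise we add one to the count of consecutive days.

Finally, we append the streak length to a list that we can attach to the original DataFrame as a column.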
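For comparison, the same reset rule can probably also be written without an explicit loop. This is a sketch rather than a tested replacement: it rebuilds the table from the question by hand (so the index runs 0 to 18 instead of skipping 14), and the names `team_changed`, `gap` and `streak_id` are introduced here for illustration.

```python
import pandas as pd

# Data copied from the question's table; 2014-11-15 is missing.
dates = pd.date_range('2014-11-01', '2014-11-20').delete(14)
teams = (['DAL', 'CHI', 'CHI', 'DAL', 'CHI', 'CHI']
         + ['DAL'] * 4 + ['CHI'] * 8 + ['DAL'])
df = pd.DataFrame({'Date': dates, 'Team': teams})

# Same two reset conditions as the loop: the team differs from the previous
# row (the first row is compared with '', like last_team = ""), or more than
# one day has passed since the previous row.
team_changed = df['Team'] != df['Team'].shift(fill_value='')
gap = df['Date'].diff() > pd.Timedelta(days=1)
# Each reset starts a new streak; the running sum gives every row a streak label.
streak_id = (team_changed | gap).cumsum()
# Position within the streak, starting at 1.
df['n_days'] = df.groupby(streak_id).cumcount() + 1
print(df)
```

On this data the resulting column matches the n_days values shown in the table above, including the reset after the skipped day.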