Pandas resample is missing rows

Time: 2018-07-11 11:59:59

Tags: python pandas resampling

I have a DataFrame that I want to resample by week:

df = 

         Date Game_Mode  Count
0  2008-11-30         b      1
1  2009-07-03         b      1
2  2009-07-12         b      1
3  2009-07-18         b      1
4  2009-10-02         c      1
5  2009-10-21         a      1
6  2009-10-22         b      1
7  2010-01-29         b      1
8  2010-01-31         b      1
9  2010-02-28         a      1
10 2010-03-28         a      1
11 2010-04-16         a      1
12 2010-05-09         a      1
13 2010-07-07         a      1
14 2010-09-16         e      1
15 2010-10-26         e      1
16 2010-12-16         e      1
17 2010-12-22         e      1
18 2011-07-20         e      1
19 2011-08-23         e      1

df['Date'][0]
Timestamp('2008-11-30 00:00:00')

I resample category 'a' weekly (and do the same for every other category).

week = df[df['Game_Mode'] == 'a'].set_index('Date', drop=True).resample('1W-TUE').sum()['Count'].reset_index()
# wc.dropna(subset=['Count'], inplace=True)
week.reset_index(drop=True, inplace=True)
week['Date_Week_{}'.format('a')] = week['Date']
week['Index_Col_{}'.format('a')] = week.index + 1
week.rename(columns={'Count':'Count_{}'.format('a')}, inplace=True)

The weekly resample:

week

         Date  Count_a Date_Week_a  Index_Col_a
0  2009-10-27      1.0  2009-10-27            1
1  2009-11-03      NaN  2009-11-03            2
2  2009-11-10      NaN  2009-11-10            3
3  2009-11-17      NaN  2009-11-17            4
4  2009-11-24      NaN  2009-11-24            5
5  2009-12-01      NaN  2009-12-01            6
6  2009-12-08      NaN  2009-12-08            7
7  2009-12-15      NaN  2009-12-15            8
8  2009-12-22      NaN  2009-12-22            9
9  2009-12-29      NaN  2009-12-29           10
10 2010-01-05      NaN  2010-01-05           11
11 2010-01-12      NaN  2010-01-12           12
12 2010-01-19      NaN  2010-01-19           13
13 2010-01-26      NaN  2010-01-26           14
14 2010-02-02      NaN  2010-02-02           15
15 2010-02-09      NaN  2010-02-09           16
16 2010-02-16      NaN  2010-02-16           17
17 2010-02-23      NaN  2010-02-23           18
18 2010-03-02      1.0  2010-03-02           19
19 2010-03-09      NaN  2010-03-09           20
20 2010-03-16      NaN  2010-03-16           21
21 2010-03-23      NaN  2010-03-23           22
22 2010-03-30      1.0  2010-03-30           23
23 2010-04-06      NaN  2010-04-06           24
24 2010-04-13      NaN  2010-04-13           25
25 2010-04-20      1.0  2010-04-20           26
26 2010-04-27      NaN  2010-04-27           27
27 2010-05-04      NaN  2010-05-04           28
28 2010-05-11      1.0  2010-05-11           29
29 2010-05-18      NaN  2010-05-18           30
30 2010-05-25      NaN  2010-05-25           31
31 2010-06-01      NaN  2010-06-01           32
32 2010-06-08      NaN  2010-06-08           33
33 2010-06-15      NaN  2010-06-15           34
34 2010-06-22      NaN  2010-06-22           35
35 2010-06-29      NaN  2010-06-29           36
36 2010-07-06      NaN  2010-07-06           37
37 2010-07-13      1.0  2010-07-13           38

Here is my problem. I am losing all dates from 2008-12-30 to 2009-10-27, and also from 2010-07-13 to 2011-08-23. How am I losing these during the resample?

I would like to end up with:

    week

         Date  Count_e Date_Week_e  Index_Col_e
   2008-12-02      NaN  2008-12-02            1
   2008-12-09      NaN  2008-12-09            2
          ...  # All weeks before 2009-10-27
          ...
          ... 
   2009-10-27      1.0  2009-10-27            X
   2009-11-03      NaN  2009-11-03            Y
          ...  
          ...
          ...
          # Standard resample in this period
   2010-07-06      NaN  2010-07-06           Z
   2010-07-13      1.0  2010-07-13           I
          ...
          ...
          ... # All weeks after 2010-07-13 up to:
   2011-08-23       NaN 2011-08-23           J

1 answer:

Answer 0: (score: 1)

The problem is caused by this filter:

df[df['Game_Mode'] == 'a']

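If you select only the rows of df where Game_Mode is a (or any other single mode), you throw away the overall start and end dates. resample only spans the first to the last date of the rows it is given.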
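To make the cause concrete, here is a minimal sketch on a four-row stand-in for the question's df (the rows are a subset of the original data; the variable names only_a, a_span and weekly_a are made up for illustration). It shows that the weekly bins run only from the first to the last date of the filtered rows:

```python
import pandas as pd

# Small stand-in for the question's df (a subset of its rows).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-11-30", "2009-10-21", "2010-07-07", "2011-08-23"]),
    "Game_Mode": ["b", "a", "a", "e"],
    "Count": [1, 1, 1, 1],
})

only_a = df[df["Game_Mode"] == "a"]

# Date range of the whole frame versus the range of category 'a' alone.
full_span = (df["Date"].min(), df["Date"].max())
a_span = (only_a["Date"].min(), only_a["Date"].max())

# resample() builds its bins from the index it receives, so the weeks
# before the first 'a' row and after the last 'a' row never exist.
weekly_a = only_a.set_index("Date").resample("W-TUE")["Count"].sum()

print("whole frame:", full_span)
print("category a :", a_span)
print("weekly bins:", weekly_a.index.min(), "to", weekly_a.index.max())
```

The bins start at 2009-10-27 and stop at 2010-07-13, which matches the truncated output in the question.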

What you can do is create an empty DataFrame that has the same dates but is filled with NaN. For example:

import numpy as np
import pandas as pd

# Same dates as df, every row labelled 'a', counts left empty
temp = pd.DataFrame({'Date' : df['Date'],'Game_Mode' : 'a', 'Count': np.nan})

This gives:

          Date Game_Mode  Count
0   2008-11-30         a    NaN
1   2009-07-03         a    NaN
2   2009-07-12         a    NaN
3   2009-07-18         a    NaN
4   2009-10-02         a    NaN
5   2009-10-21         a    NaN
6   2009-10-22         a    NaN
7   2010-01-29         a    NaN
8   2010-01-31         a    NaN
9   2010-02-28         a    NaN
10  2010-03-28         a    NaN
11  2010-04-16         a    NaN
12  2010-05-09         a    NaN
13  2010-07-07         a    NaN
14  2010-09-16         a    NaN
15  2010-10-26         a    NaN
16  2010-12-16         a    NaN
17  2010-12-22         a    NaN
18  2011-07-20         a    NaN
19  2011-08-23         a    NaN

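Then update it with the existing data (oddly, this cannot be chained inline, since update modifies temp in place and returns None):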

temp.update(df[df['Game_Mode']=='a'])
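A short sketch of why the call cannot be chained, again on a small stand-in for df. DataFrame.update works in place and returns None, so its result cannot be fed into another method. The second half is an alternative that is not part of the answer: it builds the same frame in one expression with assign and Series.where (the name inline is hypothetical).

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-11-30", "2009-10-21", "2011-08-23"]),
    "Game_Mode": ["b", "a", "e"],
    "Count": [1, 1, 1],
})

# NaN-filled frame with every date, as in the answer.
temp = pd.DataFrame({"Date": df["Date"], "Game_Mode": "a", "Count": np.nan})

# update() aligns on the index, overwrites matching cells in place,
# and returns None, so nothing can be chained after it.
result = temp.update(df[df["Game_Mode"] == "a"])
print(result)
print(temp)

# Alternative (not from the answer): keep Count only where the row
# belongs to category 'a'; all other rows become NaN.
inline = df.assign(
    Game_Mode="a",
    Count=df["Count"].where(df["Game_Mode"] == "a"),
)
print(inline)
```

Both temp and inline keep every date from df, with a count only on the rows that originally belonged to category a.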

This gives:

          Date Game_Mode  Count
0   2008-11-30         a    NaN
1   2009-07-03         a    NaN
2   2009-07-12         a    NaN
3   2009-07-18         a    NaN
4   2009-10-02         a    NaN
5   2009-10-21         a    1.0
6   2009-10-22         a    NaN
7   2010-01-29         a    NaN
8   2010-01-31         a    NaN
9   2010-02-28         a    1.0
10  2010-03-28         a    1.0
11  2010-04-16         a    1.0
12  2010-05-09         a    1.0
13  2010-07-07         a    1.0
14  2010-09-16         a    NaN
15  2010-10-26         a    NaN
16  2010-12-16         a    NaN
17  2010-12-22         a    NaN
18  2011-07-20         a    NaN
19  2011-08-23         a    NaN

If you then resample:

# Select Count before summing so the string column Game_Mode is not aggregated
temp.set_index('Date').resample('1W-TUE')['Count'].sum()

You get all the dates (for me, .sum() returns 0.0 for the empty weeks rather than NaN ...):

Date
2008-12-02   0.0
2008-12-09   0.0
2008-12-16   0.0
2008-12-23   0.0
              ..
2011-08-02   0.0
2011-08-09   0.0
2011-08-16   0.0
2011-08-23   0.0
Freq: W-TUE, Name: Count, Length: 143, dtype: float64
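The zeros above come from sum() treating an empty or all-NaN week as 0 in recent pandas versions. If NaN is wanted, as in the question's desired output, passing min_count=1 to sum() should do it. The sketch below also shows a different route from the one in this answer: resample the filtered rows, then reindex onto the weekly bins of the whole frame. It assumes a small stand-in for df, and the names all_weeks and weekly_a are made up for illustration.

```python
import pandas as pd

# Small stand-in for the question's df (a subset of its rows).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-11-30", "2009-10-21", "2010-07-07", "2011-08-23"]),
    "Game_Mode": ["b", "a", "a", "e"],
    "Count": [1, 1, 1, 1],
})

# Weekly bins covering the whole frame, regardless of category.
all_weeks = df.set_index("Date").resample("W-TUE").size().index

weekly_a = (
    df[df["Game_Mode"] == "a"]
    .set_index("Date")
    .resample("W-TUE")["Count"]
    .sum(min_count=1)    # weeks with no rows become NaN instead of 0
    .reindex(all_weeks)  # add the weeks outside category a's own range as NaN
)

print(weekly_a.head())
print(len(weekly_a))
```

On this stand-in the result runs from 2008-12-02 to 2011-08-23, the same span as the output above, with NaN for every week that has no category a rows.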