Pandas resample on a DataFrame returns only two fields instead of four

Asked: 2014-04-03 16:58:41

Tags: python pandas

I have the following table in a DataFrame df:

date         val1   val2    user_id  val3      val4    val5    val6
01/01/2011  1   100 3    sterling  100     3       euro
01/02/2013  20  8        sterling  12      15      euro
01/07/2012      19  57   sterling  9       6       euro     
01/11/2014  3100    49  6        sterling  15      3       euro
21/12/2012          240  sterling  240     30      euro 
14/09/2013      21  63   sterling  34      23      euro         
01/12/2013  3200    51  20       sterling  93      56      euro

The code used to build the table above is:

import pandas as pd

myheaders= ['date','val1', 'val1','val2', 'val3','val4','user_id','val5','val6']
df = pd.read_csv('mytest.csv', names = myheaders, header = False, parse_dates=True, dayfirst=True)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.loc[:,['date','user_id','val1','val2','val3','val4', 'val5', 'val6']]
df['date'] = pd.to_datetime(df['date'], dayfirst=True) 
df1 = df.pivot('date', 'user_id')

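However, I would like to know why, when I add the statement df2 = df1.resample('M') at the end of the code above, I get a DataFrame df2 that looks like this (field names only):

        val1  val5

user_id date

instead of something like:

        val1  val2  val3  val4  val5  val6

user_id date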

Thanks in advance for your help.

1 Answer:

Answer 0 (score: 0):

If you have a DatetimeIndex, you can resample within a groupby:

In [11]: df
Out[11]:
        date  val1  val2  user_id  val3  val4  val5  val6
0 2011-01-01     1   100        3     5   100     3     5
1 2013-01-02    20     8        6    12    15     3   NaN
2 2012-01-07    19    57       10     9     6     6   NaN
3 2014-01-11  3100    49        6    12    15     3   NaN
4 2012-12-21   240    30      240    30   NaN   NaN   NaN
5 2013-09-14    21    63       90    34    23     6   NaN
6 2013-01-12  3200    51       20    50    93    56   NaN

In [12]: df2 = df.set_index('date')  # now you have a DatetimeIndex

In [13]: df2
Out[13]:
            val1  val2  user_id  val3  val4  val5  val6
date
2011-01-01     1   100        3     5   100     3     5
2013-01-02    20     8        6    12    15     3   NaN
2012-01-07    19    57       10     9     6     6   NaN
2014-01-11  3100    49        6    12    15     3   NaN
2012-12-21   240    30      240    30   NaN   NaN   NaN
2013-09-14    21    63       90    34    23     6   NaN
2013-01-12  3200    51       20    50    93    56   NaN

In [14]: df2.groupby('user_id').resample('M').dropna(how='all')
Out[14]:
                    val1  val2  user_id  val3  val4  val5  val6
user_id date
3       2011-01-31     1   100        3     5   100     3     5
6       2013-01-31    20     8        6    12    15     3   NaN
        2014-01-31  3100    49        6    12    15     3   NaN
10      2012-01-31    19    57       10     9     6     6   NaN
20      2013-01-31  3200    51       20    50    93    56   NaN
90      2013-09-30    21    63       90    34    23     6   NaN
240     2012-12-31   240    30      240    30   NaN   NaN   NaN
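
A note for newer pandas releases: `resample` there returns a deferred object and needs an explicit aggregation such as `.mean()`, so the one-liner in `In [14]` may not run unchanged. The sketch below is one possible way to get a similar per-user monthly table. It is not the answerer's code: it swaps in `pd.Grouper` with a month-end offset rather than `groupby(...).resample(...)`, which also means months with no data never appear and no `dropna` is needed. The sample frame is a hypothetical four-row subset of the rows shown above, and averaging is an assumed choice of aggregation.

```python
import pandas as pd

# Hypothetical subset of the answer's frame (numeric columns only).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2011-01-01", "2013-01-02", "2014-01-11", "2013-01-12"]
    ),
    "user_id": [3, 6, 6, 20],
    "val1": [1, 20, 3100, 3200],
    "val2": [100, 8, 49, 51],
})

# An offset object is used instead of the 'M' string alias, because the
# spelling of that alias differs between pandas versions.
month_end = pd.offsets.MonthEnd()

# Group by user and by calendar month (labelled at month end), then
# average the value columns within each group.
monthly = (
    df.groupby(["user_id", pd.Grouper(key="date", freq=month_end)])
      [["val1", "val2"]]
      .mean()
)

print(monthly)
```

The result is indexed by (user_id, date), with one row per user and month that actually contains data, and keeps every value column that was selected.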