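Counting occurrences across DataFrame rows

Time: 2016-04-19 00:54:10

Tags: python pandas

I have found answers that come close to this, but none that solves this exact problem. I have a data table that looks like this: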

ID          DATE
74180       11/07/2000
74180       11/04/2008
81337       11/04/2008
81337       11/02/2010
82557       11/07/2000
82557       11/05/2002
82557       11/02/2004
82557       11/04/2008
82557       11/06/2012
82901       11/07/2000
82901       11/05/2002
82901       11/02/2004
82901       11/04/2008
82901       11/06/2012
82901       11/04/2014
83103       11/04/2008
83103       11/02/2010
83103       11/06/2012
83103       11/04/2014

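I want to reshape it so that each ID occupies a single row and each distinct date becomes a binary column (1 if the ID has that date, 0 otherwise), i.e.: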

ID        11/07/2000   11/05/2002   11/02/2004 ...
74180     1            0            0
81337     0            0            0

Any guidance is greatly appreciated.

2 Answers:

Answer 0: (score: 0)

Consider:

# Index by ID, one-hot encode DATE, then sum the indicator rows per ID
df.set_index('ID', inplace=True)
pd.get_dummies(df.loc[:, 'DATE']).groupby(level='ID').sum()

       2000-11-07  2002-11-05  2004-11-02  2008-11-04  2010-11-02  2012-11-06  \
ID                                                                              
74180         1.0         0.0         0.0         1.0         0.0         0.0   
81337         0.0         0.0         0.0         1.0         1.0         0.0   
82557         1.0         1.0         1.0         1.0         0.0         1.0   
82901         1.0         1.0         1.0         1.0         0.0         1.0   
83103         0.0         0.0         0.0         1.0         1.0         1.0   

       2014-11-04  
ID                 
74180         0.0  
81337         0.0  
82557         0.0  
82901         1.0  
83103         1.0
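
Note that `sum()` yields counts, so an ID that lists the same date twice would show 2.0 rather than 1.0. Below is a minimal sketch of one way to get strictly binary integer columns. It swaps in `pd.crosstab` for `get_dummies`, and the `sample` frame (with one deliberately repeated row) is a hypothetical stand-in for the question's data, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample shaped like the question's table,
# with one deliberately repeated (ID, DATE) pair.
sample = pd.DataFrame({
    "ID": [74180, 74180, 74180, 81337],
    "DATE": ["11/07/2000", "11/04/2008", "11/04/2008", "11/04/2008"],
})

# crosstab counts how often each (ID, DATE) pair occurs.
counts = pd.crosstab(sample["ID"], sample["DATE"])

# Capping the counts at 1 turns them into presence/absence flags.
binary = counts.clip(upper=1)
print(binary)
```

If the data is known to contain no repeated (ID, DATE) pairs, the capping step is unnecessary and the counts are already 0 or 1.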

Answer 1: (score: 0)

First, recreate the DataFrame:

import pandas as pd

ID = [74180,74180,81337,81337,82557,82557,82557,82557,82557,82901,82901,82901,82901,82901,82901,83103,83103,83103,83103]
DATE = ['2000-11-07','2008-11-04','2008-11-04','2010-11-02','2000-11-07','2002-11-05','2004-11-02','2008-11-04','2012-11-06','2000-11-07','2002-11-05','2004-11-02','2008-11-04','2012-11-06','2014-11-04','2008-11-04','2010-11-02','2012-11-06','2014-11-04']
df = pd.DataFrame({'ID':ID, 'DATE':DATE})

The actual processing:

# One-hot encode DATE with ID as the index, then sum the indicator rows per ID
df2 = pd.get_dummies(df.set_index('ID')['DATE'])
df2.reset_index().groupby('ID').sum()

Output:

       2000-11-07  2002-11-05  2004-11-02  2008-11-04 ...
ID
74180         1.0         0.0         0.0         1.0 ...
81337         0.0         0.0         0.0         1.0 ...
82557         1.0         1.0         1.0         1.0 ...
82901         1.0         1.0         1.0         1.0 ...
83103         0.0         0.0         0.0         1.0 ...
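
Because `get_dummies` creates one column per distinct string, the same day written two ways (for example with and without a leading zero) would be split across two columns. Below is a minimal sketch of normalising the date strings before encoding. It assumes year-month-day input, and the `raw` frame and `normalise` helper are hypothetical names introduced for illustration only:

```python
from datetime import datetime

import pandas as pd

# Hypothetical data: the second row writes the day without a leading zero.
raw = pd.DataFrame({
    "ID": [82557, 82901, 82901],
    "DATE": ["2000-11-07", "2000-11-7", "2002-11-05"],
})


def normalise(date_text):
    # strptime accepts a day without a leading zero;
    # isoformat() writes it back zero-padded.
    return datetime.strptime(date_text, "%Y-%m-%d").date().isoformat()


raw["DATE"] = raw["DATE"].map(normalise)

# One-hot encode, then collapse to one row per ID.
# max() keeps the result binary even if a pair repeats.
flags = (
    pd.get_dummies(raw.set_index("ID")["DATE"])
    .groupby(level="ID")
    .max()
    .astype(int)
)
print(flags)
```

After normalisation both spellings land in the single `2000-11-07` column.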