Setting a date column as a datetime index in pandas (Python)

Time: 2016-03-10 14:16:03

Tags: python pandas

This question has been resolved.

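I am creating a DataFrame from several pipe-delimited (|) files. I read in my data, format my date column, and then set the date as a datetime index. The output I want is a DataFrame with a timestamp index so that I can group by time. When I run the code that adds the timestamp to the index, I get an error. The error is shown after my code, together with the output I get without the timestamp step: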

import numpy as np
import pandas as pd
import glob


df = pd.concat((pd.read_csv(f, sep='|', header=None, low_memory=False, names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', \
                                                            '12', '13', 'date', '15', '16', '17', '18', '19', '20', \
                                                            '21', '22'], index_col=None, dtype={'date':str}) for f in \
                glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))


df['date'].dropna()

df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

df1 = df.set_index('date')



print(df1)
              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155

My error:

KeyError: 'date'
18 df2 = df1.set_index(pd.to_datetime(df1['date']), inplace=True)

My raw data:

C00112250|N|Q3|G|27931381854|24K|ORG|HILLARY CLINTON FOR PRESIDENT EXP. COMM.|WASHINGTON|DC|20013|||08152007|2000|C00431569|P00003392|71006.E7975|307490|||4101720071081637544
C00119040|N|Q3|G|27990795873|24K|CCM|FRIENDS OF GINNY BROWN-WAITE|BROOKSVILLE|FL|34605|||09262007|1000|C00367680|H2FL05127|SB21.4307|307491|||4101720071081637552
C00119040|N|Q3|G|27990795873|24K|CCM|HOYER FOR CONGRESS|CLINTON|MD|20735|||09262007|1000|C00140715|H2MD05155|SB21.4303|307491|||4101720071081637553

1 answer:

Answer 0 (score: 0)

I think you can use the read_csv parameter usecols to filter the columns and date_parser to parse the dates as datetime:

import pandas as pd
import glob


dateparse = lambda x: pd.to_datetime(x, format='%m%d%Y')

# change the path to match your own data
df = pd.concat((pd.read_csv(f, 
                            sep='|', 
                            header=None, 
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
                            usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            parse_dates=[6],
                            date_parser=dateparse) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)

#reorder columns
df = df[['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id']]
print(df)
        date    cmte_id trans_typ entity_typ state  employer  occupation  \
0 2007-08-15  C00112250       24K        ORG    DC       NaN         NaN   
1 2007-09-26  C00119040       24K        CCM    FL       NaN         NaN   
2 2007-09-26  C00119040       24K        CCM    MD       NaN         NaN   
3 2011-02-25  C00478404       24K        COM    MN       NaN         NaN   
4 2011-02-01  C00140855       24K        CCM    DC       NaN         NaN   
5 2011-02-01  C00140855       24K        CCM    DC       NaN         NaN   
6 2011-02-22  C00140855       24K        CCM    MD       NaN         NaN   
7 2011-02-28  C00093963       24K        CCM    ND       NaN         NaN   

   amount     fec_id    cand_id  
0    2000  C00431569  P00003392  
1    1000  C00367680  H2FL05127  
2    1000  C00140715  H2MD05155  
3    2400  C00326629  H8MN06047  
4    1000  C00373464  H2OH17109  
5    1000  C00289983  H4KY01040  
6    2500  C00140715  H2MD05155  
7    1000  C00474619  H0ND00135  

print(df.dtypes)
date          datetime64[ns]
cmte_id               object
trans_typ             object
entity_typ            object
state                 object
employer             float64
occupation           float64
amount                 int64
fec_id                object
cand_id               object
dtype: object
# if you want to keep the date column and also use it as the index
df.set_index(df['date'], inplace=True)
print(df)
                 date    cmte_id trans_typ entity_typ state  employer  \
date                                                                    
2007-08-15 2007-08-15  C00112250       24K        ORG    DC       NaN   
2007-09-26 2007-09-26  C00119040       24K        CCM    FL       NaN   
2007-09-26 2007-09-26  C00119040       24K        CCM    MD       NaN   
2011-02-25 2011-02-25  C00478404       24K        COM    MN       NaN   
2011-02-01 2011-02-01  C00140855       24K        CCM    DC       NaN   
2011-02-01 2011-02-01  C00140855       24K        CCM    DC       NaN   
2011-02-22 2011-02-22  C00140855       24K        CCM    MD       NaN   
2011-02-28 2011-02-28  C00093963       24K        CCM    ND       NaN   

            occupation  amount     fec_id    cand_id  
date                                                  
2007-08-15         NaN    2000  C00431569  P00003392  
2007-09-26         NaN    1000  C00367680  H2FL05127  
2007-09-26         NaN    1000  C00140715  H2MD05155  
2011-02-25         NaN    2400  C00326629  H8MN06047  
2011-02-01         NaN    1000  C00373464  H2OH17109  
2011-02-01         NaN    1000  C00289983  H4KY01040  
2011-02-22         NaN    2500  C00140715  H2MD05155  
2011-02-28         NaN    1000  C00474619  H0ND00135  
# if you do NOT need to keep a copy of the date column
df.set_index('date', inplace=True)
print(df)
              cmte_id trans_typ entity_typ state  employer  occupation  \
date                                                                     
2007-08-15  C00112250       24K        ORG    DC       NaN         NaN   
2007-09-26  C00119040       24K        CCM    FL       NaN         NaN   
2007-09-26  C00119040       24K        CCM    MD       NaN         NaN   
2011-02-25  C00478404       24K        COM    MN       NaN         NaN   
2011-02-01  C00140855       24K        CCM    DC       NaN         NaN   
2011-02-01  C00140855       24K        CCM    DC       NaN         NaN   
2011-02-22  C00140855       24K        CCM    MD       NaN         NaN   
2011-02-28  C00093963       24K        CCM    ND       NaN         NaN   

            amount     fec_id    cand_id  
date                                      
2007-08-15    2000  C00431569  P00003392  
2007-09-26    1000  C00367680  H2FL05127  
2007-09-26    1000  C00140715  H2MD05155  
2011-02-25    2400  C00326629  H8MN06047  
2011-02-01    1000  C00373464  H2OH17109  
2011-02-01    1000  C00289983  H4KY01040  
2011-02-22    2500  C00140715  H2MD05155  
2011-02-28    1000  C00474619  H0ND00135
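
Once the date is the index, grouping by time, which is the goal stated in the question, becomes possible. The following is a rough sketch rather than a definitive recipe: it rebuilds a small frame from the dates and amounts printed above (only those two fields, for brevity) and sums amount per year and per month. The names per_year and per_month are illustrative.

```python
import pandas as pd

# Dates and amounts copied from the printed output above.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2007-08-15", "2007-09-26", "2007-09-26", "2011-02-25",
             "2011-02-01", "2011-02-01", "2011-02-22", "2011-02-28"]
        ),
        "amount": [2000, 1000, 1000, 2400, 1000, 1000, 2500, 1000],
    }
).set_index("date")

# Total amount per calendar year, grouped on the year of the DatetimeIndex.
per_year = df.groupby(df.index.year)["amount"].sum()

# Total amount per calendar month, grouped on a monthly PeriodIndex.
per_month = df.groupby(df.index.to_period("M"))["amount"].sum()

print(per_year)
print(per_month)
```

Grouping on df.index.year or df.index.to_period('M') only works when the index is a DatetimeIndex, which is why the date column has to be parsed with to_datetime (or at read time) before it is set as the index.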