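This question has been resolved.
I am creating a DataFrame from several pipe-delimited (|) files. I read in my data, format my date column, and then set the date as a DatetimeIndex. The output I want is a DataFrame with a timestamp index, so that I can group by time period. When I run the code that should put timestamps on the index, I get an error (included below, after my code), and the output is printed without the timestamps applied: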
import glob
import numpy as np
import pandas as pd

# Only column 14 ('date') has a meaningful name; the rest are numbered.
names = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
         '12', '13', 'date', '15', '16', '17', '18', '19', '20',
         '21', '22']
# Read every pipe-delimited file and stack the results into one DataFrame.
df = pd.concat(pd.read_csv(f, sep='|', header=None, low_memory=False,
                           names=names, index_col=None, dtype={'date': str})
               for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt'))
# dropna returns a new object, so the result has to be assigned back.
df = df.dropna(subset=['date'])
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
print(df1)
              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155
My error:
KeyError: 'date'
18 df2 = df1.set_index(pd.to_datetime(df1['date']), inplace=True)
My raw data:
C00112250|N|Q3|G|27931381854|24K|ORG|HILLARY CLINTON FOR PRESIDENT EXP. COMM.|WASHINGTON|DC|20013|||08152007|2000|C00431569|P00003392|71006.E7975|307490|||4101720071081637544
C00119040|N|Q3|G|27990795873|24K|CCM|FRIENDS OF GINNY BROWN-WAITE|BROOKSVILLE|FL|34605|||09262007|1000|C00367680|H2FL05127|SB21.4307|307491|||4101720071081637552
C00119040|N|Q3|G|27990795873|24K|CCM|HOYER FOR CONGRESS|CLINTON|MD|20735|||09262007|1000|C00140715|H2MD05155|SB21.4303|307491|||4101720071081637553
Answer 0 (score: 0)
I think you can use the read_csv parameter usecols to filter the columns, and date_parser to parse the dates into datetime:
import pandas as pd
import glob

# Parse MMDDYYYY strings such as '08152007'.
# Note: date_parser is deprecated since pandas 2.0 in favor of date_format.
dateparse = lambda x: pd.to_datetime(x, format='%m%d%Y')
# change the path to your own
df = pd.concat((pd.read_csv(f,
                            sep='|',
                            header=None,
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'],
                            usecols=['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            # refer to the date column by name rather than by position
                            parse_dates=['date'],
                            date_parser=dateparse) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)
# reorder columns
df = df[['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id']]
print(df)
date cmte_id trans_typ entity_typ state employer occupation \
0 2007-08-15 C00112250 24K ORG DC NaN NaN
1 2007-09-26 C00119040 24K CCM FL NaN NaN
2 2007-09-26 C00119040 24K CCM MD NaN NaN
3 2011-02-25 C00478404 24K COM MN NaN NaN
4 2011-02-01 C00140855 24K CCM DC NaN NaN
5 2011-02-01 C00140855 24K CCM DC NaN NaN
6 2011-02-22 C00140855 24K CCM MD NaN NaN
7 2011-02-28 C00093963 24K CCM ND NaN NaN
amount fec_id cand_id
0 2000 C00431569 P00003392
1 1000 C00367680 H2FL05127
2 1000 C00140715 H2MD05155
3 2400 C00326629 H8MN06047
4 1000 C00373464 H2OH17109
5 1000 C00289983 H4KY01040
6 2500 C00140715 H2MD05155
7 1000 C00474619 H0ND00135
print(df.dtypes)
date datetime64[ns]
cmte_id object
trans_typ object
entity_typ object
state object
employer float64
occupation float64
amount int64
fec_id object
cand_id object
dtype: object
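On newer pandas releases, where date_parser is deprecated, one version-independent variant is to read the date column as a string and convert it afterwards, much as the question already does. The following is only a minimal sketch under assumptions: it reads the three sample rows from the question out of an in-memory string instead of the real files, and the helper name read_itpas2 is hypothetical.

```python
import io
import pandas as pd

# The three sample rows from the question, standing in for the real files.
SAMPLE = """\
C00112250|N|Q3|G|27931381854|24K|ORG|HILLARY CLINTON FOR PRESIDENT EXP. COMM.|WASHINGTON|DC|20013|||08152007|2000|C00431569|P00003392|71006.E7975|307490|||4101720071081637544
C00119040|N|Q3|G|27990795873|24K|CCM|FRIENDS OF GINNY BROWN-WAITE|BROOKSVILLE|FL|34605|||09262007|1000|C00367680|H2FL05127|SB21.4307|307491|||4101720071081637552
C00119040|N|Q3|G|27990795873|24K|CCM|HOYER FOR CONGRESS|CLINTON|MD|20735|||09262007|1000|C00140715|H2MD05155|SB21.4303|307491|||4101720071081637553
"""

NAMES = ['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9',
         'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id',
         'cand_id', '18', '19', '20', '21', '22']
KEEP = ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer',
        'occupation', 'amount', 'fec_id', 'cand_id']


def read_itpas2(source):
    """Hypothetical helper: read one pipe-delimited file path or buffer."""
    frame = pd.read_csv(source, sep='|', header=None, names=NAMES,
                        usecols=KEEP, dtype={'date': str})
    # Keeping the column as str preserves the leading zero in '08152007';
    # the conversion to datetime happens after reading.
    frame['date'] = pd.to_datetime(frame['date'], format='%m%d%Y')
    return frame[KEEP]


sample_df = read_itpas2(io.StringIO(SAMPLE))
print(sample_df.dtypes)
```

The same helper could be passed each path returned by glob.glob and the results combined with pd.concat(..., ignore_index=True), as in the code above.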
# if you want to keep a copy of the date column alongside the index
df.set_index(df['date'], inplace=True)
print(df)
date cmte_id trans_typ entity_typ state employer \
date
2007-08-15 2007-08-15 C00112250 24K ORG DC NaN
2007-09-26 2007-09-26 C00119040 24K CCM FL NaN
2007-09-26 2007-09-26 C00119040 24K CCM MD NaN
2011-02-25 2011-02-25 C00478404 24K COM MN NaN
2011-02-01 2011-02-01 C00140855 24K CCM DC NaN
2011-02-01 2011-02-01 C00140855 24K CCM DC NaN
2011-02-22 2011-02-22 C00140855 24K CCM MD NaN
2011-02-28 2011-02-28 C00093963 24K CCM ND NaN
occupation amount fec_id cand_id
date
2007-08-15 NaN 2000 C00431569 P00003392
2007-09-26 NaN 1000 C00367680 H2FL05127
2007-09-26 NaN 1000 C00140715 H2MD05155
2011-02-25 NaN 2400 C00326629 H8MN06047
2011-02-01 NaN 1000 C00373464 H2OH17109
2011-02-01 NaN 1000 C00289983 H4KY01040
2011-02-22 NaN 2500 C00140715 H2MD05155
2011-02-28 NaN 1000 C00474619 H0ND00135
# alternatively, if you do NOT need to keep a copy of the date column
df.set_index('date', inplace=True)
print(df)
cmte_id trans_typ entity_typ state employer occupation \
date
2007-08-15 C00112250 24K ORG DC NaN NaN
2007-09-26 C00119040 24K CCM FL NaN NaN
2007-09-26 C00119040 24K CCM MD NaN NaN
2011-02-25 C00478404 24K COM MN NaN NaN
2011-02-01 C00140855 24K CCM DC NaN NaN
2011-02-01 C00140855 24K CCM DC NaN NaN
2011-02-22 C00140855 24K CCM MD NaN NaN
2011-02-28 C00093963 24K CCM ND NaN NaN
amount fec_id cand_id
date
2007-08-15 2000 C00431569 P00003392
2007-09-26 1000 C00367680 H2FL05127
2007-09-26 1000 C00140715 H2MD05155
2011-02-25 2400 C00326629 H8MN06047
2011-02-01 1000 C00373464 H2OH17109
2011-02-01 1000 C00289983 H4KY01040
2011-02-22 2500 C00140715 H2MD05155
2011-02-28 1000 C00474619 H0ND00135
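As for the KeyError in the question: once set_index('date') has run, date is the index and no longer a column, so df1['date'] fails. The index is already a DatetimeIndex at that point, so no further conversion is needed before grouping. The sketch below illustrates this on a small hand-built frame (values copied from the output above; the names demo, monthly and yearly are hypothetical) and then groups by month and by year, which is one possible way to group by time period.

```python
import pandas as pd

# Small stand-in for the real data, with 'date' moved into the index.
demo = pd.DataFrame({
    'date': pd.to_datetime(['08152007', '09262007', '09262007', '02252011'],
                           format='%m%d%Y'),
    'state': ['DC', 'FL', 'MD', 'MN'],
    'amount': [2000, 1000, 1000, 2400],
}).set_index('date')

# 'date' is now the index, not a column, so a column lookup raises KeyError.
try:
    demo['date']
except KeyError as err:
    print('KeyError:', err)

# The index itself already holds the timestamps.
print(type(demo.index).__name__)

# Group by calendar month: convert the index to monthly periods and sum.
monthly = demo.groupby(demo.index.to_period('M'))['amount'].sum()
print(monthly)

# Group by year using the index's .year attribute.
yearly = demo.groupby(demo.index.year)['amount'].sum()
print(yearly)
```

If the dates are needed as a column again, demo.reset_index() moves the index back into a regular date column.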