I have set my DataFrame's index to the date column. Now I want that index converted with to_datetime. My code is as follows:
import numpy as np
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, sep='|', header=None, index_col=None, low_memory=False) for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))
df.columns = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', 'date', '15', '16', '17', '18', '19', '20', '21', '22']
df.set_index(pd.to_datetime(df['date']), inplace=True)
df1 = df[['1', '6', '7', '10', '12', '13', '15', '16', '17']].copy()
df1.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id']
print(df1)
But my output looks like it is appending a new date column:
cmte_id trans_typ entity_typ state employer \
date
1970-01-01 00:00:00.008152007 C00112250 24K ORG DC NaN
1970-01-01 00:00:00.009262007 C00119040 24K CCM FL NaN
1970-01-01 00:00:00.009262007 C00119040 24K CCM MD NaN
1970-01-01 00:00:00.00
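My original date values are the last 8 digits of the date index. For reference, the first row of the file passed to read_csv looks like this (the date value in it is 08152007):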
C00112250|N|Q3|G|27931381854|24K|ORG|HILLARY CLINTON FOR PRESIDENT EXP. COMM.|WASHINGTON|DC|20013|||08152007|2000|C00431569|P00003392|71006.E7975|307490|||4101720071081637544
Answer 0 (score: 2)
OK, I found your problem. Change the read_csv line to:
df = pd.concat((pd.read_csv(f, sep='|', header=None, names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', 'date', '15', '16', '17', '18', '19', '20', '21', '22'], index_col=None, dtype={'date':str}) for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))
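This sets your column names and forces the date column to be read with str dtype. Without that, the column was being read as int, which drops the leading 0 (08152007 becomes 8152007). With the strings intact, you can then convert the type: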
df.set_index(pd.to_datetime(df['date'], format='%m%d%Y'), inplace=True)
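To see why the original index showed 1970 timestamps, here is a minimal sketch using the single date value from the question. It relies on pd.to_datetime treating plain integers as nanoseconds since the Unix epoch when no unit or format is given; the variable names are only for illustration.

```python
import pandas as pd

# What read_csv infers without dtype={'date': str}: an integer, leading 0 lost.
as_int = pd.Series([8152007])
# What dtype={'date': str} preserves: the full 8-character MMDDYYYY string.
as_str = pd.Series(["08152007"])

# Integers are interpreted as nanoseconds since 1970-01-01.
wrong = pd.to_datetime(as_int)
# Strings are parsed with the explicit month-day-year format.
right = pd.to_datetime(as_str, format="%m%d%Y")

print(wrong.iloc[0])  # 1970-01-01 00:00:00.008152007
print(right.iloc[0])  # 2007-08-15 00:00:00
```

The first result matches the odd index in the question: the digits 8152007 show up as the nanosecond part of the timestamp.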
Example:
In [336]:
import pandas as pd
import io
t="""C00112250|N|Q3|G|27931381854|24K|ORG|HILLARY CLINTON FOR PRESIDENT EXP. COMM.|WASHINGTON|DC|20013|||08152007|2000|C00431569|P00003392|71006.E7975|307490|||4101720071081637544"""
df = pd.read_csv(io.StringIO(t), sep='|', header=None, names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', 'date', '15', '16', '17', '18', '19', '20', '21', '22'], index_col=None, dtype={'date':str})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 22 columns):
1 1 non-null object
2 1 non-null object
3 1 non-null object
4 1 non-null object
5 1 non-null int64
6 1 non-null object
7 1 non-null object
8 1 non-null object
9 1 non-null object
10 1 non-null object
11 1 non-null int64
12 0 non-null float64
13 0 non-null float64
date 1 non-null object
15 1 non-null int64
16 1 non-null object
17 1 non-null object
18 1 non-null object
19 1 non-null int64
20 0 non-null float64
21 0 non-null float64
22 1 non-null int64
dtypes: float64(4), int64(5), object(13)
memory usage: 184.0+ bytes
In [337]:
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 22 columns):
1 1 non-null object
2 1 non-null object
3 1 non-null object
4 1 non-null object
5 1 non-null int64
6 1 non-null object
7 1 non-null object
8 1 non-null object
9 1 non-null object
10 1 non-null object
11 1 non-null int64
12 0 non-null float64
13 0 non-null float64
date 1 non-null datetime64[ns]
15 1 non-null int64
16 1 non-null object
17 1 non-null object
18 1 non-null object
19 1 non-null int64
20 0 non-null float64
21 0 non-null float64
22 1 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(5), object(12)
memory usage: 184.0+ bytes
In [338]:
df['date']
Out[338]:
0 2007-08-15
Name: date, dtype: datetime64[ns]
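If the data has already been loaded with the date column inferred as int, re-reading the files is not the only option. The sketch below is one possible alternative, not part of the answer above: it zero-pads the integers back to 8 characters before parsing, then moves the parsed column into the index. The three-column layout, the column names, and the amount in the second row are hypothetical, chosen only to keep the example short.

```python
import io
import pandas as pd

# Hypothetical, trimmed-down stand-in for the pipe-delimited FEC rows.
raw = "C00112250|08152007|2000\nC00119040|09262007|1000\n"
df = pd.read_csv(io.StringIO(raw), sep="|", header=None,
                 names=["cmte_id", "date", "amount"])

# 'date' was inferred as int64, so 08152007 is now 8152007.
# Restore the leading zero by padding to 8 characters, then parse.
padded = df["date"].astype(str).str.zfill(8)
df["date"] = pd.to_datetime(padded, format="%m%d%Y")

# Passing the column name moves 'date' into the index and removes the column.
df = df.set_index("date")
print(df)
```

In the printed frame, the line containing only "date" is the name of the index, not an extra column; pandas prints the index name on its own row beneath the column headers.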