I am following the book Python for Data Analysis. It tells me to download the ALL file from http://www.fec.gov/disclosurep/PDownload.do and load it with pandas:
import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv')
But the actual file has changed since the book was written. The old file (available here: https://github.com/pydata/pydata-book/blob/master/ch09/P00000001-ALL.csv) loads fine:
fec = pd.read_csv('../pydata-book/ch09/P00000001-ALL.csv')
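But the new file loads incorrectly: the columns appear to be shifted by one, and the first column's value is dropped. This is the first row of the resulting DataFrame: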
cmte_id P60008059
cand_id Bush, Jeb
cand_nm EASTON, AMY KELLY MRS.
contbr_nm KEY BISCAYNE
contbr_city FL
contbr_st 331491716
contbr_zip HOMEMAKER
contbr_employer HOMEMAKER
contbr_occupation 2700
contb_receipt_amt 26-JUN-15
contb_receipt_dt NaN
receipt_desc NaN
memo_cd NaN
memo_text SA17A
form_tp 1024106
file_num SA17.114991
tran_id P2016
election_tp NaN
The actual row in the file is:
C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",
So C00579458 is getting lost somewhere.
The header looks like this:
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
Answer 0 (score: 1)
The raw data has an extra comma at the end of each row:
C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
If there were 2 trailing commas, each row would be shifted by 2 columns.
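One way to confirm the trailing comma is to count fields in the header and in a data row. The following is a minimal sketch using only the standard library csv module; the names header_line, data_line, and extra are illustrative, and the two lines are copied from the question rather than read from the real file.

```python
import csv
import io

# Header and first data row copied from the question (not read from the FEC file).
header_line = (
    "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,"
    "contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,"
    "receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"
)
data_line = (
    'C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE",'
    '"FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","",'
    '"SA17A","1024106","SA17.114991","P2016",'
)

# csv.reader handles the quoted commas ("Bush, Jeb") correctly.
rows = list(csv.reader(io.StringIO(header_line + "\n" + data_line + "\n")))
header_fields, data_fields = rows[0], rows[1]

n_header = len(header_fields)
n_data = len(data_fields)
extra = n_data - n_header  # number of surplus fields per data row

print("header fields:", n_header)
print("data fields:  ", n_data)
print("extra fields: ", extra)
print("last data field is empty:", data_fields[-1] == "")
```

A positive value of extra together with an empty last field points to a trailing delimiter rather than a genuinely missing header name.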
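Answer 1 (score: 1)
As the other answer already suggests, your CSV is malformed: there is a comma at the end of each row. Because the data rows then have one more field than the header, pandas treats the first column as the index column.
To fix this, pass the index_col=False argument to pandas.read_csv(). Example: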
In [24]: s = io.StringIO("""cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
....: C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",""")
In [25]: df = pd.read_csv(s)  # Issue: first column becomes the index (assumes io and pandas as pd are already imported)
In [26]: df
Out[26]:
cmte_id cand_id cand_nm contbr_nm \
C00579458 P60008059 Bush, Jeb EASTON, AMY KELLY MRS. KEY BISCAYNE
contbr_city contbr_st contbr_zip contbr_employer \
C00579458 FL 331491716 HOMEMAKER HOMEMAKER
contbr_occupation contb_receipt_amt contb_receipt_dt \
C00579458 2700 26-JUN-15 NaN
receipt_desc memo_cd memo_text form_tp file_num tran_id \
C00579458 NaN NaN SA17A 1024106 SA17.114991 P2016
election_tp
C00579458 NaN
In [29]: df = pd.read_csv(s, index_col=False)  # No issue (s must be rewound with s.seek(0) or re-created before this second read)
In [30]: df
Out[30]:
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00579458 P60008059 Bush, Jeb EASTON, AMY KELLY MRS. KEY BISCAYNE
contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt \
0 FL 331491716 HOMEMAKER HOMEMAKER 2700
contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num \
0 26-JUN-15 NaN NaN NaN SA17A 1024106
tran_id election_tp
0 SA17.114991 P2016
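This is explained in the pandas.read_csv documentation:
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names)
(Emphasis mine)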
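For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch of the same comparison. It assumes pandas is installed and uses the header and row from the question; the names csv_text, shifted, and fixed are illustrative. A fresh StringIO is created for each read, because a stream that has already been consumed cannot be parsed again without rewinding it.

```python
import io

import pandas as pd

# Header and one data row from the question; note the trailing comma on the data row.
csv_text = (
    "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,"
    "contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,"
    "receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp\n"
    'C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE",'
    '"FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","",'
    '"SA17A","1024106","SA17.114991","P2016",\n'
)

# Default behaviour: the data row has one more field than the header,
# so pandas uses the first column as the index and every value shifts left.
shifted = pd.read_csv(io.StringIO(csv_text))

# index_col=False: pandas does not use the first column as the index,
# so the values line up with the header again.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(shifted.index[0], shifted["cmte_id"].iloc[0])
print(fixed.index[0], fixed["cmte_id"].iloc[0], fixed["election_tp"].iloc[0])
```

Depending on the pandas version, the second call may emit a ParserWarning about the header length not matching the data; with a trailing delimiter the discarded field is the empty one at the end of each row.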