Pandas does not match the headers to the CSV values correctly

Time: 2017-07-11 17:04:57

Tags: python csv pandas

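I am trying to build a pivot table to analyze data with pandas. My data is in a CSV file (data.csv) that has no header row. When reading it with pandas, I supply the following list as the column names: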

Labels = ['voter_id_org','State ID','city','ward','pct','name_last','name_first','name_middle','name_suffix','Status,party','Registration Date','Last Registration Date','house_no','pre_dir','street','apartment','zip','birth_date','voter_id','Source','P_05_02_2017','S_12_06_2016','G_11_08_2016','S_08_02_2016','S_06_21_2016','P_03_15_2016','S_12_08_2015','G_11_03_2015','P_09_08_2015','P_05_05_2015','S_02_03_2015','G_11_04_2014','S_08_05_2014','P_05_06_2014','G_11_05_2013','P_10_01_2013','P_09_10_2013','S_08_06_2013','P_05_07_2013','G_11_06_2012','S_08_07_2012','P_03_06_2012','G_11_08_2011','P_09_13_2011','S_08_02_2011','P_05_03_2011','S_02_08_2011','G_11_02_2010','P_09_07_2010','S_08_03_2010','P_05_04_2010','G_11_03_2009','P_09_29_2009','P_09_08_2009','S_08_04_2009','P_05_05_2009','S_02_03_2009','SG_12_23_2008','SG_11_18_2008','G_11_04_2']

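However, I cannot reference specific columns accurately by label, so my pivot table comes out empty. My code does produce a pivot table when the CSV is strictly comma-delimited, so I suspect the problem is the " characters between rows in data.csv. How can I read this file correctly so that I can access each column?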

data.csv:

547212,OH0014718999,CLEVELAND,03,H,JOHNSON,JAMES,M,,A,NOPTY,01/01/1901,09/19/2016,1500,,DETROIT AVE,     APT 505,44113,1959,547212,VOTER PARTICIPATION CENTER,,,Y,,,,,,,,,,,,Y,,,,,Y,,,Y,,,,,Y,,,,Y,,,,,,,,Y,,,,D,,,,,,,,,,Y,,,,,CLEV CSD,CONG 11,HSE 10,SEN 21,CLE MCD,"CCD 07
"
652898,OH0014779218,CLEVELAND,03,Q,WOLSTEIN,JILLIAN,MARCY,,A,NOPTY,01/01/1901,03/22/2017,1055,,OLD RIVER RD,     APT 811,44113,1960,652898,5 - RECEIVED IN MAIL,,,Y,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CLEV CSD,CONG 11,HSE 10,SEN 21,CLE MCD,"CCD 07
"
2417233,OH0020357576,CLEVELAND,07,J,PYNE,DANIEL,J,,I,NOPTY,10/06/2008,10/06/2008,1701,E, 12TH ST,         14Q,44114,1984,2417233,SECRETARY OF STATE S OFFICE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Y,,,,,,,,Y,,,,,,,,,,,,,,,,,,,CLEV CSD,CONG 11,HSE 10,SEN 21,CLE MCD,"CCD 07
"
2407693,OH0020299723,CLEVELAND,03,H,ANGELO,CELIA,E,,A,NOPTY,10/06/2008,07/08/2015,1500,,DETROIT AVE,     APT 102,44113,1985,2407693,5 - RECEIVED IN MAIL,,,Y,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Y,,,,,,,,,,,,,,,,,,,CLEV CSD,CONG 11,HSE 10,SEN 21,CLE MCD,"CCD 07
    "
...

My code:

import pandas as pd

def analyzefile(file):
    # Labels is the list of column names shown above
    f = pd.read_csv(file, header=None, names=Labels)
    pt = pd.pivot_table(f, index=['State ID'], aggfunc='count')
    print(pt)

1 answer:

Answer 0 (score: 1)

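You cannot reference specific columns in the DataFrame accurately because df.columns has length 85 while the Labels list has length 60, so the labels cannot line up one-to-one with the fields. If you just want to pivot the DataFrame this way, you can read the file without names and index by column position: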

import pandas as pd
df = pd.read_csv('data.csv', delimiter=',', header=None)
# Column 1 holds the State ID values
pd.pivot_table(df, index=1, aggfunc='count')
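To confirm the mismatch yourself, one option is to read the file without names, compare the number of parsed fields with the number of labels, and pad the label list before assigning it. The sketch below uses a hypothetical five-field miniature of data.csv and a three-entry label list. The `extra_N` names are placeholders made up for illustration, not real column names from the voter file.

```python
import io
import pandas as pd

# Hypothetical miniature of data.csv: 5 fields per row, but only 3 labels.
sample = io.StringIO(
    "547212,OH0014718999,CLEVELAND,03,H\n"
    "652898,OH0014779218,CLEVELAND,03,Q\n"
)
labels = ["voter_id_org", "State ID", "city"]

# Read without names first so every field gets its own positional column.
df = pd.read_csv(sample, header=None, dtype=str)
n_fields, n_labels = df.shape[1], len(labels)
print(f"{n_fields} fields parsed, {n_labels} labels supplied")

# Pad the label list with placeholder names so it matches the field count.
padded = labels + [f"extra_{i}" for i in range(n_labels, n_fields)]
df.columns = padded

# Now the pivot can refer to 'State ID' by name.
pt = pd.pivot_table(df, index="State ID", aggfunc="count")
print(pt)
```

With the real file you would pass 'data.csv' instead of the in-memory sample and, ideally, replace the placeholders with the actual names of the remaining columns.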

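The problem is not the " between rows in data.csv: each one is simply the closing quote of the last item in that row. The last field is quoted and contains a line break, so it spans two physical lines, but it is still parsed as a single field.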
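If you want to check the quoting behaviour in isolation, here is a minimal sketch using a made-up three-field sample whose last field is quoted and ends with a newline, the same shape as the rows in data.csv. The values `A1` and `B2` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: each record ends with a quoted field containing a newline,
# so every record occupies two physical lines.
sample = io.StringIO(
    'A1,CLEVELAND,"CCD 07\n"\n'
    'B2,CLEVELAND,"CCD 07\n"\n'
)

df = pd.read_csv(sample, header=None)

# Two records and three fields, despite four physical lines of text.
print(df.shape)

# The embedded newline is kept inside the value; strip it if it is unwanted.
last = df[2].str.strip()
print(last.tolist())
```

So the quotes only affect the content of the last column, which you can clean up with `.str.strip()` after reading.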