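Pandas read_csv file import error

Time: 2013-10-18 14:00:20

Tags: python csv encoding pandas

I am trying to import a csv file in Pandas, but it throws an error. This is what the data looks like when opened in Notepad++; the first line holds the column names: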

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date"
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006"
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008"
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008"
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008"
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008"
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009"
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard / Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008"
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008"

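Note that double-clicking the same csv file opens it in Excel as comma-separated values, but without the quotation marks around each line that Notepad++ shows.

When I use UTF-8 as the encoding, I get the following error: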

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte
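The byte 0x91 named in the traceback is a useful clue. A minimal sketch using only the standard library shows how the three codecs mentioned in this question treat it; the sample bytes are made up for illustration and are not taken from the file:

```python
# Hypothetical sample: byte 0x91 followed by plain ASCII text.
raw = b"\x91quoted"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0x91 is a continuation byte, so it cannot start a character.
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")  # 0x91 maps to U+2018, a left single quotation mark
as_latin1 = raw.decode("latin1")  # 0x91 maps to U+0091, an invisible C1 control character

print(utf8_ok, repr(as_cp1252), repr(as_latin1))
```

Because latin1 accepts every byte, it never raises, but it may silently produce control characters. If the file came from a Windows tool, cp1252 is probably the better guess of the two.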

I then tried encoding='cp1252' first, and after that latin1.

df=pd.read_csv(filename,encoding='cp1252') 

or 

df=pd.read_csv(filename,encoding='latin1')

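With both of these encodings no error is raised and the data is imported, but as one single column instead of separate columns.

Does this have to do with the quotation marks around every line in the data? I have a similar comma-separated csv file without the double quotes on each line, and it imports correctly with both cp1252 and latin1, though not with UTF-8, even when the file is saved as UTF-8 in Notepad++. In this case, however, UTF-8 fails as usual and the other two import the data as a single column.

Please advise.

Thanks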

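1 answer:

Answer 0 (score: 0)

I'm pretty sure the quotes cause the parser to treat all the commas on a line as being inside one quoted field, so each line ends up as a single value. You need to strip the quotes out. That is relatively simple to do, but because of the unicode issues I would suggest that you read the file in, strip the quotes, and write the result out to a new file to use with read_csv, because that simplifies the encoding issues.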
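The effect of the wrapping quotes can be reproduced with Python's standard csv module, which follows the same quoting convention. This is only an illustrative sketch; the three-column row below is hypothetical and modeled on the data in the question:

```python
import csv

# Hypothetical three-column row, wrapped in quotes the way the file's rows are.
wrapped = '"1809012,""Hcl Infosystems Ltd, Partner"",125.85"'

# Parsed as-is, the outer quotes turn the whole line into a single field.
outer = next(csv.reader([wrapped]))

# Parsing that single field a second time recovers the intended columns.
inner = next(csv.reader([outer[0]]))

print(len(outer), inner)
```

Note that the doubled quotes inside the line are escaped quotes, so they need to become single quotes again once the outer pair is removed.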

Here is how to read the file, strip the quotes, write the result to a new file, and then read that file in with read_csv:

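import pandas as pd

# Text mode with an explicit encoding; assumes every line is wrapped in one pair of quotes
with open(filename, encoding='cp1252') as infile, \
        open(tmpfile, 'w', encoding='cp1252') as outfile:
    for line in infile:
        # Drop the wrapping quotes and turn doubled "" back into "
        outfile.write(line.rstrip('\n')[1:-1].replace('""', '"') + '\n')

result = pd.read_csv(tmpfile, encoding='cp1252')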

You will also want to delete the temporary file once you have finished reading it.
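One way to make sure the temporary file is always removed is to create it with tempfile and delete it in a finally block. The sketch below is an assumption about how this could be organized, not part of the original answer: the helper name with_cleaned_copy and its reader callback are hypothetical, and it assumes wrapped lines carry exactly one outer pair of quotes.

```python
import os
import tempfile


def with_cleaned_copy(src_path, reader, encoding="cp1252"):
    """Write an unwrapped copy of src_path, call reader(copy_path), then delete the copy."""
    fd, tmp_path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)  # reopen by path below so the encoding can be set explicitly
    try:
        with open(src_path, encoding=encoding, newline="") as infile, \
                open(tmp_path, "w", encoding=encoding, newline="") as outfile:
            for line in infile:
                body = line.rstrip("\r\n")
                # Only unwrap lines that really start and end with a quote.
                if len(body) >= 2 and body[0] == body[-1] == '"':
                    body = body[1:-1].replace('""', '"')
                outfile.write(body + "\n")
        return reader(tmp_path)
    finally:
        # Runs whether or not reader() raised, so no stray file is left behind.
        os.remove(tmp_path)
```

With pandas this might be called as with_cleaned_copy(filename, lambda path: pd.read_csv(path, encoding="cp1252")), where filename is the path from the question.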

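The reason I suggest doing it as above is that you avoid dealing with encoding and decoding when passing the data to a StringIO buffer, which is finicky in both Python and pandas.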
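For comparison, the in-memory route is less awkward in Python 3, where a file opened in text mode is already decoded. The following is a sketch under assumptions rather than the answer's method: the helper name unwrap_to_buffer is hypothetical, it relies on the standard csv module to remove the outer quotes, and it assumes every non-empty line is wrapped so that each parsed row contains exactly one field.

```python
import csv
import io


def unwrap_to_buffer(path, encoding="cp1252"):
    """Return an io.StringIO holding the file's rows without the wrapping quotes."""
    buffer = io.StringIO()
    with open(path, encoding=encoding, newline="") as infile:
        for row in csv.reader(infile):
            if not row:
                continue  # skip blank lines
            if len(row) != 1:
                # The line was not wrapped in quotes, so the assumption does not hold.
                raise ValueError("expected one quoted field per line")
            # csv.reader has already removed the outer quotes and un-doubled "".
            buffer.write(row[0] + "\n")
    buffer.seek(0)  # rewind so the next reader starts at the first row
    return buffer
```

Since the buffer holds already-decoded text, it can be handed to pd.read_csv directly, for example pd.read_csv(unwrap_to_buffer(filename)), without an encoding argument.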