Double-quoted elements in a CSV cannot be read with pandas

Time: 2014-10-27 19:59:22

Tags: python csv pandas

I have an input file in which every value is stored as a string. It is a CSV file, and every entry is enclosed in double quotes.

Sample file:

"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

There are only six columns. Which options do I need to pass to pandas read_csv so that it reads this correctly?

I am currently trying:

import pandas as pd
df = pd.read_csv(file, quotechar='"')

But this gives me the error: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

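This apparently means that it is ignoring the '"' characters and treating every comma as a field separator. However, on line 3, columns 3 through 6 should be strings that contain commas ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").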
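
The count of 14 can be reproduced with Python's built-in csv module. This is only an illustrative sketch: the csv module is a separate parser from the one pandas uses, and the assumption here is that both treat a quote that follows a space as ordinary text rather than as the start of a quoted field.

```python
import csv

# Line 3 of the sample file, with a blank after each comma.
line = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"'

# Default settings: a quote preceded by a space does not open a quoted
# field, so the commas inside "1,2,3" etc. split the row further.
with_spaces = next(csv.reader([line]))

# The same row with the blank after each comma removed.
without_spaces = next(csv.reader([line.replace(', "', ',"')]))

print(len(with_spaces), len(without_spaces))  # prints: 14 6
```

So the quoting itself is fine; it is the space between the comma and the opening quote that breaks it.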

How do I get pandas.read_csv to parse this correctly?

Thanks.

1 Answer:

Answer 0: (score: 11)

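This will work. It falls back to the python parser (because your separator is not regular: sometimes it is a comma alone, sometimes a comma followed by a space). If you only had commas, it would use the c-parser and be much faster.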

In [1]: import csv, pandas as pd

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv',sep=r',\s+',quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]: 
     "column1","column2" "column3"   "column4"   "column5"   "column6"
"AM"                "07"       "1"        "SD"        "SD"        "CR"
"AM"                "08"   "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                "01"       "2"        "SD"        "SD"        "SD"