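Importing a CSV file whose values are enclosed in " when some of them also contain " and commas

Asked: 2018-02-06 12:56:32

Tags: python pandas csv quotes

I think I have searched thoroughly, but if I missed something, please let me know.

I am trying to import a CSV file in which all non-numeric values are enclosed in ". I ran into a problem with: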

 import pandas as pd
 df = pd.read_csv("file.csv")

CSV example:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""

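Because the values contain several quotes and commas, pandas sees more than 4 columns in this case (5 or 6, for example).

I have already tried playing with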

df = pd.read_csv("file.csv", quotechar='"', quoting=2)

but got

ParserError: Error tokenizing data (...)

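Skipping the bad lines with

error_bad_lines=False

works, but I would rather take all of the data into account somehow than simply ignore part of it.

Many thanks for your help!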

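2 answers:

Answer 0 (score: 2)

This looks like malformed CSV data, because the " characters inside the values should be escaped. I usually see such values escaped either by doubling them or by prefixing them with \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13

The first thing I would do is fix whatever is exporting these files. However, if you cannot do that, you can work around the problem by escaping the " characters that are part of a value.

Your best bet is probably to assume that a " is only followed (or preceded) by a comma or a newline if it is the end (or start) of a value. Then you can do a regex replacement (working from memory, so it may not be 100% right, but it should give you the right idea; you will have to adapt it to whatever regex library you have available)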

s/([^,\n])"([^,\n])/$1""$2/g

So if you ran it over your sample file, it would be escaped as:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""

Or, using the following

s/([^,\n])"([^,\n])/$1\"$2/g

the file would be escaped as follows:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""

Depending on your CSV parser, one of these should be accepted and work as expected.
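As a rough Python sketch of the two substitutions above (not tested against every possible input): the pattern below uses lookarounds instead of the capture groups in the sed-style expression, so the neighbouring characters are not consumed, and it also treats \r as a line break. The names RAW and INNER_QUOTE are made up for this illustration, and the standard csv module stands in for the parser.

```python
import csv
import io
import re

# Sample data copied from the question (inner quotes are not escaped).
RAW = '''"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
'''

# Assumption: a quote belongs to a value when the characters on both sides
# exist and are neither a comma nor a line break.
INNER_QUOTE = re.compile(r'(?<=[^,\r\n])"(?=[^,\r\n])')

# Variant 1: double the inner quotes.
doubled = INNER_QUOTE.sub('""', RAW)
# Variant 2: prefix the inner quotes with a backslash
# (the template \\" produces a backslash followed by a quote).
backslashed = INNER_QUOTE.sub(r'\\"', RAW)

# Doubled quotes are the csv module's default convention.
rows_doubled = list(csv.reader(io.StringIO(doubled)))
# Backslash escapes have to be enabled explicitly.
rows_backslashed = list(
    csv.reader(io.StringIO(backslashed), escapechar="\\", doublequote=False)
)

for row in rows_doubled:
    print(len(row), row[-1])
```

pandas' read_csv accepts the same escapechar and doublequote parameters, so passing io.StringIO(doubled) with the defaults, or io.StringIO(backslashed) with escapechar="\\" and doublequote=False, should work the same way.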

If, as @exe suggests, your CSV parser also requires commas inside values to be escaped, you can apply a similar regex to replace the commas.
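A possible sketch of that comma step in Python, starting from the two backslash-escaped rows. It assumes every field is quoted, so a delimiter comma always sits between two quotes and any other comma is part of a value; files with unquoted numeric columns would need a stricter pattern. ROWS and INNER_COMMA are names invented for this example.

```python
import csv
import io
import re

# Rows copied from the backslash-escaped output (quotes already escaped).
ROWS = (
    r'"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"' "\n"
    r'"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""' "\n"
)

# Assumption: all fields are quoted, so a comma that is not both preceded
# and followed by a quote must be part of a value.
INNER_COMMA = re.compile(r'(?<!"),|,(?!")')

# The template \\, produces a backslash followed by a comma.
with_commas = INNER_COMMA.sub(r"\\,", ROWS)
print(with_commas, end="")

# With an escape character configured, "\," is read back as a plain comma.
rows = list(
    csv.reader(io.StringIO(with_commas), escapechar="\\", doublequote=False)
)
for row in rows:
    print(len(row), row[-1])
```

Commas inside a properly quoted field normally need no escaping, so this step only matters for parsers that insist on it.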

Answer 1 (score: 0)

If I understand correctly, what you need is to escape the quotes and commas before pandas reads the CSV.

Like this:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""