Question

I am trying to read a CSV file with pandas read_csv. The data looks like this (example):

thing;weight;price;colour
apple;1;2;red
m &amp; m's;0;10;several
cherry;0,5;2;dark red

Because of the HTML-escaped ampersand, pandas sees 5 fields in the second row. How can I make sure that this content is read correctly?

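The example here is very similar to my data: the separator is ";", there are no string quotes, and the encoding is cp1251. The data I receive is quite large, and reading it must happen in one step (meaning no preprocessing outside of Python).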

I could not find anything about this in the pandas docs (I am using pandas 0.19 and Python 3.5.1). Any suggestions? Thanks in advance.

Answer 1

Unescape the HTML character references:

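import html

# Read the cp1251 file, unescape the HTML character references, and save the result as UTF-8
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    content = html.unescape(f.read())
    g.write(content)
print(content)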
# thing;weight;price;colour
# apple;1;2;red
# m & m's;0;10;several
# cherry;0,5;2;dark red

Then load the CSV in the usual way:

import pandas as pd
df = pd.read_csv('data-fixed.csv', sep=';')
print(df)

yields

     thing weight  price    colour
0    apple      1      2       red
1  m & m's      0     10   several
2   cherry    0,5      2  dark red
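
To see why the unescaping matters, here is a minimal sketch using only the problem row from the question. The variable names are mine; the point is that the ";" at the end of "&amp;" is counted as a separator until the reference is unescaped.

```python
import html

raw = "m &amp; m's;0;10;several"

# Before unescaping, the ';' that terminates '&amp;' splits the first field in two
fields_before = raw.split(";")
print(len(fields_before), fields_before)

# After unescaping, only the real separators remain
fields_after = html.unescape(raw).split(";")
print(len(fields_after), fields_after)
```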

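Although the data file is "pretty large", you appear to have enough memory to read it into a DataFrame. You should therefore also have enough memory to read the file into a single string with f.read(). Making one call to html.unescape is faster than calling html.unescape on many smaller strings. This is why I suggest using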

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    content = html.unescape(f.read())
    g.write(content)

instead of

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    for line in f:
        g.write(html.unescape(line))
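
If you want to check the speed difference on your own machine, a rough timing sketch might look like the following. The synthetic text and the two helper functions are assumptions made for illustration, not part of the original data, and the actual timings will vary by system.

```python
import html
import timeit

# Synthetic stand-in for the real file: the same escaped row repeated many times
line = "m &amp; m's;0;10;several\n"
text = line * 20000

def unescape_whole():
    # One call on the whole string
    return html.unescape(text)

def unescape_by_line():
    # One call per line, then join the pieces back together
    return "".join(html.unescape(l) for l in text.splitlines(keepends=True))

print("whole string:", timeit.timeit(unescape_whole, number=3))
print("line by line:", timeit.timeit(unescape_by_line, number=3))
```

Both functions produce the same text here; only the number of calls differs.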

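If you need to read this data file more than once, it pays to fix it (and save it to disk) so that you do not need to call html.unescape every time you want to parse the data. That is why I suggest writing the unescaped content to data-fixed.csv.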

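If reading this data is a one-time task and you want to avoid the performance or resource cost of writing to disk, you can use StringIO (an in-memory file-like object):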

from io import StringIO
import html
import pandas as pd

with open('data.csv', 'r', encoding='cp1251') as f:
    content = html.unescape(f.read())
df = pd.read_csv(StringIO(content), sep=';')
print(df)

Answer 2

You can use a regular expression as the separator for pandas.read_csv. In your specific case you could try:

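import pandas as pd

# A regex separator is only supported by the Python parsing engine
df = pd.read_csv("test.csv", sep=r"(?<!&amp);", engine="python")
print(df)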
#         thing weight  price    colour
#0        apple      1      2       red
#1  m &amp; m's      0     10   several
#2       cherry    0,5      2  dark red

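This matches every ; that is not preceded by &amp, and it can be extended to other escaped characters. Note that the values themselves stay escaped (m &amp; m's), as the output shows.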
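
As a rough sketch of how the separator could be extended to more entities: Python lookbehinds must be fixed-width, so one lookbehind per entity is chained instead of using a single alternation. The ENTITIES list, the in-memory sample, and the unescape_cell helper are my own assumptions for illustration, and numeric references such as &#34; are not covered by this pattern.

```python
import html
from io import StringIO

import pandas as pd

# Named entities to protect (an assumed list, extend it to match your data)
ENTITIES = ["amp", "lt", "gt", "quot"]

# Chain one fixed-width negative lookbehind per entity, then match the ';'
SEP = "".join("(?<!&{})".format(name) for name in ENTITIES) + ";"

# Small in-memory sample standing in for the real file
sample = (
    "thing;weight;price;colour\n"
    "apple;1;2;red\n"
    "m &amp; m's;0;10;several\n"
    "a &lt; b;0,5;2;dark red\n"
)

df = pd.read_csv(StringIO(sample), sep=SEP, engine="python")

def unescape_cell(value):
    # Only strings can contain character references; leave other values alone
    return html.unescape(value) if isinstance(value, str) else value

# Unescape the parsed values column by column
df = df.apply(lambda col: col.map(unescape_cell))
print(df)
```

Unlike the plain regex approach, this also unescapes the parsed values, at the cost of a Python-level pass over every cell.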

Reading a CSV file with Python (pandas) when HTML-escaped strings are present
