How to read unicode from a CSV into a DataFrame

Time: 2017-01-17 20:25:48

Tags: csv pandas unicode

When I read a CSV file with pandas.read_csv, I get this string:

'_\xf4\xd6_'

I cannot normalize it (drop the non-ASCII characters):

>>> '_\xf4\xd6_'.encode("ascii","ignore")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 1: ordinal not in range(128)

What I want is:

>>> u'_\xf4\xd6_'.encode("ascii","ignore")
'__'
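
The traceback above reports a decode error even though encode was called. In Python 2, calling encode on a byte string first decodes it implicitly with the ascii codec, and that implicit step is what fails. A minimal sketch of doing the decode explicitly, assuming the bytes are Latin-1 (the question does not state the file's encoding, so that choice is a guess):

```python
# Byte string as it came out of the CSV (a plain str in Python 2).
raw = b"_\xf4\xd6_"

# Assumption: the bytes are Latin-1; substitute the file's real encoding.
text = raw.decode("latin-1")

# encode() now starts from real text, and "ignore" drops non-ASCII characters.
ascii_only = text.encode("ascii", "ignore")
```

With a different source encoding the decoded text would differ, but the non-ASCII characters would still be dropped by the final step.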

IOW, I need to either

  • tell pandas.read_csv to read strings as unicode, or
  • somehow convert str to unicode myself.

How do I do that?

PS. For completeness, here is the code (cf. Get non-null elements in a pandas DataFrame):

import re

import pandas as pd

def normalize(s):
    "Clean-up the string: drop non-ASCII, normalize whitespace."
    return re.sub(r"\s+"," ",s,flags=re.UNICODE).encode("ascii","ignore")

df = pd.read_csv("foo.csv",low_memory=False)
# stack() is a method and has to be called; my_cols is defined elsewhere
my_strings = [normalize(s) for s in df[my_cols].stack().tolist()]

PPS. I have no control over the contents of the CSV file (i.e., I cannot fix the problem by writing the CSV file "correctly").
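
The first option in the question, having pandas.read_csv decode the text itself, might look like the sketch below. It uses the encoding parameter of read_csv. The in-memory bytes stand in for foo.csv, the column name "name" is made up for the example, and latin-1 is only an assumption about the file's encoding.

```python
import io

import pandas as pd

# Stand-in for foo.csv: a header row plus one value with the bytes 0xF4 0xD6.
raw = b"name\n_\xf4\xd6_\n"

# Assumption: the file is Latin-1 encoded; pass the real encoding if known.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

# The cell is now decoded text, so dropping non-ASCII characters works.
value = df["name"].iloc[0]
ascii_only = value.encode("ascii", "ignore").decode("ascii")
```

Because the strings arrive already decoded, normalize can call encode("ascii","ignore") on them without hitting the implicit ascii decode.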

1 answer:

Answer 0: (score: 0)

An alternative is to write normalize using bytearray.
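
The answer's own code is not shown here, so the following is only a guess at what a bytearray-based normalize could look like, not the answerer's implementation. It filters the raw bytes instead of decoding them, so it needs no knowledge of the file's encoding. The parameter name raw is made up for the sketch.

```python
import re

def normalize(raw):
    """Drop non-ASCII bytes from a byte string and collapse whitespace runs."""
    # Iterating over a bytearray yields integers, so bytes >= 128 are easy to skip.
    kept = bytearray(b for b in bytearray(raw) if b < 128)
    # Only ASCII bytes remain, so this decode cannot fail.
    text = kept.decode("ascii")
    return re.sub(r"\s+", " ", text)

cleaned = normalize(b"_\xf4\xd6_")
```

One caveat with this approach: for a multi-byte encoding such as UTF-8, every byte of a non-ASCII character is at least 0x80, so the whole character is dropped, which matches what encode("ascii","ignore") would do after a correct decode.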

Is this the right solution?