Question

When I read a CSV file with pandas.read_csv, I get this string:

'_\xf4\xd6_'

which I cannot normalize (that is, strip the non-ASCII characters from):

>>> '_\xf4\xd6_'.encode("ascii","ignore")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 1: ordinal not in range(128)

What I want is:

>>> u'_\xf4\xd6_'.encode("ascii","ignore")
'__'

IOW, I need to either

tell pandas.read_csv to read the strings as unicode, or
somehow convert the str to unicode myself.

How do I do that?

PS. For completeness, here is the code (cf. Get non-null elements in a pandas DataFrame):

import re

import pandas as pd

def normalize(s):
    "Clean-up the string: drop non-ASCII, normalize whitespace."
    return re.sub(r"\s+"," ",s,flags=re.UNICODE).encode("ascii","ignore")

df = pd.read_csv("foo.csv",low_memory=False)
# my_cols: the column names of interest (defined elsewhere)
my_strings = [normalize(s) for s in df[my_cols].stack().tolist()]

PPS. I have no control over the contents of the CSV file (i.e., I cannot fix the problem by writing the CSV file "correctly").

Answer 1

Here is an alternative normalize using bytearray:
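The body of that function is not preserved in this copy of the post. As a rough sketch only, and not necessarily what the poster wrote, a bytearray-based normalize could look like the following. It assumes the input is a byte string (a Python 2 str, or bytes in Python 3) and filters on raw byte values, so no decoding step is involved:

```python
import re

def normalize(raw):
    """Drop non-ASCII bytes, then collapse whitespace runs to one space."""
    # bytearray iterates as integers 0..255 in both Python 2 and Python 3;
    # keep only the 7-bit ASCII range.
    ascii_only = bytes(bytearray(b for b in bytearray(raw) if b < 128))
    # Collapse any run of ASCII whitespace into a single space.
    return re.sub(br"\s+", b" ", ascii_only)

print(normalize(b"_\xf4\xd6_"))
```

Because this works on bytes, each byte of a multi-byte character is dropped individually, which is harmless when the goal is pure ASCII output.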

Is this the right solution?
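For the second option raised in the question (converting the str to unicode yourself), a minimal sketch is to decode with an explicit codec before encoding to ASCII. The helper name to_ascii and the default of Latin-1 are assumptions made for illustration; the actual encoding of the CSV file is not stated in the post:

```python
def to_ascii(raw, encoding="latin-1"):
    """Decode raw bytes with an explicit codec, then drop non-ASCII characters."""
    # Explicit decode avoids the implicit ASCII decode that raises
    # UnicodeDecodeError in Python 2 when calling str.encode directly.
    text = raw.decode(encoding)
    # The "ignore" error handler silently drops characters outside ASCII.
    return text.encode("ascii", "ignore")

print(to_ascii(b"_\xf4\xd6_"))
```

pandas.read_csv also accepts an encoding argument, so passing the file's real encoding there should give unicode strings directly and make the manual decode unnecessary.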

How to read unicode from CSV into a DataFrame
