When I read a CSV file with pandas.read_csv, I get this string:
'_\xf4\xd6_'
I cannot normalize it (i.e., drop the non-ASCII characters):
>>> '_\xf4\xd6_'.encode("ascii","ignore")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 1: ordinal not in range(128)
What I want is:
>>> u'_\xf4\xd6_'.encode("ascii","ignore")
'__'
IOW, I need either pandas.read_csv to read the strings as unicode, or to convert the str to unicode myself. How do I do that?
PS. For completeness, here is the code (cf. Get non-null elements in a pandas DataFrame):
import re
import pandas as pd

def normalize(s):
    "Clean-up the string: drop non-ASCII, normalize whitespace."
    return re.sub(r"\s+", " ", s, flags=re.UNICODE).encode("ascii", "ignore")

df = pd.read_csv("foo.csv", low_memory=False)
# my_cols: the columns of interest (defined elsewhere)
my_strings = [normalize(s) for s in df[my_cols].stack().tolist()]
PPS. I have no control over the contents of the CSV file (i.e., I cannot solve the problem by writing the CSV file "correctly").
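For illustration, the second option (converting the str to unicode by hand before dropping non-ASCII characters) might look like the sketch below. The name normalize_decoded is hypothetical, and the "latin-1" default is only an assumption about the file; the real encoding of foo.csv is not known here. pandas.read_csv also accepts an encoding parameter, which would cover the first option if the encoding were known.

```python
import re

def normalize_decoded(raw, encoding="latin-1"):
    """Decode bytes if needed, collapse whitespace, drop non-ASCII."""
    # Assumption: the data is Latin-1; substitute the real encoding if known.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text, flags=re.UNICODE)
    # Encoding to ASCII with "ignore" silently drops every non-ASCII character.
    return text.encode("ascii", "ignore")

print(normalize_decoded(b"_\xf4\xd6_"))
```

Decoding first avoids the implicit ASCII decode that raises UnicodeDecodeError in the traceback above.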
Answer 0 (score: 0)
An alternative is to write normalize using bytearray.
Is this the right solution?
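The answer's original code is not available, so the following is only a sketch of what a bytearray-based normalize could look like, not the answerer's implementation. The name normalize_bytes is hypothetical, and the input is assumed to be a byte string in an ASCII-compatible encoding such as Latin-1 or UTF-8.

```python
import re

def normalize_bytes(raw):
    """Keep only ASCII bytes, then collapse runs of whitespace."""
    # Iterating over a bytearray yields integers on both Python 2 and 3.
    ascii_only = bytearray(b for b in bytearray(raw) if b < 128)
    # Use a bytes pattern so no decoding step is involved at all.
    return re.sub(br"\s+", b" ", bytes(ascii_only))

print(normalize_bytes(b"_\xf4\xd6_"))
```

This filters at the byte level, so it never triggers an implicit ASCII decode and does not need to know the file's encoding. The trade-off is that the result stays a byte string and is never converted to unicode.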