Question

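I have a file in which some lines contain Unicode text, for example: "Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."

I want to remove characters such as \xe2\x80\x99.

I can remove them when I declare a string containing these characters, but my solutions do not work when the text is read from a CSV file. I use pandas to read the file.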

Solutions tried: 1. Regex 2. Decoding and encoding 3. Lambda

Lambda solution

# Keep only printable ASCII characters (code points 32 to 126).
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)

Encoding solution

code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)

How I read the file

import re
import pandas

df = pandas.read_csv('file.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(stripped(row['text']))
    print(re.sub(r'[^\x00-\x7f]', r'', row['text']))
    print(row['text'].encode('ascii', 'ignore').decode("utf-8"))

Suggested method

df = pandas.read_csv('file.csv', encoding="utf-8")

for index, row in df.iterrows():
    en = row['text'].encode()
    print(type(en))
    newline = en.decode('utf-8')
    print(type(newline))
    print(repr(newline))
    print(newline.encode('ascii', 'ignore'))
    print(newline.encode('ascii', 'replace'))

Answer 1

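Your string is valid UTF-8, so it can be decoded directly into a Python string.

You can then encode it to ASCII with str.encode(). Passing 'ignore' drops the non-ASCII characters.

'replace' is also possible; it substitutes a question mark for each of them: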

line_raw = b'Who\xe2\x80\x99s he?'

line = line_raw.decode('utf-8')  # bytes -> str
print(repr(line))

print(line.encode('ascii', 'ignore'))   # drop non-ASCII characters
print(line.encode('ascii', 'replace'))  # substitute '?' for each one

Output:

'Who’s he?'
b'Whos he?'
b'Who?s he?'

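Returning to your original question: your third attempt (code3) was right, only the order was wrong. When line holds bytes, decode first and then encode:

code3 = line.decode("utf-8").encode('ascii', 'ignore')
print(code3)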
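
The order matters because, in Python 3, decode() exists only on bytes and encode() only on str. A minimal sketch of a helper that accepts either type; the name to_ascii is hypothetical and not part of the original question:

```python
def to_ascii(value):
    """Return value as an ASCII-only str, dropping any non-ASCII characters."""
    # bytes must be decoded to str first; str has no decode() in Python 3.
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    # Encode to ASCII bytes, ignoring unencodable characters, then back to str.
    return value.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(b'Who\xe2\x80\x99s he?'))  # bytes input
print(to_ascii('Who\u2019s he?'))         # str input
```

Both calls go through the same encode step; only the bytes input needs the extra decode beforehand.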

Finally, here is a working pandas example:

import pandas

df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(row['text'].encode('ascii', 'ignore'))

There is no need to call decode('utf-8'), because pandas does that for you when it reads the file.

And if your Python string contains non-ASCII characters, you can strip them like this:

text = row['text'].encode('ascii', 'ignore').decode('ascii')

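This converts the text to ASCII bytes, dropping every character that cannot be represented in ASCII, and then converts the result back to text.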
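
The same round trip can be applied to a whole column instead of row by row. A sketch using the Series.str.encode and Series.str.decode accessors; the in-memory CSV data and the ascii_text column name are assumptions made for illustration:

```python
import io

import pandas

# Stand-in for the CSV file: UTF-8 bytes with one 'text' column (illustrative data).
csv_bytes = 'text\n"Who\u2019s he?"\n"plain ascii"\n'.encode('utf-8')
df = pandas.read_csv(io.BytesIO(csv_bytes), encoding='utf-8')

# str -> ASCII bytes (non-ASCII dropped) -> str, for every row at once.
df['ascii_text'] = (
    df['text'].str.encode('ascii', errors='ignore').str.decode('ascii')
)
print(df['ascii_text'].tolist())
```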

You should read up on the difference between strings and bytes in Python 3; I hope that clears things up for you.
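
As a starting point for that distinction, the following sketch shows that the right single quotation mark (U+2019) is one character in a str but three bytes in UTF-8, which is where the \xe2\x80\x99 sequence in the question comes from. The variable names are illustrative only:

```python
text = 'Who\u2019s he?'      # str: a sequence of Unicode characters
data = text.encode('utf-8')  # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))  # type name and number of characters
print(type(data).__name__, len(data))  # type name and number of bytes
print('\u2019'.encode('utf-8'))        # the bytes behind the single character
```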
