Question

如何在Python中使用re替换unicode值？我正在寻找这样的东西：

line.replace('Ã','')
line.replace('¢','')
line.replace('Ã¢','')

或者有什么方法可以替换文件中的所有非ASCII字符。实际上我把PDF文件转换成ASCII，我得到了一些非ASCII字符[例如PDF中的子弹]

请帮帮我。

Answer 1

如果你有

，你想要替换

title.decode('latin-1').encode('utf-8')

或者如果你想忽略

unicode(title, errors='replace')

Answer 2

在评论中反馈后进行修改。

另一个解决方案是检查每个字符的数值，看看它们是否在128以下，因为ascii从0到127.如下所示：

# coding=utf-8

def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        if(ord(char) < 128):
            asciiText = asciiText + char

    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

这是jd的基准测试答案的更改版本：

# coding=utf-8

def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if(isinstance(text, str)):
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        return text.encode("ascii", "ignore")        

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

使用str字符串作为输入输出第一个解决方案：

computer:~ Ancide$ python test1.py
Time taken: 5.88719677925

使用unicode字符串作为输入输出第一个解决方案：

computer:~ Ancide$ python test1.py
Time taken: 7.21077990532

使用str字符串作为输入输出第二个解决方案：

computer:~ Ancide$ python test1.py
Time taken: 2.67580914497

使用unicode字符串作为输入输出第二个解决方案：

computer:~ Ancide$ python test1.py
Time taken: 1.740680933

结论

编码是更快的解决方案，编码字符串的代码更少;因此更好的解决方案。

Answer 3

您必须将Unicode字符串编码为ASCII，忽略发生的任何错误。方法如下：

>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'

Answer 4

尝试将re.UNICODE标志传递给params。像这样：

re.compile("pattern", re.UNICODE)

有关详细信息，请参阅manual页。

如何在Python中使用re替换Unicode值？

4 个答案:

结论