Question

I have looked all over for a ready-made solution, but I could not find one for the use case I am facing.

Use case

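I am building a website QA test script that will go through a large number of HTML documents and identify any rogue characters. I can't use a pure non-ASCII approach, because the HTML documents contain characters such as ">" and other minor characters. So I am building a Unicode rainbow dictionary that identifies some of the common non-ASCII characters my team and I see frequently. Below is my Python code.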

# -*- coding: utf-8 -*-

import re

unicode_rainbow_dictionary = {
    u'\u00A0':' ',
    u'\uFB01':'fi',
}

strings = ["This contains the annoying non-breaking space","This is fine!","This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print "Epic fail! There is a rogue character in '"+string+"'"
        else:
            print string

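The problem here is that the last string in the strings array contains a non-ASCII ligature (the combined fi). When I run this script, it does not catch the ligature, but it does catch the non-breaking space character in the first case.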

What is causing this false negative?

Answer 1

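As @jgfoot points out, use Unicode strings for all text. The simplest way is to use from __future__ to make string literals Unicode by default. In addition, using print as a function makes the code compatible with both Python 2 and Python 3: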

# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re

unicode_rainbow_dictionary = {
    '\u00A0':' ',
    '\uFB01':'fi',
}

strings = ["This contains the\xa0annoying non-breaking space","This is fine!","This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print("Epic fail! There is a rogue character in '"+string+"'")
        else:
            print(string)
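
One side effect of the nested loop above is that each string is tested once per dictionary entry, so a clean string is printed once for every key. A minimal Python 3 sketch of an alternative, assuming it is acceptable to combine all keys into a single character class; the helper name find_rogue_characters is hypothetical and not part of the original answer:

```python
import re

unicode_rainbow_dictionary = {
    "\u00A0": " ",
    "\uFB01": "fi",
}

# One character class covering every dictionary key, compiled once.
rogue_pattern = re.compile(
    "[" + "".join(re.escape(ch) for ch in unicode_rainbow_dictionary) + "]"
)


def find_rogue_characters(text):
    """Return the rogue characters found in text, in order of appearance."""
    return rogue_pattern.findall(text)


strings = [
    "This contains the\u00A0annoying non-breaking space",
    "This is fine!",
    "This is not \uFB01ne!",
]

for string in strings:
    rogues = find_rogue_characters(string)
    if rogues:
        # ascii() escapes non-ASCII characters so invisible ones show up in the output.
        print("Rogue character(s) " + ascii(rogues) + " in " + ascii(string))
    else:
        print(string)
```

With this shape each string produces exactly one line of output, whether or not it contains a rogue character.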

Answer 2

If you can, switch to Python 3 as soon as possible! Python 2 is not good at handling Unicode, whereas Python 3 handles it natively.

for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")

I could not get the non-breaking space into my test. I worked around this by using "This contains the annoying" + chr(160) + "non-breaking space", after which it matched.
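
Invisible characters such as the non-breaking space are easily lost when code is copied, so one option is to write them as escape sequences and to report matches by code point and name. A small Python 3 sketch, assuming the same dictionary as in the question; the helper describe_rogue_characters is a hypothetical name:

```python
import unicodedata

unicode_rainbow_dictionary = {"\u00A0": " ", "\uFB01": "fi"}


def describe_rogue_characters(text):
    """List (code point, Unicode name) for each rogue character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ch in unicode_rainbow_dictionary
    ]


strings = [
    "This contains the\u00A0annoying non-breaking space",
    "This is fine!",
    "This is not \uFB01ne!",
]

for string in strings:
    for code_point, name in describe_rogue_characters(string):
        # ascii() keeps the offending character visible as an escape sequence.
        print(code_point + " " + name + " in " + ascii(string))
```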

Answer 3

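Your code does not work because, in the "strings" variable, you have Unicode characters inside non-Unicode strings. You forgot to put a "u" in front of them to indicate that they should be treated as Unicode strings. So when you search for a Unicode string inside a non-Unicode string, it does not work as expected.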

If you change it to:

strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]

it works as expected.

Getting rid of Unicode headaches like this one is a major benefit of Python 3.
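
To make the byte/text mismatch concrete, the following Python 3 sketch shows the UTF-8 bytes that a Python 2 byte string would hold for each character, assuming the source file is saved as UTF-8 as its coding declaration says. It also hints at why the non-breaking space may have matched by accident: its code point 0xA0 happens to appear as one of its own UTF-8 bytes, whereas 0xFB01 is too large to equal any single byte. That last point is an interpretation, not something stated in this answer.

```python
nbsp = "\u00A0"  # NO-BREAK SPACE
fi = "\uFB01"    # LATIN SMALL LIGATURE FI

# What a Python 2 byte string literal would contain (assuming a UTF-8 source file).
nbsp_bytes = nbsp.encode("utf-8")
fi_bytes = fi.encode("utf-8")

print(nbsp_bytes)  # two bytes, the second of which is 0xA0
print(fi_bytes)    # three bytes, none of which can equal 0xFB01

# The code point 0xA0 also occurs as a byte value in the encoded form.
nbsp_code_point_is_a_byte = ord(nbsp) in nbsp_bytes

# A code point above 0xFF can never equal a single byte value.
fi_code_point_fits_in_byte = ord(fi) <= 0xFF

print(nbsp_code_point_is_a_byte, fi_code_point_fits_in_byte)
```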

Here is another way to approach the problem. How about simply trying to encode the string as ASCII and catching the error if that fails?

def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False

strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]

for s in strings:
    print(repr(is_this_ascii(s)))

##False
##True
##False
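
The function above relies on unicode, which exists only in Python 2. A rough Python 3 equivalent, which also reports where the first non-ASCII character sits, might look like the following; first_non_ascii_index is a hypothetical helper name:

```python
def first_non_ascii_index(s):
    """Return the index of the first non-ASCII character in s, or None."""
    try:
        s.encode("ascii")
        return None
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed to encode.
        return exc.start


strings = [
    "This contains the\u00A0annoying non-breaking space",
    "This is fine!",
    "This is not \uFB01ne!",
]

for s in strings:
    print(repr(first_non_ascii_index(s)))
```

As the question notes, a pure ASCII check flags every non-ASCII character, so this is only suitable when no legitimate non-ASCII characters are expected in the documents.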

Custom flagging of non-ASCII characters
