Question

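I have spent a few frustrating hours this morning trying to process strings extracted from web pages. I can't find a consistent way to lowercase the extracted strings so that I can check them for keywords, and it is driving me round the bend.

Here is a snippet of code that retrieves text from a DOM element: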

temp = i.find('div', 'foobar').find('div')
if temp is not None and temp.contents is not None:
    temp2 = whitespace.sub(' ', temp.contents[0])
    content = str(temp2)

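This raises the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 150: ordinal not in range(128)

I also tried the following statements; none of them worked, i.e. they all resulted in the same error: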

content = (str(temp2)).decode('utf-8').lower()
content = str(temp2.decode('utf-8')).lower()

Does anyone know how to convert the text contained in a BeautifulSoup Tag to lowercase ASCII, so that I can do a case-insensitive search for keywords?

Answer 1

You may want ASCII, but you need Unicode, and there is a very good chance you already have it. XML parsers return unicode objects.
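As a rough illustration of where the error comes from: in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which cannot represent u'\xa0'. The sketch below makes that encode step explicit so it behaves the same way under Python 3; the sample string and the variable names are made up for the example.

```python
# Hypothetical sample text containing a NO-BREAK SPACE (U+00A0).
text = u"price:\xa0100"

try:
    # Python 2's str(text) does this implicitly; here it is explicit.
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # The ASCII codec only covers code points 0..127, so U+00A0 fails.
    error_message = str(exc)

print(error_message)
```

Keeping the value as unicode and never calling str() on it avoids this encode step entirely.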

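First, do print type(temp2). It should be unicode, unless something unfortunate has happened in that whitespace.sub() call; what is that?

If you want to normalize runs of multiple whitespace characters to a single space, do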

temp2 = u' '.join(temp.contents[0].split())

This will make that pesky u'\xa0' go away, because it is whitespace (NO-BREAK SPACE).

Then try content = temp2.lower()
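Put together, the two steps (collapse whitespace, then lowercase) might look like the minimal sketch below. The helper names normalize_text and contains_keyword and the sample string are hypothetical, not part of BeautifulSoup; the code is written so that it should also run under Python 3, where every str is already Unicode.

```python
def normalize_text(text):
    """Collapse runs of whitespace into single spaces and lowercase.

    split() with no arguments splits on any Unicode whitespace,
    which includes NO-BREAK SPACE (u'\\xa0').
    """
    return u" ".join(text.split()).lower()


def contains_keyword(text, keyword):
    """Case-insensitive, whitespace-insensitive keyword check (hypothetical helper)."""
    return normalize_text(keyword) in normalize_text(text)


# Stand-in for temp.contents[0]; assumed to be a Unicode string.
sample = u"Hello\xa0\xa0World \n FooBar"
content = normalize_text(sample)
print(content)
print(contains_keyword(sample, u"FOOBAR"))
```

Note that the text stays Unicode throughout; nothing here converts it to ASCII, so the implicit encode that triggered the error never happens.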

BeautifulSoup Tag, strings, and UnicodeEncodeError: not so beautiful
