Question

This is how I fetch the data:

import bs4
import requests

page = requests.get('some website')
data = bs4.BeautifulSoup(page.content, "lxml")

I am using this for unescaping:

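from xml.sax.saxutils import unescape

# Map each quote character to the entity that escapes it.
html_escape_table = {'"': "&quot;", "'": "&apos;"}
# Invert the table so that entities map back to characters.
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_unescape(text):
    # unescape() handles &amp;, &lt; and &gt; itself; the table adds the quote entities.
    return unescape(text, html_unescape_table)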

When I call unescape on any part of data (which I believe is a string), it does not unescape the text as it should. Instead, it just returns the same string I passed in (e.g. \u00e8).

However, when I call html_unescape() with a string I typed in myself (e.g. html_unescape('\u00e8')), it works fine.

Why doesn't it work when I pass in a piece of string taken from the data I get from BeautifulSoup?
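
One possible reading of the behaviour described above, sketched under two assumptions that are not confirmed by the question: that the code runs on Python 3, and that the scraped text contains a literal backslash followed by u00e8 (the variable names typed and scraped are made up for illustration). In that case the typed literal is already the character è before html_unescape() ever sees it, while unescape() itself only rewrites &amp;, &lt;, &gt; and the entities in the table:

```python
from xml.sax.saxutils import unescape

html_unescape_table = {"&quot;": '"', "&apos;": "'"}

def html_unescape(text):
    return unescape(text, html_unescape_table)

# Python 3 resolves this escape while parsing the literal: one character, è.
typed = '\u00e8'
# Assumption: the scraped text holds six literal characters, backslash + "u00e8".
scraped = '\\u00e8'

print(len(typed), len(scraped))
# Neither string contains an entity, so both come back unchanged.
print(html_unescape(typed) == typed)
print(html_unescape(scraped) == scraped)
# Entities, by contrast, are replaced.
print(html_unescape('&quot;a&apos; &amp; &lt;b&gt;'))
```

If that assumption holds, the typed call only appears to work: the output looks unescaped because the literal was converted by the Python parser, not by unescape().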

Answer 1

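Standard Python 2 would print <type 'str'> rather than <class 'str'> (on Python 3, <class 'str'> is what the built-in str prints), so on Python 2 you must have received a custom str class. You need to track down where it comes from (requests? BeautifulSoup?) and see which operations it supports.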
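
As a rough sketch of how one might track down where a string class comes from, assuming Python 3: the helper describe() and the class CustomStr below are hypothetical names introduced only for illustration, with CustomStr standing in for whatever str subclass a library might return.

```python
class CustomStr(str):
    """Hypothetical stand-in for a library-defined str subclass."""


def describe(value):
    """Return the defining module, the class name, and whether it is exactly str."""
    cls = type(value)
    return cls.__module__, cls.__qualname__, cls is str


plain = "plain"
wrapped = CustomStr("wrapped")

print(describe(plain))
print(describe(wrapped))
# A subclass still passes isinstance checks, so it can look like a plain str.
print(isinstance(wrapped, str))
```

The module name returned by describe() points at the package that defines the class, which is where to look for the operations it supports.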
