Question

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Answer 1

Python 3.4+

HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake; it was finally removed in Python 3.9. Use html.unescape() instead:

import html
print(html.unescape('&pound;682m'))

See https://docs.python.org/3/library/html.html#html.unescape
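
As a quick check of what html.unescape() covers, here is a minimal sketch. The sample strings are made up for illustration; they spell the same pound sign as a named, a decimal and a hexadecimal character reference.

```python
import html

# Illustrative inputs (not from the question): the pound sign written as a
# named, a decimal and a hexadecimal character reference.
samples = ['&pound;682m', '&#163;682m', '&#xa3;682m']

# html.unescape() handles all three forms in one call per string.
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

All three inputs should decode to the same string, so there is no need to treat numeric references separately.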

Python 2.6-3.3

You can use the HTML parser from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

See http://docs.python.org/2/library/htmlparser.html

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
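
If you would rather not depend on six, the two approaches in this answer can be combined into one fallback chain. This is only a sketch under the assumption that a single module-level name (here called unescape) is all you need; it prefers html.unescape where it exists and only falls back to the deprecated method on older interpreters.

```python
try:
    # Python 3.4+: the supported function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the deprecated method to the same name as the modern function.
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```

Calling code then uses unescape(...) everywhere and never touches HTMLParser directly.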

Answer 2

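Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.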

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
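
For comparison, and not as a replacement for Beautiful Soup, the standard library parser can decode entities while it parses as well. The sketch below swaps in html.parser.HTMLParser with convert_charrefs=True (the default since Python 3.5); TextCollector is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode character
        # references in text before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<p>&pound;682m</p>")
collector.close()  # flush any text the parser is still buffering

text = "".join(collector.chunks)
print(text)
```

This only extracts text; it does none of the tree building or searching that Beautiful Soup provides.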

Answer 3

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

Answer 4

Beautiful Soup 4 allows you to set a formatter for your output.

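If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples: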

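from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
print(soup.prettify(formatter=None))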
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
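
To see why the unformatted output above is invalid, it can help to look at the round trip with the standard library alone. This is a minimal sketch using html.unescape and html.escape rather than Beautiful Soup's formatters: once the text is decoded it contains raw < and > (or a raw & in the URL), and those have to be escaped again before the string is placed back into markup.

```python
import html

# Decoding yields the human-readable text, including raw angle brackets.
text = html.unescape("Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;")
print(text)

# Re-escape before embedding the text in HTML again. Only &, <, > and
# quotes are escaped; the accented character is left alone.
safe = html.escape(text)
print(safe)

# The same applies to the ampersand in a query string.
url = html.escape("http://example.com/?foo=val1&bar=val2")
print(url)
```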

Answer 5

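I had a similar encoding issue and used the normalize() method. I was getting a Unicode error from the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this, and it worked...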

    import unicodedata
    import pandas as pd

The dataframe object can be whatever you like; let's call it table...

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

Encode the table data so that we can export it to an .html file in the templates folder (this can be any location you wish :))

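    # this is where the magic happens: decompose accented characters (NFKD),
    # then drop everything that cannot be represented in ASCII
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')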

Export the normalized string to an html file

    # html_data is a byte string after .encode(), so write it in binary mode
    with open("templates/home.html", "wb") as f:
        f.write(html_data)

Reference: unicodedata documentation
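
One caveat worth seeing in a small example: NFKD plus encode('ascii', 'ignore') is lossy. The sketch below uses a made-up sample string; accented letters lose their accents and characters with no ASCII decomposition, such as the pound sign, disappear. If the goal is ASCII-only HTML, the standard 'xmlcharrefreplace' error handler is one alternative that keeps every character as a numeric reference.

```python
import unicodedata

# Illustrative text (not from the answer): one accented letter, one symbol.
text = u"Sacr\u00e9 bleu: \u00a3682m"

# NFKD splits the accented letter into "e" plus a combining accent; 'ignore'
# then drops the accent and the pound sign, which has no ASCII form.
lossy = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
print(lossy)

# Alternative: keep the information as numeric character references.
entity_safe = text.encode("ascii", "xmlcharrefreplace")
print(entity_safe)
```

A browser renders the second byte string the same as the original text, while the first has silently lost the currency symbol.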

Answer 6

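This may not be relevant here, but to eliminate these HTML entities from an entire document you can do something like this (assume document = page; please forgive the sloppy code, and if you have ideas on how to make it better I'm all ears, as I'm very new to this).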

import re
import HTMLParser

regexp = "&.+?;"
h = HTMLParser.HTMLParser()  # create the parser once, outside the loop
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces the HTML entity with its unescaped value
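
On Python 3 the same idea can be written as a single pass. This is a sketch, not the answerer's code: ENTITY_RE and unescape_entities are hypothetical names, the pattern is deliberately stricter than "&.+?;" so that a stray ampersand followed later by a semicolon is not treated as an entity, and the sample page is invented. In most cases calling html.unescape on the whole document gives the same result without any regular expression.

```python
import html
import re

# Named (&pound;), decimal (&#163;) or hexadecimal (&#xa3;) references only.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(page):
    """Replace every entity match in one pass over the document."""
    return ENTITY_RE.sub(lambda m: html.unescape(m.group(0)), page)


# Illustrative document, assumed for this example.
page = "<p>Fish &amp; chips: &pound;5 or &#x20AC;6</p>"

result = unescape_entities(page)
print(result)

# The one-liner alternative decodes the whole string directly.
print(html.unescape(page))
```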

Decode HTML entities in a Python string?
