BeautifulSoup4 stripped_strings gives me bytes objects?

Asked: 2016-01-09 14:46:29

Tags: python python-2.7 unicode encoding beautifulsoup

I'm trying to get the text out of a blockquote that looks like this:

<blockquote class="postcontent restore ">
    01 Oyasumi
    <br></br>
    02 DanSin'
    <br></br>
    03 w.t.s.
    <br></br>
    04 Lovism
    <br></br>
    05 NoName
    <br></br>
    06 Gakkou
    <br></br>
    07 Happy☆Day
    <br></br>
    08 Endless End.
</blockquote>

This is what I tried in Python 2.7 (it couldn't decode the ☆ character, which is why I tried encoding):

soup = BeautifulSoup(r.text, "html5lib") #r is from a requests get request
content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
for line in content:
    print(line.encode("utf-8"))

This is what I get:

b'01 Oyasumi'
b"02 DanSin'"
b'03 w.t.s.'
b'04 Lovism'
b'05 NoName'
b'06 Gakkou'
b'07 Happy\xe2\x98\x86Day'
b'08 Endless End.'

What am I doing wrong?

1 answer:

Answer 0: (score: 1)

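The problem is that, when from_encoding is not specified, Beautiful Soup uses a sub-library called Unicode, Dammit to guess the document's original encoding and convert it to Unicode, and that guess can be wrong. See the Encodings section of the documentation for details.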
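To see why the guessed encoding matters, here is a minimal standard-library sketch. It does not use Beautiful Soup or reproduce its detection logic, and the variable names are illustrative. It decodes the same UTF-8 bytes once with the correct codec and once as windows-1252, a plausible wrong guess:

```python
# Illustrative only: plain codecs, not Beautiful Soup's detection logic.
raw = u"07 Happy\u2606Day".encode("utf-8")  # the bytes a UTF-8 page would contain

right = raw.decode("utf-8")         # correct codec: the star is one character
wrong = raw.decode("windows-1252")  # wrong codec: each of its 3 bytes becomes a character

print(repr(right))
print(repr(wrong))
print(len(right), len(wrong))       # the wrong decoding is two characters longer
```

The bytes are identical in both cases; only the codec used to interpret them differs, which is why telling the parser the right encoding is enough to fix the text.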

>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
...     01 Oyasumi
...     <br></br>
...     02 DanSin'
...     <br></br>
...     03 w.t.s.
...     <br></br>
...     04 Lovism
...     <br></br>
...     05 NoName
...     <br></br>
...     06 Gakkou
...     <br></br>
...     07 Happy☆Day
...     <br></br>
...     08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding 
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
... 
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.

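To fix this, you have two options:

  1. Pass in the correct from_encoding argument, or exclude the wrong encoding that Dammit guesses. One catch is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support exclude_encodings.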

    >>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
    >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
    >>> for line in content:
    ...     print(line)
    ... 
    01 Oyasumi
    02 DanSin'
    03 w.t.s.
    04 Lovism
    05 NoName
    06 Gakkou
    07 Happy☆Day
    08 Endless End.
    >>> 
    
  2. Use the lxml parser

    >>> soup = BeautifulSoup(doc, 'lxml')
    >>> soup.original_encoding
    'utf-8'
    >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
    >>> for line in content:
    ...     print(line)
    ... 
    01 Oyasumi
    02 DanSin'
    03 w.t.s.
    04 Lovism
    05 NoName
    06 Gakkou
    07 Happy☆Day
    08 Endless End.
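
As a side note on the output shown in the question: the b'...' form is how Python 3 displays a bytes object, which is what str.encode returns there. A small sketch of the difference, assuming Python 3 and using illustrative variable names:

```python
line = "07 Happy\u2606Day"       # text, like the items stripped_strings yields

encoded = line.encode("utf-8")   # bytes: printing this gives the b'...' form
print(type(encoded).__name__)    # bytes
print(encoded)                   # b'07 Happy\xe2\x98\x86Day'

# Decoding with the same codec recovers the original text unchanged.
print(encoded.decode("utf-8") == line)  # True
```

So if the goal is readable output, printing the text itself rather than its encoded bytes is usually what is wanted, provided the terminal can display the character.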