Question

I'm currently learning Python 3. I'm scraping a website for some data, which works fine, but when it comes to printing the p tags I can't get it to behave as I expect.

import urllib.request
from bs4 import BeautifulSoup

# placeholder URL; urlopen needs the scheme (http:// or https://)
data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px;'})
dialog = stat.findChildren('p')

childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # have also tried str(childtext.encode('utf-8', 'ignore'))

print(childlist)

Everything works, but the output is printed as "bytes":

b'This is a ptag.string'
b'\xc2\xa0'  (probably &nbsp;)
b'this is anotherone'

Real sample text with ascii encoding:

b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

Note that "Announcement" is in the p tag, and the rest is in a 'strong' tag under the p tag.

The same sample with utf-8 encoding:

b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "

I would like to get:

"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"

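As you can see, the incorrect characters are stripped with "ascii", but some of the &nbsp; entities break some of the line breaks, and I have yet to figure out how to print that correctly. The b's are also still there!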

I really can't figure out how to remove the b's and encode or decode properly. I have tried every "solution" I could find on Google.

HTML Content = utf-8

I would rather not change the full data before processing it, because that would mess up my other work, and I don't think it is needed.

Prettify does not work.

Any suggestions?

Answer 1

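First, you are getting output of the form b'stuff' because you are calling .encode(), which returns bytes objects. If you want to print strings for reading, keep them as strings!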
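A minimal sketch of the difference, using a made-up sample string in place of the real child.get_text() result (the string and the variable names encoded and decoded are assumptions for illustration):

```python
# Hypothetical sample standing in for child.get_text()
childtext = "Announcement\xa0\u2013\xa0Firefox users"

encoded = childtext.encode("utf-8", "ignore")  # str -> bytes
decoded = encoded.decode("utf-8")              # bytes -> str again

print([encoded])    # list items are shown via repr(), so the b'...' prefix appears
print([childtext])  # repr() of a str; '\xa0' is shown as an escape here
print(childtext)    # printing the str itself gives readable text
```

Note that printing a list shows the repr() of each item, so even with plain strings the non-breaking space appears as '\xa0' inside the list. Printing each item on its own, or joining the items with '\n'.join(...), is probably closer to the output asked for.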

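As a guess, I assume you are looking to print the strings from the HTML much as they would appear in a browser. For that you need to decode the HTML string encoding, as described in this SO answer, which for Python 3.5 means: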

import html
html.unescape(childtext)

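Among other things, this converts any &nbsp; sequences in the HTML string into '\xa0' characters, which print as spaces. However, if you want lines to break on these characters, even though &nbsp; literally means "non-breaking space", you have to replace them with actual spaces before printing, e.g. using x.replace('\xa0', ' ').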
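The two steps can be sketched together as follows. The fragment in raw is an invented example with the entities still escaped, not the asker's actual page content:

```python
import html

# Hypothetical raw fragment; real input would come from the scraped page
raw = "Announcement&nbsp;&ndash;&nbsp;Firefox users may encounter browser warnings"

text = html.unescape(raw)          # '&nbsp;' becomes '\xa0', '&ndash;' becomes '\u2013'
clean = text.replace("\xa0", " ")  # swap non-breaking spaces for ordinary spaces

print(clean)
```

Text returned by BeautifulSoup's get_text() usually has its entities decoded already, in which case html.unescape() leaves it unchanged and only the replace() step has a visible effect.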

BeautifulSoup4 does not print correctly. Python3
