Tags converted to HTML entities?

Date: 2015-06-23 21:30:20

Tags: python html parsing beautifulsoup html-parsing

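I am trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

First, the tree is missing most of the page. Second, tostring(tree) converts about half of the page's tags, such as <div>, into HTML entities such as &lt;/div&gt;. For example:

Original: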

<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;

Here is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks

1 answer:

Answer 0 (score: 1)

Use beautifulsoup4 with the very lenient html5lib parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
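
For anyone on Python 3, here is a minimal sketch of the same idea. It is not part of the original answer and rests on a few assumptions: bs4 is installed, the snippet from the question is parsed inline instead of fetching the forum URL, and the fallback to the built-in "html.parser" when html5lib is missing is my own addition. The helper name count_escaped_tags is hypothetical; it simply counts "&lt;" in the serialized tree, a rough way to check whether tags ended up as text.

```python
from bs4 import BeautifulSoup

# Prefer the lenient html5lib parser; fall back to the stdlib-backed
# "html.parser" if html5lib is not installed (assumption, not in the answer).
try:
    import html5lib  # noqa: F401
    PARSER = "html5lib"
except ImportError:
    PARSER = "html.parser"

# The snippet quoted in the question, used here instead of a network fetch.
SAMPLE = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)


def count_escaped_tags(markup, parser=PARSER):
    """Hypothetical helper: count '&lt;' occurrences in the serialized tree."""
    soup = BeautifulSoup(markup, parser)
    return str(soup).count("&lt;")


soup = BeautifulSoup(SAMPLE, PARSER)
span = soup.find("span", class_="time")

print(span.get_text())             # text of the <span class="time"> element
print(count_escaped_tags(SAMPLE))  # 0 means no tag was serialized as an entity
```

If the count is non-zero for a real page, the parser probably treated part of the markup as plain text, which is the symptom described in the question.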