I am trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017
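First, the tree is missing most of the page. Second, tostring(tree) converts tags such as <div> on half of the page into HTML entities such as &lt;/div&gt;. For example: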
Original:
<div class="smallfont" align="centre">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>
tostring(tree) gives:
&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
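To make the symptom concrete, here is a minimal sketch of what entity-escaped markup looks like. It uses Python 3's standard `html` module, not BeautifulSoup's tostring, so it only illustrates the escaping itself and is not a reproduction of the parser's behaviour. The variable names are made up for the illustration.

```python
import html

# A fragment of the original markup from the page.
original = '<span class="time">02:12 PM</span>'

# Escaping turns the angle brackets into entities, so a browser would
# display the tag as literal text instead of treating it as markup.
escaped = html.escape(original, quote=False)
print(escaped)

# Unescaping recovers the original markup.
restored = html.unescape(escaped)
print(restored)
```

If the output of a parse contains `&lt;` and `&gt;` where tags were expected, the parser has treated that part of the page as text rather than as elements.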
Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)
print soup
Thanks.
Answer 0 (score: 1)
Use beautifulsoup4 with the very lenient html5lib parser:
import urllib2
from bs4 import BeautifulSoup # NOTE: importing beautifulsoup4 here
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")
print soup
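For readers on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. It assumes beautifulsoup4 and html5lib are installed (`pip install beautifulsoup4 html5lib`). The helper names `parse_lenient` and `fetch_and_parse` and the inline sample are hypothetical, chosen only for this illustration. The fallback to the built-in `html.parser` is an added convenience, not part of the approach above.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from bs4 import BeautifulSoup, FeatureNotFound


def parse_lenient(markup):
    """Parse markup with html5lib, falling back to Python's built-in parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


def fetch_and_parse(url):
    """Download a page and parse it (needs network access; not called below)."""
    with urlopen(url) as response:
        return parse_lenient(response.read())


# Offline sample based on the snippet from the question,
# with the closing </div> deliberately left off to make it "dirty".
dirty = ('<div class="smallfont" align="centre">All times are GMT -4. '
         'The time now is <span class="time">02:12 PM</span>.')

soup = parse_lenient(dirty)
time_text = soup.find("span", class_="time").get_text()
print(time_text)
```

To try it against the real thread, call `fetch_and_parse("http://f10.5post.com/forums/showthread.php?t=1142017")`. Whether the page is still reachable is not something this sketch can guarantee.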