I have both BeautifulSoup 3.x and 4.x installed:
$ pip2 search beautifulsoup
beautifulscraper (1.1.0) - Python web-scraping library that wraps urllib2 and BeautifulSoup.
scrapy-beautifulsoup (0.0.2) - Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
ipython-beautifulsoup (0.3) - Custom rendering of beautifulsoup objects in IPython notebook and qtconsole
django-beautifulsoup-test (1.1.3) - TestCase class for using BeautifulSoup with Django tests
BeautifulSoup (3.2.1) - HTML/XML parser for quick-turnaround applications like screen-scraping.
INSTALLED: 3.2.1 (latest)
beautifulsoup4-slurp (0.0.2) - Slurp packages Beautifulsoup4 into command line.
beautifulsoup4 (4.6.0) - Screen-scraping library
INSTALLED: 4.6.0 (latest)
beautifulsoupselect (0.2) - Simple wrapper to integrate BeautifulSoup and soupselect.py in a single package
I have the following HTML in a text file (shortened for readability):
<div class="postarea">
<div align="center"><img src="http://images" width="500" height="120" /></div>
<p><info></info><br />
<strong>Season of Monsters</p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr /><strong>Test44</strong></p>
<p><singlelink></singlelink><br />
<strong><span style="color:#8E2323;">Test1| Test2| Test3</span></strong></p>
<p><a href="http://gregregrgreg" rel="nofollow">testesttest</a></p>
</div>
</html>
With BeautifulSoup 4 I use the following Python code:
#!/usr/bin/python2
from bs4 import BeautifulSoup
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

soup = BeautifulSoup(data)
postarea = soup.find('div', class_="postarea")
print postarea
Apart from the closing html tag, I expected to get back everything that is in that text file. What I actually get is:
<div class="postarea">
<div align="center"><img height="120" src="http://images" width="500"/></div>
<p><info></info><br/>
<strong>Season of Monsters</strong></p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr/><strong>Test44</strong></div>
It added a closing strong tag on line 4 of the HTML (<strong>Season of Monsters</p>) and closed the initial div right after <strong>Test44</strong>.
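From what I've read, BeautifulSoup 4 delegates the actual parsing (and the repair of broken markup) to whichever tree builder it finds first, so I suspect the repair behaviour depends on the parser that got picked. A small sketch to compare the available builders on the same file (just an idea; it assumes lxml and html5lib are installed, since only html.parser ships with Python):
#!/usr/bin/python2
from bs4 import BeautifulSoup

with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# Compare how each tree builder repairs the same malformed markup.
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(data, parser)
    print '===', parser, '==='
    print soup.find('div', class_="postarea")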
If I use this code with BeautifulSoup 3 instead:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

soup = BeautifulSoup(data)
postarea = soup.find('div', {"class": "postarea"})
print postarea
I get the result I want (although it, too, adds a closing strong tag on line 4).
How does BeautifulSoup actually grab the div? If it were simply looking for the matching closing div tag, I should be getting a different result.
How can I grab the whole div correctly?
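I also wondered whether feeding the parser only the part I care about changes anything; BeautifulSoup 4's SoupStrainer / parse_only mechanism seems to be the way to try that (untested sketch based on the BS4 docs; as far as I know it is ignored by the html5lib builder, and I don't know whether it affects the tree repair at all):
#!/usr/bin/python2
from bs4 import BeautifulSoup, SoupStrainer

with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# Parse only div elements whose class is "postarea" instead of the whole document.
only_postarea = SoupStrainer('div', attrs={'class': 'postarea'})
soup = BeautifulSoup(data, 'html.parser', parse_only=only_postarea)
print soup.find('div', class_="postarea")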
Thanks in advance,
Hachel