I have both BeautifulSoup 3.x and 4.x installed:
$ pip2 search beautifulsoup
beautifulscraper (1.1.0) - Python web-scraping library that wraps urllib2 and BeautifulSoup.
scrapy-beautifulsoup (0.0.2) - Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
ipython-beautifulsoup (0.3) - Custom rendering of beautifulsoup objects in IPython notebook and qtconsole
django-beautifulsoup-test (1.1.3) - TestCase class for using BeautifulSoup with Django tests
BeautifulSoup (3.2.1) - HTML/XML parser for quick-turnaround applications like screen-scraping.
INSTALLED: 3.2.1 (latest)
beautifulsoup4-slurp (0.0.2) - Slurp packages Beautifulsoup4 into command line.
beautifulsoup4 (4.6.0) - Screen-scraping library
INSTALLED: 4.6.0 (latest)
beautifulsoupselect (0.2) - Simple wrapper to integrate BeautifulSoup and soupselect.py in a single package
I have the following HTML in a text file (shortened for readability):
<div class="postarea">
<div align="center"><img src="http://images" width="500" height="120" /></div>
<p><info></info><br />
<strong>Season of Monsters</p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr /><strong>Test44</strong></p>
<p><singlelink></singlelink><br />
<strong><span style="color:#8E2323;">Test1| Test2| Test3</span></strong></p>
<p><a href="http://gregregrgreg" rel="nofollow">testesttest</a></p>
</div>
</html>
With BeautifulSoup 4 I use the following Python code:
#!/usr/bin/python2
from bs4 import BeautifulSoup
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

soup = BeautifulSoup(data)
postarea = soup.find('div', class_="postarea")
print postarea
Apart from the closing html tag, I expected to get back everything that is in that text file. What I actually get is:
<div class="postarea">
<div align="center"><img height="120" src="http://images" width="500"/></div>
<p><info></info><br/>
<strong>Season of Monsters</strong></p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr/><strong>Test44</strong></div>
It added a closing strong tag on line 4 of the HTML (<strong>Season of Monsters</p>) and closed the initial div right after <strong>Test44</strong>.
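From what I've read, BeautifulSoup 4 delegates the actual parsing (and the repair of broken markup) to whichever tree builder it finds first, so I suspect the repair behaviour depends on the parser that got picked. A small sketch to compare the available builders on the same file (just an idea; it assumes lxml and html5lib are installed, since only html.parser ships with Python):
#!/usr/bin/python2
from bs4 import BeautifulSoup

with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# Compare how each tree builder repairs the same malformed markup.
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(data, parser)
    print '===', parser, '==='
    print soup.find('div', class_="postarea")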
If I use this code with BeautifulSoup 3 instead:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

soup = BeautifulSoup(data)
postarea = soup.find('div', {"class": "postarea"})
print postarea
I get the result I want (although it, too, adds a closing strong tag on line 4).
How does BeautifulSoup actually grab the div? If it were simply looking for the matching closing div tag, I should be getting a different result.
How can I grab the whole div correctly?
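I also wondered whether feeding the parser only the part I care about changes anything; BeautifulSoup 4's SoupStrainer / parse_only mechanism seems to be the way to try that (untested sketch based on the BS4 docs; as far as I know it is ignored by the html5lib builder, and I don't know whether it affects the tree repair at all):
#!/usr/bin/python2
from bs4 import BeautifulSoup, SoupStrainer

with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# Parse only div elements whose class is "postarea" instead of the whole document.
only_postarea = SoupStrainer('div', attrs={'class': 'postarea'})
soup = BeautifulSoup(data, 'html.parser', parse_only=only_postarea)
print soup.find('div', class_="postarea")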
Thanks in advance,
Hachel