BeautifulSoup4 does not grab the div correctly (unlike BS3)

Time: 2017-11-10 11:31:50

Tags: python python-2.7 beautifulsoup

I have both BeautifulSoup 3.x and 4.x installed:

$ pip2 search beautifulsoup
beautifulscraper (1.1.0)                          - Python web-scraping library that wraps urllib2 and BeautifulSoup.
scrapy-beautifulsoup (0.0.2)                      - Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
ipython-beautifulsoup (0.3)                       - Custom rendering of beautifulsoup objects       in IPython notebook and qtconsole
django-beautifulsoup-test (1.1.3)                 - TestCase class for using BeautifulSoup with Django tests
BeautifulSoup (3.2.1)                             - HTML/XML parser for quick-turnaround applications like screen-scraping.
  INSTALLED: 3.2.1 (latest)
beautifulsoup4-slurp (0.0.2)                      - Slurp packages Beautifulsoup4 into command line.
beautifulsoup4 (4.6.0)                            - Screen-scraping library
  INSTALLED: 4.6.0 (latest)
beautifulsoupselect (0.2)                         - Simple wrapper to integrate BeautifulSoup and soupselect.py in a single package

I have the following HTML in a text file (shortened for readability):

<div class="postarea">
<div align="center"><img src="http://images" width="500" height="120" /></div>
<p><info></info><br />
<strong>Season of Monsters</p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr /><strong>Test44</strong></p>
<p><singlelink></singlelink><br />
<strong><span style="color:#8E2323;">Test1| Test2| Test3</span></strong></p>
<p><a href="http://gregregrgreg" rel="nofollow">testesttest</a></p>
</div>
</html>

I use the following Python code with BeautifulSoup4:

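#!/usr/bin/python2
from bs4 import BeautifulSoup

# Read the whole file and strip newlines so the markup is a single line.
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

# No parser is named here, so BS4 picks the best one that is installed.
soup = BeautifulSoup(data)
postarea = soup.find('div', class_="postarea")
print postarea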

I expected to get everything in that text file except the closing html tag. What I get instead is:

<div class="postarea">
<div align="center"><img height="120" src="http://images" width="500"/></div>
<p><info></info><br/>
<strong>Season of Monsters</strong></p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr/><strong>Test44</strong></div>

It added a closing strong tag on line 4 and closed the initial div right after <strong>Test44</strong>.
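Both repairs suggest that the input itself is unbalanced. A minimal sketch for checking this, using only the standard library's HTMLParser rather than BeautifulSoup; the names TagBalanceChecker, VOID_TAGS, find_problems, and SAMPLE_HTML are hypothetical and introduced only for this illustration, and the sample is the HTML from above pasted in as a string:

```python
# Standard-library HTML tokenizer, imported in a way that works on Python 2 and 3.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# Tags that never take a closing tag; only the ones in the sample are listed.
VOID_TAGS = {"img", "br", "hr"}

SAMPLE_HTML = """<div class="postarea">
<div align="center"><img src="http://images" width="500" height="120" /></div>
<p><info></info><br />
<strong>Season of Monsters</p>
<p>Conflict in wartime.</p>
<p><span id="more-473091"></span></p>
<hr /><strong>Test44</strong></p>
<p><singlelink></singlelink><br />
<strong><span style="color:#8E2323;">Test1| Test2| Test3</span></strong></p>
<p><a href="http://gregregrgreg" rel="nofollow">testesttest</a></p>
</div>
</html>"""


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records closing tags that do not match the innermost open tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []  # stack of (tag name, line where it was opened)
        self.problems = []   # list of (line, message)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # self-closing tags such as <br /> also trigger an end-tag event
        line = self.getpos()[0]
        if not any(name == tag for name, _ in self.open_tags):
            self.problems.append((line, "stray </%s> with no open <%s>" % (tag, tag)))
            return
        # Unwind to the matching tag, reporting anything that was left open inside it.
        while self.open_tags[-1][0] != tag:
            name, opened = self.open_tags.pop()
            self.problems.append(
                (line, "<%s> opened on line %d was still open at </%s>" % (name, opened, tag)))
        self.open_tags.pop()


def find_problems(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


for line, message in find_problems(SAMPLE_HTML):
    print("line %d: %s" % (line, message))
```

On the sample this flags the unclosed <strong> on line 4, a </p> with no matching <p> on line 7 (the line where the BS4 output stops), and the trailing </html> on line 12.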

If I use the following code with BeautifulSoup3 instead:

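#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup

# Read the whole file and strip newlines so the markup is a single line.
with open('html.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

soup = BeautifulSoup(data)
# Match on the class attribute through an attrs dict.
postarea = soup.find('div', {"class": "postarea"})
print postarea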

I get the desired result (although it also adds a closing strong tag on line 4).

How does BS actually determine where the div ends? If it were simply looking for the matching closing div tag, I should get a different result.
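One way the early close could arise, sketched as a toy model and not taken from BeautifulSoup's source: a tree builder that keeps a stack of open tags and, on a closing tag, pops until it finds a match. If such a builder does not ignore a stray closing tag, that tag can unwind the whole stack, including the outer div. The names StackModel, link_stays_inside_div, and SAMPLE, and the ignore_stray_end_tags switch, are assumptions made up for this illustration.

```python
# Standard-library HTML tokenizer, imported in a way that works on Python 2 and 3.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

VOID_TAGS = {"img", "br", "hr"}

# Shortened version of the markup from the question, keeping both broken spots.
SAMPLE = (
    '<div class="postarea">'
    '<p><strong>Season of Monsters</p>'
    '<hr /><strong>Test44</strong></p>'
    '<p><a href="http://gregregrgreg" rel="nofollow">testesttest</a></p>'
    '</div>'
)


class StackModel(HTMLParser):
    """Toy model of a stack-based tree builder; not BeautifulSoup's actual code."""

    def __init__(self, ignore_stray_end_tags):
        HTMLParser.__init__(self)
        self.ignore_stray_end_tags = ignore_stray_end_tags
        self.stack = []              # names of the currently open tags
        self.link_inside_div = None  # was a <div> still open when <a> appeared?

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_inside_div = "div" in self.stack
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.stack and self.ignore_stray_end_tags:
            return  # lenient policy: drop a closing tag that has no open partner
        # Pop until the matching tag is found; with no match, everything is closed.
        while self.stack:
            if self.stack.pop() == tag:
                break


def link_stays_inside_div(html, ignore_stray_end_tags):
    model = StackModel(ignore_stray_end_tags)
    model.feed(html)
    model.close()
    return model.link_inside_div


print(link_stays_inside_div(SAMPLE, ignore_stray_end_tags=False))
print(link_stays_inside_div(SAMPLE, ignore_stray_end_tags=True))
```

In this model the link ends up outside the div only when the stray </p> is allowed to unwind the stack, which matches the shape of the BS4 output above. Whether a real parser behaves this way depends on which one is in use: BS4 accepts a parser name as its second argument (for example 'html.parser', 'lxml', or 'html5lib'), and these parsers can repair invalid markup differently.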

How can I grab it correctly?

Thanks in advance,

Hachel

0 answers:

No answers