Question: BeautifulSoup Parsing

Time: 2011-08-27 23:42:31

Tags: python beautifulsoup

<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>

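I want to extract the "bla bla" text about Mr. Paul. On some pages the text starting with "Mr. Paul" is preceded by a <p>, so I can use findNext('p'). On other pages there is no <p>, as in the example above.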

Here is my code for the case where the <p> is present:

import re
# bs2 is the already-parsed BeautifulSoup document
background = bs2.find(text=re.compile("BACKGROUND"))
bb = background.findNext('p').contents

But how can I extract the information when there is no <p>?

2 Answers:

Answer 0 (score: 2)

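It's hard to tell from the example you gave us, but I think you can simply take the next node after the h2. In this example, Lewis Carroll has a p(aragraph) tag, while your friend Paul only has a closing span tag: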

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla
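
The session above uses Python 2 and BeautifulSoup 3. For readers on Python 3, the same idea might be sketched with the `bs4` package roughly as follows. This is a minimal sketch, not the answerer's code: it assumes `bs4` is installed, that the built-in `html.parser` backend is used (other parsers may build a different tree from this malformed HTML), and `first_text_after` is a hypothetical helper name introduced only for illustration. Instead of assuming a `<p>`, it walks the heading's following siblings and returns the first one that contains any text.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
'''


def first_text_after(heading):
    """Hypothetical helper: first non-empty text among the heading's next siblings."""
    for node in heading.next_siblings:
        if isinstance(node, NavigableString):
            # Bare text directly after the heading (the "Paul" case).
            text = str(node)
        else:
            # Any tag at all: <p>, <span>, <div>, ... (the "Lewis" case).
            text = node.get_text()
        text = text.strip()
        if text:
            return text
    return None


# Assumption: html.parser, which ignores the stray </span> at top level.
soup = BeautifulSoup(html, 'html.parser')
results = [
    first_text_after(h2)
    for h2 in soup.find_all('h2')
    if h2.get_text(strip=True) == 'BACKGROUND'
]
for text in results:
    print('> ', text)
```

Because the helper never looks at the tag name, it should handle both layouts from the question with one code path. On a real page it may need an extra check so that it does not pick up unrelated text when a BACKGROUND section is empty.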

Following up on the comments:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]

Of course, you may want to check the site's copyright notice, and so on...

Answer 1 (score: 0)

  

"Some pages have a <p> before Mr. Paul, so I can use findNext('p'). However, some pages have no <p>, as in the example above."

You haven't given us enough information to say how your string can be identified. Some options:

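  • A fixed node structure, e.g. getChildren()[1].getChildren()[0].text
  • If your text is always preceded by the magic string 'BACKGROUND', then your approach of finding the next node seems fine; just don't assume the tag name is 'p'
  • A regular expression (e.g. "(Mr|Ms) ...")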
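
The regular-expression option could look roughly like the sketch below, which uses only the standard library. The pattern, the list of honorifics, and the rule "stop at the next `<`" are all assumptions made for illustration; none of them come from the question, and regular expressions over raw HTML tend to be brittle on real pages.

```python
import re

html = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
'''

# Assumed pattern: an honorific, a period, whitespace, then everything
# up to (but not including) the next tag.
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.\s[^<]+')

matches = [m.group(0).strip() for m in pattern.finditer(html)]
for text in matches:
    print(text)
```

This finds the text whether or not it is wrapped in a `<p>`, but it would miss any biography that does not start with one of the listed honorifics, so it is best treated as a fallback rather than the main approach.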

Could you show us an HTML example where the name is not preceded by a <p>?