I'm trying to use BeautifulSoup to scrape a speech from a website. However, I'm running into problems because the speech is split across many different paragraphs. I'm very new to programming and can't figure out how to handle this. The HTML of the page looks like this:
<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney,
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is
at war; our economy is in recession; and the civilized world faces unprecedented dangers.
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims,
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps,
saved a people from starvation, and freed a country from brutal oppression.
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to
sacrifice their lives are running for their own.
It continues like that for a while, with multiple paragraph tags. I'm trying to extract all of the text within the span.
I've tried a couple of different ways of getting the text, but both failed to get the text I wanted.
The first one I tried was:
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string
That gave me:
Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.
That is the part of the text up to the first paragraph tag. Then I tried:
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
    paragraph = section.findNext('p')
    if paragraph and paragraph.string:
        print '>', paragraph.string
    else:
        print '>', section.parent.next.next.strip()
That gave me the text between the first paragraph tag and the second paragraph tag. So, I'm looking for a way to get the entire text, not just parts of it.
Answer 0 (score: 8)
import urllib2,sys
from BeautifulSoup import BeautifulSoup
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit
paras = [x.contents[0] for x in span.findAllNext("p")] # this gives you the rest
# use .contents[0] instead of .string to deal with last para that's not well formed
print "%s\n\n%s" % (span.string, "\n\n".join(paras))
As pointed out in the comments, the above doesn't work well if the <p>
tags contain further nested tags. That can be handled with:
paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]
However, that doesn't work well for the last <p>
, which has no closing tag. A hacky workaround is to treat it differently. For example:
import urllib2,sys
from BeautifulSoup import BeautifulSoup
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"})
paras = [x for x in span.findAllNext("p")]
start = span.string                        # text before the first <p>
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]               # .contents[0] because the final <p> isn't well formed
print "%s\n\n%s\n\n%s" % (start, middle, last)
Answer 1 (score: 2)
Here's how it could be done with lxml
:
import lxml.html as lh
tree = lh.parse('http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW')
text = tree.xpath("//span[@class='displaytext']")[0].text_content()
Alternatively, the answer to this question covers how to achieve the same thing with BeautifulSoup: BeautifulSoup - easy way to obtain HTML-free contents
The helper function from the accepted answer:
def textOf(soup):
    return u''.join(soup.findAll(text=True))
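Applied to the page from the question, a usage sketch might look like the following; the second print mirrors Answer 0's findAllNext approach, on the assumption (not stated in the linked answer) that the unclosed <p> tags may not end up inside the span in the parse tree:
import urllib2
from BeautifulSoup import BeautifulSoup

def textOf(soup):
    # helper from the linked answer, repeated here so the sketch is self-contained
    return u''.join(soup.findAll(text=True))

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find('span', {'class': 'displaytext'})
# text nodes belonging to the span itself
print textOf(span)
# paragraphs that follow the span in document order, mirroring Answer 0's approach
print u'\n\n'.join(textOf(p) for p in span.findAllNext('p'))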
Answer 2 (score: 0)
You should try:
soup.span.renderContents()
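Note that renderContents() returns the raw inner HTML of the tag, markup included, rather than plain text, so the tags would still need stripping. A minimal usage sketch, borrowing the span lookup from the question instead of soup.span (which would grab the first span on the page, not necessarily the displaytext one):
import urllib2
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find('span', {'class': 'displaytext'})
# raw inner HTML of the span, <p> tags and all
print span.renderContents()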