Python: getting paragraphs from HTML

Time: 2014-06-30 11:43:59

Tags: python html

I am iterating over a series of links to collect all of Obama's speeches. For some of the links, however, the HTML looks like this:

<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">If 
              there is anyone out there who still doubts that America is a place 
              where all things are possible; who still wonders if the dream of 
              our founders is alive in our time; who still questions the power 
              of our democracy, tonight is your answer.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer told by lines that stretched around schools and churches 
              in numbers this nation has never seen; by people who waited three 
              hours and four hours, many for the very first time in their lives, 
              because they believed that this time must be different; that their 
              voice could be that difference.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer spoken by young and old, rich and poor, Democrat and 
              Republican, black, white, Latino, Asian, Native American, gay, straight, 
              disabled and not disabled – Americans who sent a message to 
              the world that we have never been a collection of Red States and 
              Blue States: we are, and always will be, the United States of America.</font></p>

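If I call soup.find_all('font'), I only get one paragraph rather than the whole text. For other links, however, the HTML can look like the text below, and there soup.find_all('font') returns the whole text to me.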

</font></strong><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
</font></font><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
            My fellow citizens:<br/>
<br/>
            I stand here today humbled by the task before us, grateful for the 
            trust you have bestowed, mindful of the sacrifices borne by our ancestors. 
            I thank President Bush for his service to our nation, as well as the 
            generosity and cooperation he has shown throughout this transition.<br/>
<br/>
            Forty-four Americans have now taken the presidential oath. The words 
            have been spoken during rising tides of prosperity and the still waters 
            of peace. Yet, every so often the oath is taken amidst gathering clouds 
            and raging storms. At these moments, America has carried on not simply 
            because of the skill or vision of those in high office, but because 
            We the People have remained faithful to the ideals of our forbearers, 
            and true to our founding documents.<br/>
<br/>
            So it has been. So it must be with this generation of Americans.<br/>
</font> <div align="left">

Goal: I want to get the entire speech, not just a single paragraph. How can I do this with BeautifulSoup in Python?

The two speeches come from:

http://obamaspeeches.com/E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm

http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm

1 answer:

Answer 0 (score: 1)

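Unfortunately, because the pages do not follow a consistent structure, this will take more work on your part: no single piece of logic will handle all of them.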

For the specific cases you listed, however, you can do either of the following:

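Select the parent that contains the font tags, i.e. the table. (Note: you will need some logic to verify which table contains the content you want, because the site uses a table-based layout.)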

for table in soup.find_all('table'):
    # this_is_the_table_you_want is a placeholder for your own check
    if this_is_the_table_you_want:
        print(table.text)
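One possible way to fill in the "which table" check is sketched below. It rests on an assumption that is not part of the original answer: that the speech sits in the innermost table with the most text. The helper name find_speech_table and the sample HTML are hypothetical, and the heuristic may need adjusting for the real pages.

```python
from bs4 import BeautifulSoup


def find_speech_table(soup):
    """Return the innermost <table> holding the most text, or None."""
    # Skip layout tables that only wrap other tables; they would always win on length.
    candidates = [t for t in soup.find_all("table") if t.find("table") is None]
    if not candidates:
        return None
    return max(candidates, key=lambda t: len(t.get_text(strip=True)))


# Made-up sample mimicking a table-based layout (not the real page).
html = """
<table><tr><td>
  <table><tr><td>Home | Speeches</td></tr></table>
  <table><tr><td>
    <p><font size="3">First paragraph of the speech.</font></p>
    <p><font size="3">Second paragraph of the speech.</font></p>
  </td></tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = find_speech_table(soup)
speech = table.get_text(" ", strip=True)
print(speech)
```

Filtering out tables that contain other tables matters here: the outer layout table includes the text of everything inside it, so a plain "longest text" comparison would always select it.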

- OR -

Simply build a string from the tags you already have

# Concatenate the text of every font tag into one string
speech_text = ""
for font in soup.find_all('font'):
    speech_text += font.text
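A slightly fuller sketch of the same idea follows, under a few assumptions of my own: nested font tags are skipped so their text is not counted twice, each text chunk is treated as one paragraph, and the indentation left over from the page source is collapsed. The function name extract_paragraphs and the shortened sample snippets are hypothetical; they only imitate the two layouts shown in the question.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(soup):
    """Collect whitespace-normalised text chunks from outermost <font> tags."""
    paragraphs = []
    for font in soup.find_all("font"):
        # A nested <font> is already covered by its ancestor; skip it to avoid duplicates.
        if font.find_parent("font") is not None:
            continue
        for chunk in font.stripped_strings:
            # Collapse the line breaks and indentation left over from the page source.
            paragraphs.append(" ".join(chunk.split()))
    return paragraphs


# Layout with one <font> per <p>.
layout_a = """
<p><font size="3">If
      there is anyone out there.</font></p>
<p><font size="3">It's
      the answer.</font></p>
"""
# Layout with a single <font> and <br/> separators.
layout_b = """
<font size="3"><br/>
      My fellow citizens:<br/>
<br/>
      I stand here today humbled
      by the task before us.<br/>
</font>
"""

for html in (layout_a, layout_b):
    soup = BeautifulSoup(html, "html.parser")
    print("\n\n".join(extract_paragraphs(soup)))
```

Note that any inline tag inside a font element (bold or italic text, for instance) would start a new chunk with this approach, so paragraphs containing such markup may need to be merged afterwards.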