BeautifulSoup does not parse perfectly

Date: 2015-08-07 07:15:53

Tags: python-2.7 beautifulsoup

When I use soup.find("h3", text="Main Address:").find_parents("section"), the output I get is:

[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]

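Now I want to print only the text of the paragraph, but I have not been able to. Please tell me how to print only the text inside this section's paragraph.

Alternatively, my HTML page looks like this: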

<article>
<header>
    <h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Official Name:</h3></header>
    <p><a href="http://alaska.gov/">Alaska</a>
    </p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
    <header><h3  class="org">Governor:</h3></header>
    <p><a href="http://gov.alaska.gov/Walker/contact/email-the-governor.html">Bill Walker</a></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Main Address:</h3></header>
    <p>120 East 4th Street<br>
        <span class="locality">Juneau</span>, 
        <span class="region">AK</span>, 
        <span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
    <header><h3  itemprop="name">Phone Number:</h3></header>
    <p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
    <span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
    <header><h2 id="state-agencies">State Agencies</h2></header>
    <ul>
        <li><a href="/state-consumer/alaska">Consumer Protection Offices</a></li>
        <li><a href="http://www.correct.state.ak.us/">Corrections Department</a></li>
        <li><a href="http://www.elections.alaska.gov/">Election Office</a></li>
        <li><a href="http://doa.alaska.gov/dmv/">Motor Vehicle Offices</a></li>
        <li><a href="http://doa.alaska.gov/dgs/property/">Surplus Property Sales</a></li>
        <li><a href="http://www.travelalaska.com">Travel and Tourism</a></li>
    </ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
    <span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>

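How should I get the address text from it?

1 answer:

Answer 0: (score: 0)

Your current code returns a list containing one element. To get the <p> elements inside that element, you can extend it slightly: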

soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
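The one-liner above relies on two Beautiful Soup conventions that are easy to miss: find_parents() returns a list, and calling a tag like a function is shorthand for find_all(). The sketch below spells both out on a cut-down stand-in for one section of the page. The `html` string is made up for illustration, and `string=` is the newer name for the `text=` argument (Beautiful Soup 4.4.0 and later), so swap it back if you are on an older release.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one section of the page in the question (assumption).
html = (
    '<section class="otln">'
    '<header><h3 itemprop="name">Main Address:</h3></header>'
    '<p>120 East 4th Street</p>'
    '</section>'
)
soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h3", string="Main Address:")

# find_parents() returns a list of every matching ancestor ...
sections = heading.find_parents("section")
# ... while find_parent() returns only the nearest one (or None).
section = heading.find_parent("section")

# Calling a tag like a function is shorthand for find_all().
paragraphs = sections[0]("p")
same_paragraphs = section.find_all("p")

print(paragraphs == same_paragraphs)
```

If you only ever need the nearest enclosing section, find_parent("section") avoids the extra [0] index.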

If you want the contents of the p element, take the first element of that list as well and call decode_contents on it:

soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")

For the HTML page you posted, this returns:

u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
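
Note that decode_contents returns the paragraph's inner markup, tags included. Since the question asks for the text only, get_text() may be closer to what you want. The following is a minimal sketch, assuming the section looks like the one in the question's HTML page. The `html` string is a trimmed copy made for illustration, and the whitespace cleanup with split() is just one possible choice.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Main Address:" section from the question's page.
html = """
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
    <header><h3  itemprop="name">Main Address:</h3></header>
    <p>120 East 4th Street<br>
        <span class="locality">Juneau</span>,
        <span class="region">AK</span>,
        <span class="postal-code">99801</span></p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading, then its enclosing section, then the paragraph.
# On Beautiful Soup older than 4.4.0, use text= instead of string=.
heading = soup.find("h3", string="Main Address:")
paragraph = heading.find_parent("section").find("p")

# get_text() drops the tags; split()/join collapses newlines and indentation.
address = " ".join(paragraph.get_text().split())
print(address)

# Individual fields can be read from the spans by class name.
city = paragraph.find("span", class_="locality").get_text()
postal_code = paragraph.find("span", class_="postal-code").get_text()
```

With the sample above, address comes out as 120 East 4th Street Juneau, AK, 99801.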