How to extract only the <p> tags that have <strong> inside them

Time: 2014-07-03 20:55:10

Tags: python html web-scraping beautifulsoup

I am using Beautiful Soup for web scraping. I have narrowed my output down to this:

[<p><a class="puCloselink" href="javascript:window.close();" id="ctl00_CloseThisPage" onclick="javascript:window.close();" title="Close Window" usesubmitbehavior="false">Close Window</a></p>, <p class="txtBoxAlign">
<span style="display:none"><input id="Location" name="Location" type="hidden"     value="48001"/>
</span>
<input id="newLocation" type="text"/>
<input id="GeoCode" name="GeoCode" type="hidden" value=""/>
<label for="newLocation">Address or ZIP Code (Required): </label>
</p>, <p class="PharmacyNameTxtBox txtBoxAlign">
<input id="PharmacyName" name="PharmacyName" type="text" value=""/>
<label for="PharmacyName">Pharmacy Name: </label>
</p>, <p>
<strong>CVS Pharmacy #</strong><br/>
                        1025 St Clair River Dr <br/>
                        Algonac, MI 48001<br/>
                        1-810-794-4941
                        </p>, <p>
                     Retail
                       </p>, <p>
                        Not applicable
                        </p>, <p>
<strong>Kroger Pharmacy</strong><br/>
                        2600 Pointe Tremble <br/>
                        Algonac, MI 48001<br/>
                        1-810-671-4002
                        </p>, <p>
                     Retail
                       </p>, <p>
                        Not applicable
                        </p>, <p>
<strong>Rite Aid Pharmacy 04943</strong><br/>
                        402 Pointe Tremble Road <br/>
                        Algonac, MI 48001<br/>
                        1-810-794-4985
                        </p>, <p>
                     Retail
                       </p>, <p>
                        Not applicable
                        </p>]

As you can see, some of the "p" tags have a "strong" tag inside them and some of the "p" tags do not. How can I isolate the entire "p" tag, but only for those "p" tags that contain a "strong" tag?

Thanks.

2 Answers:

Answer 0: (score: 1)

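Use ElementTree to parse the markup and isolate the text you want. Don't forget to wrap your HTML in a row tag so that it has a single root element and can be parsed.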

import xml.etree.ElementTree as ElementTree

myHTML = '<row> <p><a class="puCloselink" href="javascript:window.close();" id="ctl00_CloseThisPage" onclick="javascript:window.close();" title="Close Window" usesubmitbehavior="false">Close Window</a></p>, <p class="txtBoxAlign"> <span style="display:none"><input id="Location" name="Location" type="hidden"     value="48001"/> </span> <input id="newLocation" type="text"/> <input id="GeoCode" name="GeoCode" type="hidden" value=""/> <label for="newLocation">Address or ZIP Code (Required): </label> </p>, <p class="PharmacyNameTxtBox txtBoxAlign"> <input id="PharmacyName" name="PharmacyName" type="text" value=""/> <label for="PharmacyName">Pharmacy Name: </label> </p>, <p> <strong>CVS Pharmacy #</strong><br/> 1025 St Clair River Dr <br/> Algonac, MI 48001<br/> 1-810-794-4941 </p>, <p> Retail </p>, <p> Not applicable </p>, <p> <strong>Kroger Pharmacy</strong><br/> 2600 Pointe Tremble <br/> Algonac, MI 48001<br/> 1-810-671-4002 </p>, <p> Retail </p>, <p> Not applicable </p>, <p> <strong>Rite Aid Pharmacy 04943</strong><br/> 402 Pointe Tremble Road <br/> Algonac, MI 48001<br/> 1-810-794-4985 </p>, <p> Retail </p>, <p> Not applicable </p> </row>'

root = ElementTree.fromstring(myHTML)

p_elements = root.findall("p")

p_strong_elements = []

for element in p_elements:
    if element.find('strong') is not None:
        p_strong_elements.append(element)

# Borrowed from this post: http://stackoverflow.com/questions/380603/how-do-i-get-the-whole-text-of-an-element-using-elementtree
for p_strong_element in p_strong_elements:
    # Leading text of the <p>, then each child serialized together with its trailing text.
    inner = (p_strong_element.text or "") + "".join(
        ElementTree.tostring(e, encoding="unicode") for e in p_strong_element
    )
    print("<" + p_strong_element.tag + "> " + inner + "</" + p_strong_element.tag + "> ")

# Output:
# <p>  <strong>CVS Pharmacy #</strong><br /> 1025 St Clair River Dr <br /> Algonac, MI 48001<br /> 1-810-794-4941 </p>
# <p>  <strong>Kroger Pharmacy</strong><br /> 2600 Pointe Tremble <br /> Algonac, MI 48001<br /> 1-810-671-4002 </p>
# <p>  <strong>Rite Aid Pharmacy 04943</strong><br /> 402 Pointe Tremble Road <br /> Algonac, MI 48001<br /> 1-810-794-4985 </p>
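For what it's worth, the filter-and-serialize steps above can probably be shortened. The sketch below is not the answerer's code: it assumes Python 3, uses ElementTree's `p[strong]` path predicate (which only matches direct children), and runs on a shortened, made-up `sample` string. The helper name `paragraphs_with_strong` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample; the <row> wrapper supplies the single root element.
sample = (
    "<row>"
    "<p><strong>CVS Pharmacy #</strong><br/>1025 St Clair River Dr</p>, "
    "<p>Retail</p>, "
    "<p><strong>Kroger Pharmacy</strong><br/>2600 Pointe Tremble</p>"
    "</row>"
)


def paragraphs_with_strong(xml_text):
    """Return the serialized <p> elements that have a direct <strong> child."""
    root = ET.fromstring(xml_text)
    results = []
    for p in root.findall("p[strong]"):
        p.tail = None  # drop the ", " separator that follows each </p>
        results.append(ET.tostring(p, encoding="unicode"))
    return results


matches = paragraphs_with_strong(sample)
for m in matches:
    print(m)
```

Keep in mind this still depends on the input being well-formed XML, which real-world HTML often is not.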

Answer 1: (score: 0)

Found a very simple answer:

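address = []  # collects the <p> tags that contain <strong>
for i in table:  # table is the list of <p> tags shown in the question
    if "strong" in str(i):
        address.append(i)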

Probably not the most elegant way, but it works for me.