I am using Beautiful Soup for web scraping. I have narrowed my output down to this:
[<p><a class="puCloselink" href="javascript:window.close();" id="ctl00_CloseThisPage" onclick="javascript:window.close();" title="Close Window" usesubmitbehavior="false">Close Window</a></p>, <p class="txtBoxAlign">
<span style="display:none"><input id="Location" name="Location" type="hidden" value="48001"/>
</span>
<input id="newLocation" type="text"/>
<input id="GeoCode" name="GeoCode" type="hidden" value=""/>
<label for="newLocation">Address or ZIP Code (Required): </label>
</p>, <p class="PharmacyNameTxtBox txtBoxAlign">
<input id="PharmacyName" name="PharmacyName" type="text" value=""/>
<label for="PharmacyName">Pharmacy Name: </label>
</p>, <p>
<strong>CVS Pharmacy #</strong><br/>
1025 St Clair River Dr <br/>
Algonac, MI 48001<br/>
1-810-794-4941
</p>, <p>
Retail
</p>, <p>
Not applicable
</p>, <p>
<strong>Kroger Pharmacy</strong><br/>
2600 Pointe Tremble <br/>
Algonac, MI 48001<br/>
1-810-671-4002
</p>, <p>
Retail
</p>, <p>
Not applicable
</p>, <p>
<strong>Rite Aid Pharmacy 04943</strong><br/>
402 Pointe Tremble Road <br/>
Algonac, MI 48001<br/>
1-810-794-4985
</p>, <p>
Retail
</p>, <p>
Not applicable
</p>]
As you can see, some of the "p" tags have a "strong" tag inside them and some do not. How can I isolate the entire "p" tag, but only for those "p" tags that contain a "strong" tag?
Thanks.
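Answer 0 (score: 1)
Use ElementTree to parse the markup and isolate the text you want. Don't forget to wrap your HTML in a row tag so that it has a single root element and can be parsed.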
import xml.etree.ElementTree as ElementTree
myHTML = '<row> <p><a class="puCloselink" href="javascript:window.close();" id="ctl00_CloseThisPage" onclick="javascript:window.close();" title="Close Window" usesubmitbehavior="false">Close Window</a></p>, <p class="txtBoxAlign"> <span style="display:none"><input id="Location" name="Location" type="hidden" value="48001"/> </span> <input id="newLocation" type="text"/> <input id="GeoCode" name="GeoCode" type="hidden" value=""/> <label for="newLocation">Address or ZIP Code (Required): </label> </p>, <p class="PharmacyNameTxtBox txtBoxAlign"> <input id="PharmacyName" name="PharmacyName" type="text" value=""/> <label for="PharmacyName">Pharmacy Name: </label> </p>, <p> <strong>CVS Pharmacy #</strong><br/> 1025 St Clair River Dr <br/> Algonac, MI 48001<br/> 1-810-794-4941 </p>, <p> Retail </p>, <p> Not applicable </p>, <p> <strong>Kroger Pharmacy</strong><br/> 2600 Pointe Tremble <br/> Algonac, MI 48001<br/> 1-810-671-4002 </p>, <p> Retail </p>, <p> Not applicable </p>, <p> <strong>Rite Aid Pharmacy 04943</strong><br/> 402 Pointe Tremble Road <br/> Algonac, MI 48001<br/> 1-810-794-4985 </p>, <p> Retail </p>, <p> Not applicable </p> </row>'
root = ElementTree.fromstring(myHTML)
p_elements = root.findall("p")
p_strong_elements = []
for element in p_elements:
    # Keep only the <p> elements that have a direct <strong> child
    if element.find('strong') is not None:
        p_strong_elements.append(element)
# Borrowed from this post: http://stackoverflow.com/questions/380603/how-do-i-get-the-whole-text-of-an-element-using-elementtree
for p_strong_element in p_strong_elements:
    # Iterate over the element directly; getchildren() was removed in Python 3.9
    children = [ElementTree.tostring(e, encoding="unicode") for e in p_strong_element]
    print("<" + p_strong_element.tag + "> " + "".join([p_strong_element.text or ""] + children) + "</" + p_strong_element.tag + "> ")
>>>'<p> <strong>CVS Pharmacy #</strong><br /> 1025 St Clair River Dr <br /> Algonac, MI 48001<br /> 1-810-794-4941 </p> '
>>>'<p> <strong>Kroger Pharmacy</strong><br /> 2600 Pointe Tremble <br /> Algonac, MI 48001<br /> 1-810-671-4002 </p> '
>>>'<p> <strong>Rite Aid Pharmacy 04943</strong><br /> 402 Pointe Tremble Road <br /> Algonac, MI 48001<br /> 1-810-794-4985 </p> '
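The filtering loop above can also be written as a single path expression. This is a minimal sketch, assuming the same approach of wrapping the fragment in a row root; the shortened sample string and the variable names are illustrative and not part of the original answer. ElementTree's limited XPath support includes the [tag] predicate, which selects elements that have a child with that name.

```python
import xml.etree.ElementTree as ElementTree

# Shortened, illustrative sample wrapped in a <row> root (not the full page).
sample = (
    "<row>"
    "<p><strong>CVS Pharmacy #</strong><br/>1025 St Clair River Dr</p>"
    "<p>Retail</p>"
    "<p><strong>Kroger Pharmacy</strong><br/>2600 Pointe Tremble</p>"
    "<p>Not applicable</p>"
    "</row>"
)

root = ElementTree.fromstring(sample)

# "p[strong]" selects every <p> child of the root that has a <strong> child.
p_with_strong = root.findall("p[strong]")

# Collect the pharmacy names from the matching paragraphs.
names = [p.find("strong").text for p in p_with_strong]
print(names)
```

Like the loop, this only works when the markup is well-formed XML, so real-world HTML may still need to be cleaned up first.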
Answer 1 (score: 0)
Found a very simple answer:
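address = []  # collects the matching <p> tags
for i in table:
    # Keep any tag whose markup contains the text "strong"
    if "strong" in str(i):
        address.append(i)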
It may not be the most elegant way, but it works for me.
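One caveat with the substring test is that it matches the word "strong" anywhere in a tag's markup, including its text and attribute values. The sketch below shows the difference, using ElementTree and a made-up sample so that it runs without extra dependencies; the sample markup and variable names are assumptions. With Beautiful Soup, the analogous structural check would be i.find("strong") is not None.

```python
import xml.etree.ElementTree as ElementTree

# Made-up sample: the second <p> mentions "strong" only in its text.
sample = (
    "<row>"
    "<p><strong>Rite Aid Pharmacy 04943</strong><br/>402 Pointe Tremble Road</p>"
    "<p>A strong recommendation, but no tag</p>"
    "<p>Retail</p>"
    "</row>"
)
root = ElementTree.fromstring(sample)
paragraphs = root.findall("p")

# Substring test on the serialized markup: matches text as well as tags.
by_substring = [
    p for p in paragraphs
    if "strong" in ElementTree.tostring(p, encoding="unicode")
]

# Structural test: matches only <p> elements with a <strong> child.
by_structure = [p for p in paragraphs if p.find("strong") is not None]

print(len(by_substring), len(by_structure))
```

For the pharmacy listing in the question both tests give the same result, so the simple version is fine there; the structural check is mainly a safeguard for pages where the word could appear elsewhere.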