I'm trying to extract `Manufacturer #:` and `PAW11295` from the HTML source below, and I'm stuck. Any suggestions would be appreciated.
soupTest.find("div",id = "AddnInfo")
Out[121]:
<div id="AddnInfo">
<h3>Additional Info</h3>
<p>
<p class="sknText"><label>“R”Web#:</label> <span class="value">215904</span> </p>
<p class="skuText"><label>SKU:</label> <span class="value">B7958C02</span> </p>
<p class="upc"><label>UPC/EAN/ISBN:</label> <span class="value">092317112958</span></p>
<p><label>Manufacturer #:</label> PAW11295</p>
<p><label>Product Weight:</label>2.2 pounds</p>
<p><label>Product Dimensions (in inches):</label>12.7 x 10.1 x 5.4</p>
</p>
</div>
Thanks in advance.
Answer 0 (score: 3)
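The following approach should work. It takes the fifth <p> element inside the div (counting the outer wrapper <p>) and reads the text of its <label>. It then removes the label and prints the stripped text that remains in the <p> tag: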
from bs4 import BeautifulSoup
html = """
<div id="AddnInfo">
<h3>Additional Info</h3>
<p>
<p class="sknText"><label>“R”Web#:</label> <span class="value">215904</span> </p>
<p class="skuText"><label>SKU:</label> <span class="value">B7958C02</span> </p>
<p class="upc"><label>UPC/EAN/ISBN:</label> <span class="value">092317112958</span></p>
<p><label>Manufacturer #:</label> PAW11295</p>
<p><label>Product Weight:</label>2.2 pounds</p>
<p><label>Product Dimensions (in inches):</label>12.7 x 10.1 x 5.4</p>
</p>
</div>
"""
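soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
div = soup.find('div', {'id': 'AddnInfo'})
p = div.find_all('p')[4]  # fifth <p>: the "Manufacturer #" row
label = p.find('label')
manufacturer = label.text
label.extract()  # remove the <label> so only the value remains
part_number = p.get_text(strip=True)  # renamed from id to avoid shadowing the built-in
print(manufacturer)
print(part_number)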
This will display:
Manufacturer #:
PAW11295
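Picking the row by position breaks as soon as the page adds or drops a field. As a hedged alternative, the sketch below looks the <label> up by its text and reads the text node that follows it. The HTML is a shortened copy of the snippet from the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<div id="AddnInfo">
<p class="upc"><label>UPC/EAN/ISBN:</label> <span class="value">092317112958</span></p>
<p><label>Manufacturer #:</label> PAW11295</p>
<p><label>Product Weight:</label>2.2 pounds</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="AddnInfo")

# Find the <label> by its text instead of by its position in the list.
label = div.find(
    lambda tag: tag.name == "label" and "Manufacturer" in tag.get_text()
)

manufacturer_label = label.get_text(strip=True)
# The value is the bare text node that directly follows the label.
manufacturer_value = label.next_sibling.strip()

print(manufacturer_label)
print(manufacturer_value)
```

This assumes the value is a plain text node right after the label, as it is for the Manufacturer row; rows that wrap the value in a <span> would need `label.find_next_sibling("span")` instead.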
Answer 1 (score: 1)
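I think you want something like this. First select the outer <p> tag, then select all of the inner <p> tags, and then reference the single <p> you want, which in this case is the fourth one.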
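infoDiv = soupTest.find("div", id="AddnInfo")
outerP = infoDiv.p  # the outer <p>; assumes the parser keeps the nested <p>s inside it (html.parser does)
innerPs = outerP.find_all('p')  # returns a list of the inner <p>s
# .string is None on a tag with more than one child, so read the label and its sibling text separately
manufacturer_label = innerPs[3].label.get_text(strip=True)  # "Manufacturer #:"
manufacturer_number = innerPs[3].label.next_sibling.strip()  # "PAW11295"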
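If more than one field is needed, it may be simpler to avoid indexing altogether. The following is a minimal sketch, not part of either answer, that maps every <label> in the div to the text that follows it. The HTML is a shortened copy of the question's markup, and `info` is a hypothetical name.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the markup from the question, nested <p> included.
html = """
<div id="AddnInfo">
<h3>Additional Info</h3>
<p>
<p class="skuText"><label>SKU:</label> <span class="value">B7958C02</span> </p>
<p><label>Manufacturer #:</label> PAW11295</p>
<p><label>Product Weight:</label>2.2 pounds</p>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="AddnInfo")

info = {}
for label in div.find_all("label"):
    key = label.get_text(strip=True).rstrip(":")
    # Join everything after the label in the same <p>: tags contribute
    # their text, bare text nodes are used as they are.
    parts = [
        sib.get_text() if isinstance(sib, Tag) else str(sib)
        for sib in label.next_siblings
    ]
    info[key] = "".join(parts).strip()

print(info["Manufacturer #"])
```

Because each lookup starts from the label rather than from the outer <p>, this sketch does not rely on how the parser treats the nested <p> tags.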