Given the following HTML snippet, I want to extract the values that correspond to the labels "Price" and "Ships from":
<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>
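This is part of a larger HTML file. In some files the "Ships from" label is present, in others it is not. Because the HTML content varies, I would like to handle this with BeautifulSoup or a similar approach. There are multiple div and span elements, which makes them hard to select without an id or class name.
My idea was something like this: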
t = open('snippet.html', 'rb').read().decode('iso-8859-1')
s = BeautifulSoup(t, 'lxml')
s.find('div.divName[label*=Price]')
s.find('div.divName[label*=Ships from]')
However, this returns an empty list.
Answer 0 (score: 3)
Use select to find each label, then use find_next_sibling().text on it.
Example:
from bs4 import BeautifulSoup
html = """<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for lab in soup.select("label"):
    # The value is the first tag that follows the label
    print(lab.find_next_sibling().text)
Output:
22.99
EU
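Since the question mentions that the "Ships from" label is missing in some files, it may help to collect the pairs into a dict and look values up with a default. The following is only a minimal sketch: the helper name extract_labels and the sample string html_without_shipping are assumptions made for illustration, not part of the original question.

```python
from bs4 import BeautifulSoup


def extract_labels(html):
    """Map each <label> text to the text of the tag that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for lab in soup.select("div.divName label"):
        # find_next_sibling() skips whitespace and returns the next tag, or None
        sibling = lab.find_next_sibling()
        if sibling is not None:
            values[lab.get_text(strip=True)] = sibling.get_text(strip=True)
    return values


# Hypothetical file content in which the "Ships from" block is absent
html_without_shipping = """<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
</div>"""

values = extract_labels(html_without_shipping)
print(values.get("Price"))       # 22.99
print(values.get("Ships from"))  # None, because the label is not in this file
```

Using dict.get() means a missing label yields None instead of raising an exception, so the same code can run over files with and without the optional block.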
Answer 1 (score: 1)
Try this:
from bs4 import BeautifulSoup
from bs4.element import Tag
html = """ <div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>"""
s = BeautifulSoup(html, 'lxml')
row = s.find(class_='divName')
Solution 1:
for tag in row.findChildren():
    # Skip wrapper elements that contain more than one child node
    if len(tag) > 1:
        continue
    # The leaf <span> and <div> elements hold the values
    if isinstance(tag, Tag) and tag.name in ('span', 'div'):
        print(tag.text)
Solution 2:
for lab in row.select("label"):
    print(lab.find_next_sibling().text)
Output:
22.99
EU
Answer 2 (score: 0)
You can use :contains (available with bs4 4.7.1) together with next_sibling:
from bs4 import BeautifulSoup as bs
html = '''
<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = soup.select('label:contains(Price), label:contains("Ships from")')
for item in items:
    print(item.text, item.next_sibling.next_sibling.text)
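In newer releases of soupsieve (the selector engine behind select), :contains is deprecated in favour of :-soup-contains; as far as I know this applies from soupsieve 2.1 onward, so check your installed version. The sketch below uses that spelling, and find_next_sibling() instead of chaining next_sibling twice, so it does not depend on the whitespace between tags. The helper name value_for is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label_text):
    """Return the text of the tag following the matching label, or None."""
    # :-soup-contains matches elements whose text contains the given string
    lab = soup.select_one(f'label:-soup-contains("{label_text}")')
    if lab is None:
        return None
    sibling = lab.find_next_sibling()
    return sibling.get_text(strip=True) if sibling is not None else None


print(value_for(soup, "Price"))
print(value_for(soup, "Ships from"))
print(value_for(soup, "Weight"))  # label not present in this snippet
```

Returning None for a label that is absent keeps the lookup safe for files in which "Ships from" does not appear.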