I want to select the second span and grab its text so I can print it. The HTML and my BeautifulSoup code are below.
#HTML code
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
    productTitle = h1.find('span').text
    print(productTitle)
Answer 0: (score: 0)
This gets all the fields you need inside the h1 tag:
Python code:
from bs4 import BeautifulSoup
text = '''
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(text,features='html.parser')
#BeautifulSoup code
for h1 in soup.find_all('h1', id="productTitle"):
    spans = h1.find_all('span')
    print('productBrand == > {}'.format(spans[0].text))
    print('productTitle == > {}'.format(spans[1].text))
To get all the spans in the h1:
for h1 in soup.find_all('h1', id="productTitle"):
    for i, span in enumerate(h1.find_all('span')):
        print('span {} == > {}'.format(i, span.text))
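Indexing spans[1] raises an IndexError when the h1 holds fewer than two spans. A minimal sketch of a guarded version follows; the helper name second_span_text is hypothetical, and html.parser is used only so the example needs no extra parser:

```python
from bs4 import BeautifulSoup

html = '''
<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')


def second_span_text(h1):
    """Return the stripped text of the second span under h1, or None."""
    spans = h1.find_all('span')
    if len(spans) < 2:
        # Fewer than two spans: nothing to index, so signal that explicitly.
        return None
    return spans[1].get_text(strip=True)


title = second_span_text(soup.find('h1', id='productTitle'))
print(title)
```

get_text(strip=True) also drops the trailing space that .text would keep.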
Answer 1: (score: 0)
Hopefully (though it is not always the case) the id is unique, which means find_all is probably not needed.
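If the id really is unique, a single find call is enough. The sketch below assumes the markup from the question and uses recursive=False so that only direct children of the h1 are considered; the brand span is nested inside the a tag, so the first direct child span is the title:

```python
from bs4 import BeautifulSoup

html = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')

# Look the element up by id once instead of looping over find_all results.
h1 = soup.find(id='productTitle')

# recursive=False restricts the search to direct children of the h1.
title_span = h1.find('span', recursive=False) if h1 else None
title = title_span.get_text(strip=True) if title_span else None
print(title)
```

This depends on the brand span staying nested inside the link; if the page layout changes, the selector needs to change with it.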
With bs4 4.7.1+ you can use :not to exclude the child span that has an id:
from bs4 import BeautifulSoup as bs
html = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>
'''
soup = bs(html, 'lxml')
print(soup.select_one('#productTitle span:not([id])').text)
You can also use nth-child:
print(soup.select_one('#productTitle span:nth-child(2)').text)
or
print(soup.select_one('#productTitle span:nth-child(even)').text)
Or even the adjacent sibling combinator, to get the span that comes right after the child a:
print(soup.select_one('#productTitle a + span').text)
Or chained next_sibling calls:
print(soup.select_one('#productTitle a').next_sibling.next_sibling.text)
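Chaining next_sibling twice works here only because a newline text node sits between the a tag and the span; with the same markup written on one line, the second next_sibling is None. A hedged alternative is find_next_sibling('span'), which skips text nodes. In this sketch the compact string and the helper name title_after_link are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

pretty = '''<h1 id="productTitle">
<a href="https://www.example.com/product/">
<span id="productBrand">BRAND</span>
</a>
<span>PRODUCT TITLE </span>
</h1>'''

# Same markup with no whitespace between tags (hypothetical variant).
compact = ('<h1 id="productTitle"><a href="https://www.example.com/product/">'
           '<span id="productBrand">BRAND</span></a>'
           '<span>PRODUCT TITLE </span></h1>')


def title_after_link(markup):
    """Return the text of the first span sibling that follows the link."""
    soup = BeautifulSoup(markup, 'html.parser')
    link = soup.select_one('#productTitle a')
    if link is None:
        return None
    # find_next_sibling ignores intervening text nodes such as "\n".
    span = link.find_next_sibling('span')
    return span.get_text(strip=True) if span else None


results = [title_after_link(pretty), title_after_link(compact)]
print(results)
```

Both inputs give the same title, so the lookup no longer depends on how the page is whitespace-formatted.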