Using BeautifulSoup to find a div's text by its label

Time: 2019-05-22 08:40:11

Tags: python html web-scraping beautifulsoup python-3.6

Given the following HTML snippet, I want to extract the values that correspond to the labels "Price" and "Ships from":

<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>

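This is part of a larger HTML file. The "Ships from" label is present in some files and absent in others. Because the HTML content varies, I would like to handle this with BeautifulSoup or a similar approach. There are multiple div and span elements, which makes them hard to select without an id or class name.

My idea was something like this: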

t = open('snippet.html', 'rb').read().decode('iso-8859-1')
s = BeautifulSoup(t, 'lxml')
s.find('div.divName[label*=Price]')
s.find('div.divName[label*=Ships from]')

However, this returns an empty list.

3 answers:

Answer 0 (score: 3)

Use select to find each label, then read find_next_sibling().text on it.

For example:

from bs4 import BeautifulSoup

html = """<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
for lab in soup.select("label"):
    print(lab.find_next_sibling().text)

Output:

22.99
EU
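
Since the question says the "Ships from" label is sometimes missing, it may help to collect the pairs into a dict and look values up by label text. The following is only a sketch under that assumption; the helper name labelled_values and the "n/a" default are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""


def labelled_values(markup):
    """Map each label's text to the text of the element that follows it.

    Hypothetical helper, not part of the original answer.
    """
    soup = BeautifulSoup(markup, "html.parser")
    values = {}
    # Restrict the search to labels inside the div with class "divName".
    for lab in soup.select("div.divName label"):
        sibling = lab.find_next_sibling()
        # A label with no following element is skipped instead of raising.
        if sibling is not None:
            values[lab.get_text(strip=True)] = sibling.get_text(strip=True)
    return values


values = labelled_values(html)
print(values.get("Price"))
# dict.get returns the default when a label is absent from the page.
print(values.get("Ships from", "n/a"))
```

With this shape, a page without a "Ships from" label simply yields a dict without that key, so the caller decides what a missing value means.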

Answer 1 (score: 1)

Try this:

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """ <div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""

s = BeautifulSoup(html, 'lxml')
row = s.find(class_='divName')

Solution 1:

for tag in row.find_all():
    # Skip wrapper elements that hold more than one child node.
    if len(tag.contents) > 1:
        continue
    # Compare against a tuple of names; `in 'span'` would be a substring test.
    if isinstance(tag, Tag) and tag.name in ('span', 'div'):
        print(tag.text)

Solution 2:

for lab in row.select("label"):
    print(lab.find_next_sibling().text)

Output:

22.99
EU

Answer 2 (score: 0)

You can use :contains (available with bs4 4.7.1) together with next_sibling:

from bs4 import BeautifulSoup as bs

html = '''
<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>
'''

soup = bs(html, 'lxml')
items = soup.select('label:contains(Price), label:contains("Ships from")')

for item in items:
    print(item.text, item.next_sibling.next_sibling.text)
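
A variation on the same idea, sketched here for the case where a label may be absent: look up one label at a time and return None when it is not found. This assumes a Soup Sieve version of 2.1 or later, where :-soup-contains is the preferred spelling of :contains. The helper name value_for is hypothetical, and find_next_sibling() is used in place of the double next_sibling so that the whitespace text node between the tags does not matter.

```python
from bs4 import BeautifulSoup

html = """
<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label_text):
    """Return the text of the element after the matching label, or None.

    Hypothetical helper; label_text must not contain double quotes.
    """
    # Assumption: soupsieve >= 2.1, which provides :-soup-contains.
    label = soup.select_one(f'label:-soup-contains("{label_text}")')
    if label is None:
        return None
    # find_next_sibling() skips text nodes and returns the next tag, if any.
    sibling = label.find_next_sibling()
    return sibling.get_text(strip=True) if sibling is not None else None


for name in ("Price", "Ships from", "Weight"):
    print(name, value_for(soup, name))
```

Note that the match is a substring test on the label's text, so a label such as "Price per unit" would also match "Price".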