Python BeautifulSoup: extract the text immediately after a specific tag

Date: 2019-04-10 11:19:33

Tags: python python-3.x web-scraping beautifulsoup

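I am trying to extract information from a web page using BeautifulSoup and Python. I want to extract the information that sits under a specific tag. To know whether I have the right tag, I want to compare its text and then extract the text of the immediately following tag.
For example, suppose the following is part of the HTML page source: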

<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>

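I want to check whether the text of <p class="title"> is Procurement type, and if so print Services.
Likewise, if the <p class="title"> text is Reference, I want to print ANAJSKJD23423-Commission, and if it is Countries, I want to print all the countries, i.e. Belgium, France, Luxembourg.

I know I could extract all the text from <p class="data strong"> into a list and then use indexes to get the values. The problem is that the order in which these <p class="title"> tags appear is not fixed. In some places Countries can come before Procurement type. So I want to check the text value and then extract the text of the immediately following tag. I am still new to BeautifulSoup, so any help is appreciated. Thanks.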

3 answers:

Answer 0 (score: 4)

There are several ways to do this. Here is one.

from bs4 import BeautifulSoup
htmldata='''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

soup=BeautifulSoup(htmldata,'html.parser')

# Find every title paragraph, then print the text of the <p> that follows it
items=soup.find_all('p', class_='title')
for item in items:
    if ('Procurement type' in item.text) or ('Reference' in item.text):
        print(item.find_next('p').text)
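Since the question stresses that the titles can appear in any order, one option is to collect every title/value pair into a dictionary and look values up by name. The following is only a sketch under assumptions: the helper name `fields_by_title` is hypothetical, the markup is a shortened and reordered copy of the question's sample, and it assumes each value is the next sibling <p> of its title.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup, with the rows in a different order.
html = '''
<div class="row">
  <div class="twelve columns">
    <p class="title">Countries</p>
    <p class="data strong"><span>Belgium</span>, <span>France</span></p>
  </div>
  <div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
  </div>
  <div class="four columns">
    <p class="title">Reference</p>
    <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
</div>
'''

def fields_by_title(soup):
    """Map each title text to the text of the next sibling <p> (hypothetical helper)."""
    fields = {}
    for title in soup.find_all('p', class_='title'):
        value = title.find_next_sibling('p')
        if value is not None:
            # Collapse runs of whitespace so multi-line values stay on one line.
            fields[title.get_text(strip=True)] = ' '.join(value.get_text().split())
    return fields

fields = fields_by_title(BeautifulSoup(html, 'html.parser'))
print(fields['Procurement type'])
print(fields['Reference'])
print(fields['Countries'])
```

Because the lookup is by title text, the position of each block on the page no longer matters.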

Answer 1 (score: 2)

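You can also use the :contains pseudo-class with bs4 4.7.1. I have passed the selectors as a single comma-separated list, but you can split out each condition.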

from bs4 import BeautifulSoup as bs
import re

html = 'yourHTML'  # placeholder: put the page source here
soup = bs(html, 'lxml')
items=[re.sub(r'\n\s+','', item.text.strip()) for item in soup.select('p.title:contains("Procurement type") + p, p.title:contains(Reference) + p, p.title:contains(Countries) + p')]
print(items)
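As a side note, newer Soup Sieve releases (2.1 and later, as far as I know) spell this pseudo-class `:-soup-contains` and treat the bare `:contains` as deprecated. Here is a minimal sketch of the one-condition-per-query variant, assuming such a version is installed. The helper name `text_after_title` and the shortened markup are made up for illustration, and the title is assumed not to contain double quotes.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (assumption: same class names).
html = '''
<div><p class="title">Procurement type</p><p class="data strong">Services</p></div>
<div><p class="title">Reference</p><p class="data strong">ANAJSKJD23423-Commission</p></div>
<div><p class="title">Funding Agency</p><p class="data strong">Health Commission</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

def text_after_title(soup, title):
    """Return the text of the <p> right after the matching title, or None."""
    # "+" is the adjacent-sibling combinator: the element directly after p.title.
    selector = f'p.title:-soup-contains("{title}") + p'
    match = soup.select_one(selector)
    return match.get_text(strip=True) if match else None

print(text_after_title(soup, 'Reference'))
print(text_after_title(soup, 'Funding Agency'))
```

Note that this is a substring test, so a title such as "Reference number" would also match "Reference".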


Answer 2 (score: 1)

When you use .find() or .find_all(), and then .next_sibling or findNext() to get the next tag that holds the content, you can pass an extra argument to check for specific text.

For example:

soup.find('p', {'class':'title'}, text = 'Procurement type')
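One caveat with this lookup: as far as I know, a plain string is compared exactly, so a title with stray whitespace or line breaks around it would not be found. A hedged sketch of a whitespace-tolerant variant follows. It passes a compiled regular expression through the `string` argument (the newer name for `text`, assuming bs4 4.4.0 or later); the helper name `find_title` and the sample markup are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumed variant of the markup where the title text is padded with whitespace.
html = ('<div><p class="title">\n  Reference\n</p>'
        '<p class="data strong">ANAJSKJD23423-Commission</p></div>')
soup = BeautifulSoup(html, 'html.parser')

def find_title(soup, title):
    """Find the title <p> whose text equals `title`, ignoring surrounding whitespace."""
    pattern = re.compile(r'^\s*' + re.escape(title) + r'\s*$')
    return soup.find('p', class_='title', string=pattern)

title = find_title(soup, 'Reference')
value = title.find_next_sibling('p').get_text(strip=True)
print(value)
```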

Given:

html = '''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

You can do the following:

from bs4 import BeautifulSoup     

soup = BeautifulSoup(html, 'html.parser')

alpha = soup.find('p', {'class':'title'}, text = 'Procurement type')
# find_next_siblings() returns only tags, so the whitespace strings between them are skipped
for sibling in alpha.find_next_siblings():
    print(sibling.text)

Output:

Services

ref = soup.find('p', {'class':'title'}, text = 'Reference')
# find_next_siblings() returns only tags, so the whitespace strings between them are skipped
for sibling in ref.find_next_siblings():
    print(sibling.text)

Output:

ANAJSKJD23423-Commission    

countries = soup.find('p', {'class':'title'}, text = 'Countries')
names = countries.findNext('p', {'class':'data strong'}).text.replace('", "','').strip().split('\n')
names = [name.strip() for name in names if not name.isspace()]

for country in names:
    print (country)

Output:

Belgium
France
Luxembourg
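The quoted ", " lines in the question look like the way browser developer tools display text nodes, so in the real page source they are probably bare commas between the spans (this is an assumption about the page, not something the question confirms). If so, replacing the literal '", "' would find nothing to replace. A sketch that avoids the separators entirely by reading the <span> tags directly:

```python
from bs4 import BeautifulSoup

# Assumed raw form of the Countries block: plain commas between the spans.
html = '''
<div class="twelve columns">
  <p class="title">Countries</p>
  <p class="data strong">
    <span>Belgium</span>, <span>France</span>, <span>Luxembourg</span>
  </p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('p', class_='title', string='Countries')
data = title.find_next_sibling('p')

# Each country sits in its own <span>, so the separator text never needs parsing.
countries = [span.get_text(strip=True) for span in data.find_all('span')]
for country in countries:
    print(country)
```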