Getting an element based on a preceding HTML element with Beautiful Soup (EPA website)

Time: 2018-10-29 16:44:26

Tags: web-scraping beautifulsoup

I want to print the "Civil Penalty" section of EPA settlement pages such as https://www.epa.gov/enforcement/chevron-settlement-information-sheet and https://www.epa.gov/enforcement/ngl-crude-logistics-llc-clean-air-act-settlement

Given the following HTML source:

<h2 id="civil">Civil Penalty</h2>
<p>Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.</p>

I want to get: Chevron U.S.A. will pay a $2.95 million civil penalty...

All of the settlement information sheets have the same structure. For example:

<h2 id="civil">Civil Penalty</h2>
<p>NGL will pay a civil penalty of $25 million. The penalty is based, in part, on the company’s limited ability to pay a larger penalty.</p>

I found "Get an element before a string with Beautiful Soup", which is similar, but it is not quite the same as my problem.

Here is my code skeleton:

import requests
from bs4 import BeautifulSoup
import sys

for i in ['chevron-settlement-information-sheet', 'ngl-crude-logistics-llc-clean-air-act-settlement', 'derive-systems-clean-air-act-settlement']:

    page = requests.get("https://www.epa.gov/enforcement/"+i)
    soup = BeautifulSoup(page.content, 'html.parser')

    data = []

    for result in soup.find_all('h2', id='civil'):
        data.append(result)

print(data)

How can I print the <p> that comes directly after <h2 id="civil">?

2 answers:

Answer 0 (score: 1)

You can try the adjacent sibling selector, +:

p=soup.select('#civil + p')
print(p[0].getText())

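This selects a p element only if it is the next sibling of the element with id civil, that is, the paragraph immediately following the heading.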
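To make the behaviour of `#civil + p` concrete, here is a minimal sketch that runs against an inline HTML string rather than the live EPA pages. The helper name `civil_penalty_text` and the sample markup are assumptions made for illustration; only `select_one` and the `+` combinator come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (not fetched from epa.gov).
SAMPLE_HTML = """
<h2 id="civil">Civil Penalty</h2>
<p>NGL will pay a civil penalty of $25 million.</p>
<h2 id="other">Other</h2>
<p>Unrelated paragraph.</p>
"""


def civil_penalty_text(html):
    """Return the text of the <p> immediately following #civil, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns the first match or None, so no IndexError on a miss.
    paragraph = soup.select_one("#civil + p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


if __name__ == "__main__":
    print(civil_penalty_text(SAMPLE_HTML))
```

Because `+` only matches an adjacent sibling, the helper returns None if some other element sits between the heading and the paragraph. Using `select_one` instead of `select(...)[0]` avoids an IndexError on pages that lack the section.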

Answer 1 (score: 0)

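One reason you may not be getting the result you want is that you appended /history to the URL, which leads to a 404 error page. If you remove that part and then use findNext('p') to get the next paragraph element after <h2 id="civil"> on the page, you get the expected result: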

import requests
from bs4 import BeautifulSoup

for url in ['chevron-settlement-information-sheet', 'ngl-crude-logistics-llc-clean-air-act-settlement', 'derive-systems-clean-air-act-settlement']:

    page = requests.get("https://www.epa.gov/enforcement/" + url)
    soup = BeautifulSoup(page.content, 'html.parser')

    result = soup.find('h2', {'id': 'civil'}).findNext('p')
    print(result.text)

Printed output:

Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.
NGL will pay a civil penalty of $25 million. The penalty is based, in part, on the company’s limited ability to pay a larger penalty.
Derive will pay a civil penalty of $300,000, as the company has limited financial ability to pay a higher penalty.
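On a 404 page like the one mentioned above, `soup.find('h2', {'id': 'civil'})` returns None and the chained `.findNext('p')` call raises an AttributeError. The following is a hedged sketch of a guarded variant, run on inline HTML so that it does not depend on the network. The function name `penalty_after_heading` and the sample strings are hypothetical; `find_next` is the current spelling of the same method as `findNext`.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; these strings are assumptions, not real EPA responses.
SETTLEMENT_PAGE = (
    '<h2 id="civil">Civil Penalty</h2>'
    "<p>Derive will pay a civil penalty of $300,000.</p>"
)
NOT_FOUND_PAGE = "<h1>Page not found</h1><p>Sorry, we could not find that page.</p>"


def penalty_after_heading(html):
    """Return the text of the first <p> after <h2 id="civil">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", id="civil")
    if heading is None:
        # Typical for an error page: the heading is simply absent.
        return None
    # find_next walks forward in document order, so it also finds a <p>
    # nested inside a later element, unlike the "+" sibling selector.
    paragraph = heading.find_next("p")
    return paragraph.get_text(strip=True) if paragraph is not None else None


if __name__ == "__main__":
    for page in (SETTLEMENT_PAGE, NOT_FOUND_PAGE):
        print(penalty_after_heading(page))
```

When fetching real pages, checking `page.status_code` (or calling `page.raise_for_status()`) before parsing would surface the 404 earlier; that step is left out here to keep the example offline.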