strong.text.strip() == 'Classes of business'
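does not seem to evaluate to True.
I'm very experienced with BeautifulSoup and Python, but for whatever reason I can't seem to grab this data. Below is the block of HTML I'm working with: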
<div class="marketing-directories-results">
<ul>
<li>
<div class="contact-details">
<h2>
A I I Insurance Brokerage of Massachusetts Inc
</h2>
<br/>
<address>
183 Davis St
<br/>
East Douglas
<br/>
Massachusetts
<br/>
U S A
<br/>
MA 01516-113
</address>
<p>
<a href="http://www.agencyint.com">
www.agencyint.com
</a>
</p>
</div>
<span data-toggle=".info-cov-0">
Additional trading information
<i class="icon plus">
</i>
</span>
<ul class="result-info info-cov-0 cc">
<li>
<strong>
Accepts Business From:
</strong>
<ul class="cc">
<li>
U.S.A
</li>
</ul>
</li>
<li>
<strong>
Classes of business
</strong>
<ul class="cc">
<li>
Engineering
</li>
<li>
NM General Liability (US direct)
</li>
<li>
Property D&F (US binder)
</li>
<li>
Terrorism
</li>
</ul>
</li>
<li>
<strong>
Disclaimer:
</strong>
<p>
Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
</p>
<p>
it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
</p>
<p>
the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
</p>
</li>
</ul>
</li>
<li>
<div class="contact-details">
<h2>
ABCO Insurance Underwriters Inc
</h2>
<br/>
<address>
ABCO Building, 350 Sevilla Avenue, Suite 201
<br/>
Coral Gables
<br/>
Florida
<br/>
U S A
<br/>
33134
</address>
<p>
<a href="http://www.abcoins.com">
www.abcoins.com
</a>
</p>
</div>
<span data-toggle=".info-cov-1">
Additional trading information
<i class="icon plus">
</i>
</span>
<ul class="result-info info-cov-1 cc">
<li>
<strong>
Accepts Business From:
</strong>
<ul class="cc">
<li>
U.S.A
</li>
</ul>
</li>
<li>
<strong>
Classes of business
</strong>
<ul class="cc">
<li>
Property D&F (US binder)
</li>
<li>
Terrorism
</li>
</ul>
</li>
<li>
<strong>
Disclaimer:
</strong>
<p>
Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
</p>
<p>
it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
</p>
<p>
the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
</p>
</li>
</ul>
</li>
</ul>
</div>
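The page shows 10 records (I've included the HTML for only the first two, so that I can get help iterating over each company). Each record corresponds to one company plus further information about it, such as its address, its website URL, and details like "Accepts Business From: U.S.A".
I've been able to get the name, address and website URL, but I can't get the "U.S.A" under "Accepts Business From" for each company (where it exists) and store it at the correct position in a list.
I can reach the first U.S.A with the following: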
other_info = comp_info_area.find_all('li')
other_info_next = other_info[0].find('ul')
other_info_next_next = other_info_next.find_all('li')
other_info_next_next_next = other_info_next_next[0].find('ul', class_='cc')
other_info_next_next_next_next = other_info_next_next_next.find('li')
print(other_info_next_next_next_next.text)
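where comp_info_area is the BeautifulSoup object that stores the HTML above. This returns: U.S.A
How can I grab the rest? I can't figure out how to navigate the tree to get there. Any help is greatly appreciated, thanks.
Edit: here is an example of a company that doesn't have this information: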
<li>
<div class="contact-details">
<h2>
Acadian Managers, LLC
</h2>
<br/>
<address>
8550 United Plaza Boulevard
<br/>
Suite 702
<br/>
Baton Rouge
<br/>
Louisiana
<br/>
U.S.A
<br/>
70809
</address>
<p>
<a href="http://www.acadianmanagers.com">
www.acadianmanagers.com
</a>
</p>
</div>
</li>
Answer 0 (score: 0)
You can create a dictionary for each company and then append it to a list.
from bs4 import BeautifulSoup

# Assumes `html` holds the page source shown in the question.
soup = BeautifulSoup(html, 'html.parser')

# Get the <ul> tag which contains all the companies.
results = soup.find('div', class_='marketing-directories-results').ul
companies_info = []

# Iterate over the companies (all <li> tags that are direct children of results
# can be found by setting 'recursive=False').
for company in results.find_all('li', recursive=False):
    company_info = {}
    company_info['Name'] = company.find('h2').text.strip()
    company_info['Address'] = company.find('address').get_text(', ', strip=True)
    company_info['Website'] = company.find('a', href=True)['href']
    try:
        # Only the first section of the 'result-info' list is inspected here.
        li = company.find('ul', class_='result-info').find('li')
        if li.strong.text.strip() == 'Accepts Business From:':
            company_info['Accepts Business From'] = li.find('li').text.strip()
    except AttributeError:
        # If this error is caught, it means this info is not available.
        # You can use the below line to set it to 'None', or simply use 'pass' to do nothing.
        company_info['Accepts Business From'] = None
    companies_info.append(company_info)

print(companies_info)
Output:
[
{
'Name': 'A I I Insurance Brokerage of Massachusetts Inc',
'Address': '183 Davis St, East Douglas, Massachusetts, U S A, MA 01516-113',
'Website': 'http://www.agencyint.com',
'Accepts Business From': 'U.S.A'
},
{
'Name': 'ABCO Insurance Underwriters Inc',
'Address': 'ABCO Building, 350 Sevilla Avenue, Suite 201, Coral Gables, Florida, U S A, 33134',
'Website': 'http://www.abcoins.com',
'Accepts Business From': 'U.S.A'
},
{
'Name': 'Acadian Managers, LLC',
'Address': '8550 United Plaza Boulevard, Suite 702, Baton Rouge, Louisiana, U.S.A, 70809',
'Website': 'http://www.acadianmanagers.com',
'Accepts Business From': None
}
]
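If you also need the other labelled sections, such as "Classes of business", one possible approach is to loop over every direct <li> child of the 'result-info' list instead of looking only at the first one. The following is a minimal sketch rather than a tested solution for the live page: the html string is a trimmed, hypothetical sample that mirrors the structure in the question, and extract_sections is a helper name made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample that mirrors the structure of the HTML in the question.
html = """
<div class="marketing-directories-results">
<ul>
<li>
<div class="contact-details"><h2> ABCO Insurance Underwriters Inc </h2></div>
<ul class="result-info info-cov-1 cc">
<li><strong> Accepts Business From: </strong>
<ul class="cc"><li> U.S.A </li></ul></li>
<li><strong> Classes of business </strong>
<ul class="cc"><li> Property D&amp;F (US binder) </li><li> Terrorism </li></ul></li>
<li><strong> Disclaimer: </strong><p> Placeholder disclaimer text </p></li>
</ul>
</li>
<li>
<div class="contact-details"><h2> Acadian Managers, LLC </h2></div>
</li>
</ul>
</div>
"""


def extract_sections(company):
    """Map each <strong> label to the text of the list items that follow it."""
    sections = {}
    info = company.find('ul', class_='result-info')
    if info is None:
        # This company has no additional trading information.
        return sections
    for section in info.find_all('li', recursive=False):
        label = section.find('strong')
        items = section.find('ul')
        if label is None or items is None:
            # Skip sections without a nested list, e.g. the disclaimer.
            continue
        key = label.get_text(strip=True).rstrip(':')
        sections[key] = [li.get_text(strip=True) for li in items.find_all('li')]
    return sections


soup = BeautifulSoup(html, 'html.parser')
results = soup.find('div', class_='marketing-directories-results').ul

companies = []
for company in results.find_all('li', recursive=False):
    record = {'Name': company.find('h2').get_text(strip=True)}
    record.update(extract_sections(company))
    companies.append(record)

print(companies)
```

Each section is stored as a list because a company can have several entries under one label. A company without a 'result-info' list simply ends up without those keys, so use record.get('Accepts Business From') if you would rather get None for missing data.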