My HTML page looks like this:
data = <section class="otln" itemscope="" itemtype="http://microformats.org/wiki/hCard">
<header>
<h3 class="org">Website:</h3>
</header>
<p><a href="http://www.abilityone.gov">U.S. AbilityOne Commission </a></p> </section>,
<section class="otln" itemscope="" itemtype="http://microformats.org/wiki/hCard">
<header>
<h3 itemprop="name">Main Address:</h3>
</header>
<p class="spk street-address">1401 S. Clark Street<br/>Suite 715<br/><span class="locality">Arlington</span>, <span class="region">VA</span> <span class="postal-code">22202-3259</span></p> </section>,
<section class="otln" itemscope="" itemtype="http://microformats.org/wiki/hCard">
<header>
<h3 itemprop="name">Phone Number:</h3>
</header>
<p>1-703-603-7740</p> </section>,
<section class="otln" itemscope="" itemtype="http://microformats.org/wiki/hCard">
<header>
<h3 class="org">Government branch:</h3>
</header>
<p>Executive Department Sub-Office/Agency/Bureau</p>
</section>
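I want to extract all the details from the <p> tags of this HTML page: the href for the website, plus the main address, phone number, and government branch. I have tried many different variations to get them, but I could not get it to work.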
EDITED
My code:
soup = BeautifulSoup(data,'lxml')
website.append([l.find('a')['href'] for l in soup.find_all('section',class_='otln')])
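The attempt above to get the 'href' raises TypeError: 'NoneType' object is not subscriptable
I already have working solutions for the main address, phone number, and government branch. What I still need is the website's 'href', i.e. "http://www.abilityone.gov".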
Answer 0 (score: 1)
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
for h, p in zip(soup.find_all('h3'), soup.find_all('p')):
    # h is the header, p is the paragraph
    a = p.find('a')  # is it the website?
    # find() returns None when the paragraph has no <a>, so test before indexing
    print('%-20s\t%s' % (h.text, a['href'] if a is not None else p.text))
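Pairing headings and paragraphs by position with zip only works if every section has exactly one h3 and one p. As a rough sketch of an alternative, the lookup can be done inside each section so a heading is always matched with its own paragraph. The function name extract_details, the shortened data string, and the use of 'html.parser' instead of 'lxml' are assumptions made for this illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the four sections).
data = """
<section class="otln">
<header><h3 class="org">Website:</h3></header>
<p><a href="http://www.abilityone.gov">U.S. AbilityOne Commission </a></p>
</section>
<section class="otln">
<header><h3 itemprop="name">Phone Number:</h3></header>
<p>1-703-603-7740</p>
</section>
"""


def extract_details(html):
    """Return {heading: value}, pairing each <h3> with the <p> in the same section.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    details = {}
    for section in soup.find_all('section', class_='otln'):
        heading = section.find('h3')
        paragraph = section.find('p')
        if heading is None or paragraph is None:
            continue  # skip sections that lack either element
        link = paragraph.find('a', href=True)
        # Use the href when the paragraph holds a link, otherwise its text.
        if link is not None:
            value = link['href']
        else:
            value = paragraph.get_text(' ', strip=True)
        details[heading.get_text(strip=True).rstrip(':')] = value
    return details


details = extract_details(data)
print(details)
```

Because the result is a dictionary, the website link alone can be read with details['Website'], and the check for None avoids the TypeError from the question.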