I want to scrape a web page, for example:
https://www.glassdoor.com/Overview/Working-at-Apple-EI_IE1138.11,16.htm
and return the result in the following format:
Website Headquarters Size Revenue Type
www.apple.com Cupertino, CA 10000+ employees $10+ billion (USD) per year Company - Public (AAPL)
I then used the following code with BeautifulSoup:
all_href = com_soup.find_all('span', {'class': re.compile('value')})
all_href = list(set(all_href))
It returns the matching <span> tags, but it does not pick up the field names in the <label> tags:
[<span class="value"> Computer Hardware & Software</span>,
<span class="value"> Company - Public (AAPL) </span>,
<span class="value">10000+ employees</span>,
<span class="value"> $10+ billion (USD) per year</span>,
<span class="value-title" title="4.0"></span>,
<span class="value">Cupertino, CA</span>,
<span class="value"> 1976</span>,
<span class="value-title" title="5.0"></span>,
<span class="value website"><a class="link" href="http://www.apple.com" rel="nofollow noreferrer" target="_blank">www.apple.com</a></span>]
Answer 0 (score: 1)
Your BeautifulSoup query is too specific: it captures every "span" tag with class="value" anywhere on the page.
When you look at the HTML, you can quickly locate the relevant section by searching for the text of some of the fields. What you should do instead is grab everything inside the div tags with class="infoEntity", which contain all 7 fields you want from the "Overview" section.
Within each of those divs there is a label tag whose text corresponds to the field names you listed above.
So, starting with:
from bs4 import BeautifulSoup
data = """
<div class="eep-pill"><p class="tightVert h2 white"><strong>Enhanced</strong> Profile <span class="round ib"><i class="icon-star-white"></i></span></p></div></header><section class="center flex-grid padVertLg eepModal"><h2>Try Enhanced Profile Free for a Month</h2><p>Explore the many benefits of having a premium branded profile on Glassdoor, like increased influence and advanced analytics.</p><div class="margBot"><i class="feaIllustration"></i></div><a href='/employers/enhanced/landing_input.htm?src=info_mod' class='gd-btn gd-btn-link gradient gd-btn-1 gd-btn-med span-1-2'><span>Get Started</span><i class='hlpr'></i></a><p>Changes wont be saved until you sign up for an Enhanced Profile subscription.</p></section></div></article><article id='MainCol'><div id='EmpBasicInfo' class='module empBasicInfo ' data-emp-id='1138'><div class=''><header class='tbl fill '><h2 class='cell middle tightVert blockMob'> Apple Overview</h2></header><div class='info flexbox row col-hh'><div class='infoEntity'><label>Website</label><span class='value website'><a class="link" href="http://www.apple.com" target="_blank" rel="nofollow noreferrer">www.apple.com</a></span></div><div class='infoEntity'><label>Headquarters</label><span class='value'>Cupertino, CA</span></div><div class='infoEntity'><label>Size</label><span class='value'>10000+ employees</span></div><div class='infoEntity'><label>Founded</label><span class='value'> 1976</span></div><div class='infoEntity'><label>Type</label><span class='value'> Company - Public (AAPL) </span></div><div class='infoEntity'><label>Industry</label><span class='value'> Computer Hardware & Software</span></div><div class='infoEntity'><label>Revenue</label><span class='value'> $10+ billion (USD) per year</span></div></div></div><div class=''><div data-full="We&rsquo;re a diverse collection of thinkers and doers, continually reimagining what&rsquo;s possible to help us all do what we love in new ways. 
The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices &mdash; strengthening our commitment to leave the world better than we found it." class='margTop empDescription'> We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with ... <span class='link minor moreLink' id='ExpandDesc'>Read more</span></div><div class='hr'><hr/></div><h3 class='margTop'>Glassdoor Awards</h3>
"""
items = []
soup = BeautifulSoup(data, 'lxml')
# Each infoEntity div holds one (label, value) pair from the Overview section.
for item in soup.find_all("div", {"class": "infoEntity"}):
    label = item.find("label")
    value = item.find("span")
    items.append((label.string, value.string))
That way, you get a list of tuples in items, whose output is:
[('Website', 'www.apple.com'), ('Headquarters', 'Cupertino, CA'), ('Size', '10000+ employees'), ('Founded', ' 1976'), ('Type', ' Company - Public (AAPL) '), ('Industry', ' Computer Hardware & Software'), ('Revenue', ' $10+ billion (USD) per year')]
From there, you can print the list in whatever format you like.
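For instance (this reshaping step is my illustration, not part of the original answer), the tuple list can be turned into the single-row table the question asks for; the five-column selection below is an assumption based on the desired output:

```python
# Tuple list as produced by the loop above.
items = [('Website', 'www.apple.com'), ('Headquarters', 'Cupertino, CA'),
         ('Size', '10000+ employees'), ('Founded', ' 1976'),
         ('Type', ' Company - Public (AAPL) '),
         ('Industry', ' Computer Hardware & Software'),
         ('Revenue', ' $10+ billion (USD) per year')]

# Strip the stray whitespace seen in the raw values and index by label.
info = {label: value.strip() for label, value in items}

# The question only asks for these five columns, in this order.
columns = ['Website', 'Headquarters', 'Size', 'Revenue', 'Type']
print('\t'.join(columns))
print('\t'.join(info[c] for c in columns))
```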
Answer 1 (score: 0)
Looking at https://www.glassdoor.com/Overview/Working-at-Apple-EI_IE1138.11,16.htm, I noticed that you should search for <div class="infoEntity">
rather than <span class="value">
to get what you want.
(Note that find_all returns a list of tags, so you have to iterate over each matching div rather than calling find_all on the result directly.)
all_href = [tag for div in com_soup.find_all('div', {'class': re.compile('infoEntity')}) for tag in div.find_all(['span', 'label'])]
all_href = list(set(all_href))
It will return all of the <span>
and <label>
tags you want.
If you want to keep each <span>
and <label>
pair together, change it to:
all_href = [x.decode_contents(formatter="html") for x in com_soup.find_all('div', {'class': re.compile('infoEntity')})]
#or
all_href = [[x.find('span'), x.find('label')] for x in com_soup.find_all('div', {'class': re.compile('infoEntity')})]
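As a hedged sketch (assuming bs4 is installed; the two-div HTML snippet below is a trimmed stand-in for the real page, not the actual Glassdoor markup), the label/span pairs can be collapsed straight into a dict:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the infoEntity divs on the real page.
html = ('<div class="infoEntity"><label>Headquarters</label>'
        '<span class="value">Cupertino, CA</span></div>'
        '<div class="infoEntity"><label>Size</label>'
        '<span class="value">10000+ employees</span></div>')

soup = BeautifulSoup(html, 'html.parser')
# get_text(strip=True) trims the stray spaces seen in the raw span values.
info = {div.find('label').get_text(strip=True): div.find('span').get_text(strip=True)
        for div in soup.find_all('div', class_='infoEntity')}
print(info)
```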