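I am trying to get a fairly simple list of elements using find_all. No matter which parser I use, only a limited number of the returned elements contain anything useful; from some point on, every following element is empty, although they clearly should have content. I have seen many posts from people with similar problems, but those were always about getting an empty list. I thought it might be because the rest of the HTML is generated while scrolling down, but that is not the case.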
import requests
from bs4 import BeautifulSoup

URL = 'https://www.pracuj.pl/praca/analityk%20danych;kw/warszawa;wp?rd=0'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='results')
job_elems = results.find_all('li', class_='results__list-container-item')
for job_elem in job_elems:
    #title_elem = job_elem.find('a', class_='offer-details__title-link')
    #company_elem = job_elem.find('a', class_='offer-company__name')
    #location_elem = job_elem.find('li', class_='offer-labels__item offer-labels__item--location')
    #if title_elem is None:
    #    continue
    #print(title_elem.text.strip())
    #print(company_elem.text.strip())
    #print(location_elem.text.strip())
    print()
    print(job_elem)
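Edit: Sorry for being unclear. As @TanmayaMeher suggested, I did not paste any HTML because the link is available in the code and I thought the page would be easier to inspect directly.
The picture I provided should show the part of the output where the problem starts. Below is part of that output as text. The first block is the last element that is parsed as I expect; the remaining lines are the elements ("li" tags) that contain nothing, although I would expect them to look like the first one.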
<li class="results__list-container-item">
<div class="offer offer--border offer--remoterecruitment">
<div class="offer__click">
<a class="offer__click-area" href="https://www.pracuj.pl/praca/chapter-lead-data-standardization-risk-area-warszawa,oferta,1000235456"></a>
</div>
<div class="offer__info">
<div class="offer-details">
<div class="offer-logo">
<a href="https://pracodawcy.pracuj.pl/company/20058995/profile"><img alt="logo" class="offer-logo__image" src="https://i.gpcdn.pl/oferty-loga-firm/wyniki-wyszukiwania/44864.png"/></a>
</div>
<div class="offer-details__text">
<h3 class="offer-details__title">
<a class="offer-details__title-link" href="https://www.pracuj.pl/praca/chapter-lead-data-standardization-risk-area-warszawa,oferta,1000235456">Chapter lead – Data Standardization (Risk Area)</a>
</h3>
<p class="offer-company">
<span class="offer-company__link-wrapper"></span>
<span class="offer-company__wrapper">
<a class="offer-company__name" href="https://pracodawcy.pracuj.pl/company/20058995/profile">ING Tech Poland</a>
</span>
</p>
</div>
<div class="offer-details__badge-wrap offer-details__badge-wrap--remoterecruitment">
<i class="mdi mdi-cellphone-message offer-details__badge-icon"></i>
<span class="offer-details__badge-name offer-details__badge-name--remoterecruitment">Rekrutacja zdalna</span>
</div>
</div>
<div class="offer-labels__wrapper">
<ul class="offer-labels">
<li class="offer-labels__item offer-labels__item--location">
<i class="mdi mdi-map-marker offer-labels__item-icon"></i>Warszawa </li>
</ul>
<ul class="offer-labels">
<li class="offer-labels__item">
</li>
</ul>
</div>
<div class="offer-description">
<input class="offer-description__toggler" id="offer-description---cid-23435479" type="checkbox"/>
<label class="offer-description__toggler-label" for="offer-description---cid-23435479">
<i class="mdi mdi-chevron-down offer-description__toggler-icon"></i>
</label>
<div class="offer-description__content-wrap">
<span class="offer-description__content">
Must have You are open for other people and eager to take on new challenges, You are passionate about working with people and developing talents of others make you fulfilled, You prefer to concentrate on quality, innovation of created products...
</span>
</div>
</div>
</div>
<div class="offer-regions__port"></div>
<div class="offer-actions">
<span class="offer-actions__date">
<span class="offer-actions__date-long">opublikowana: </span>13 cze<span class="offer-actions__date-long">rwca</span> 2020
</span>
<div class="offer-actions__favs"></div>
</div>
</div>
</li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
Answer 0 (score: 0)
The data is embedded in the page as the JavaScript variable window.__INITIAL_STATE__. You can parse it with the re and json modules.
For example:
import re
import json
import requests
url = 'https://www.pracuj.pl/praca/analityk%20danych;kw/warszawa;wp?rd=0'
html_text = requests.get(url).text
data = json.loads( re.search(r'window\.__INITIAL_STATE__ = (.*?\});', html_text).group(1) )
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
for offer in data['offers']:
    print('{:<10}{:<80}{}'.format(offer['commonOfferId'], offer['jobTitle'], offer['employer']))
Prints:
23433957 Senior Business Intelligence Analyst w Zespole Data Intelligence Solutions KPMG
23436174 Quantitative Associate (Model Validation) ING Tech Poland
23436175 Data Analyst (Risk Modelling) (Risk Hub) ING Tech Poland
23436664 Reports Developer (VBA) Randstad Polska Sp. z o.o.
23440135 Firmwide Data Management - Data Integration - Change Project Lead – Associate J.P. Morgan Poland Services sp. z o.o.
23440182 Treasury – Product Control (P&L and Risk) Analyst J.P. Morgan Poland Services sp. z o.o.
23441295 Brand Reporting Manager JTI Polska sp. z o.o.
... and so on.
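If the regular expression ever stops matching, for instance because a string value in the data happens to contain "};", one possible alternative is to let the json module find the end of the object itself. The following is only a sketch under assumptions: the helper name extract_initial_state and the sample_html string are made up for illustration and do not come from the real page, and it assumes the assignment is still written as window.__INITIAL_STATE__ = {...}. It uses json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it.

```python
import json


def extract_initial_state(html_text, marker="window.__INITIAL_STATE__ ="):
    """Return the JSON value assigned to `marker`, or None if the marker is absent.

    Hypothetical helper for illustration; the marker text is an assumption
    about how the page writes the assignment.
    """
    start = html_text.find(marker)
    if start == -1:
        return None
    # Skip the marker and any whitespace before the opening brace,
    # because raw_decode expects the JSON value at the start of the string.
    rest = html_text[start + len(marker):].lstrip()
    # raw_decode returns (value, end_index) and ignores trailing text,
    # so the ";</script>" after the object does not matter.
    data, _end = json.JSONDecoder().raw_decode(rest)
    return data


# Made-up sample standing in for the real page source.
sample_html = '''
<ul id="results"><li class="results__list-container-item"></li></ul>
<script>window.__INITIAL_STATE__ = {"offers": [
  {"commonOfferId": 1, "jobTitle": "Analyst };", "employer": "Example Co"},
  {"commonOfferId": 2, "jobTitle": "Data Engineer"}
]};</script>
'''

state = extract_initial_state(sample_html)
for offer in state.get("offers", []):
    # .get() avoids a KeyError when an offer lacks a field.
    print(offer.get("commonOfferId"), offer.get("jobTitle"), offer.get("employer", "n/a"))
```

With the real page you would pass requests.get(url).text instead of sample_html and check for a None result before iterating.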