我有一个带有div标签的页面源,例如下面的示例页面源。我想像下面的示例一样抓取所有网址,并将它们保存在列表中。
示例网址:
/model-airplane-kits-s/2379.htm
来自:
<a data-control-id="aP52Q/QyTbqArQOpbKv4EQ==" data-control-name="A_jobssearch_job_result_click" href="/model-airplane-kits-s/2379.htm" id="ember1513" class="job-card-search__link-wrapper js-focusable-card disabled ember-view">
我尝试使用下面的代码从href中抓取网址。我正在尝试使用span类仅对包含job-card-search__easy-airplane的div标签进行过滤。该代码不返回任何网址,仅返回一个空列表。我是Beautifulsoup和Selenium的新手。如果有人可以指出我的问题,并提出解决方案,我将非常感激。尤其是如果您还可以给出一些解释,例如我需要如何搜索html的树形结构。
代码:
soup = BeautifulSoup(driver.page_source)
tags = soup.find_all('a', {'class': 'job-card-search__easy-airplane', 'href': True})
urls = [t['href'] for t in tags]
页面来源:
<div data-control-name="A_jobssearch_job_result_click" data-job-id="urn:li:fs_normalized_jobPosting:1175863492" tabindex="0" role="button" id="ember1507" class="job-card-search--two-pane jobs-search-results__list--card--viewport-tracking-1 job-card-search job-card-search--column job-card-search job-card-search--clickable job-card-search--outline-default ember-view"><artdeco-entity-lockup size="4" id="ember1508" class="artdeco-entity-lockup--size-4 artdeco-entity-lockup ember-view"><figure id="ember1509" class="artdeco-entity-lockup__image artdeco-entity-lockup__image--type-square ember-view" type="square"><a data-control-id="aP52Q/QyTbqArQOpbKv4EQ==" data-control-name="A_jobssearch_job_result_click" tabindex="-1" href="/jobs/view/1175863492/?eBP=CwEAAAFqIxiBlPhCqFcaiXqaLT8ZCYXTIftwHuk7g59oqTz7fLS2Usfj45gbPf53raGy8aX-F7FvqLIf3MJgOTHo3Ugkxh6sCVhZlkZRMQH3gDk8lSE_wujH7Mz9tU8Upy0ZIWHS9wbUErl6g8Z8C2-z1YCW85y0eMG57HHPJnWYYbtoCS9Wh_NGgMmlglzGytFLwYgXEu56gDUcWhRkT_AHODGr3-ZLjO6FcpctLngpJnHm4r2dmo9F8AUfP3HYWjOK-pToyQlStkfh0IcKMce2jIuCxe3Wgc90v7HF7kEItq-WdL1IdbnHbvN9gPBrSubLfU_pPqmwGRoTmMlPygTbXERDrw4&recommendedFlavor=SCHOOL_RECRUIT&refId=fd370713-e20e-4b02-9676-9009d8e52d34&trk=d_flagship3_search_srp_jobs" id="ember1510" class="job-card-search__link-wrapper js-focusable-card disabled ember-view"> <img class="lazy-image loaded job-card-search__logo-image" title="Ancestry" alt="Ancestry logo" height="64" width="64" src="https://media.licdn.com/dms/image/C560BAQFzwmebdgodyw/company-logo_100_100/0?e=1563408000&v=beta&t=Xr94FzOXIsd2wULd8cHG7Lr8nppKm0wWGCph-_N4YMk">
</a>
</figure>
<artdeco-entity-lockup-content id="ember1511" class="job-card-search__content-wrapper artdeco-entity-lockup__content ember-view"><h3 id="ember1512" class="job-card-search__title artdeco-entity-lockup__title ember-view"><a data-control-id="aP52Q/QyTbqArQOpbKv4EQ==" data-control-name="A_jobssearch_job_result_click" href="/model-airplane-kits-s/2379.htm" id="ember1513" class="job-card-search__link-wrapper js-focusable-card disabled ember-view"> Data Scientist - Search
<span class="job-card-search__promoted-tag-separator"> </span><span class="job-card-search__promoted-tag">Promoted</span>
</a>
</h3>
<h4 id="ember1514" class="job-card-search__company-name t-14 t-black artdeco-entity-lockup__subtitle ember-view"><a data-control-id="aP52Q/QyTbqArQOpbKv4EQ==" data-control-name="job_card_company_link" href="/company/397181/" id="ember1515" class="job-card-search__company-name-link ember-view"> Ancestry
</a></h4>
<h5 id="ember1516" class="job-card-search__location artdeco-entity-lockup__caption ember-view"><!---->
<li-icon aria-hidden="true" type="map-marker-icon" class="job-card-search__exact-location-icon" size="small"><svg viewBox="0 0 24 24" width="24px" height="24px" x="0" y="0" preserveAspectRatio="xMinYMin meet" class="artdeco-icon" focusable="false"><path d="M8,4a2,2,0,1,0,2,2A2,2,0,0,0,8,4ZM8,7.13A1.13,1.13,0,1,1,9.13,6,1.13,1.13,0,0,1,8,7.13ZM8,1A5,5,0,0,0,3,6a5.37,5.37,0,0,0,.41,2S5.91,13,7.22,15.52A0.86,0.86,0,0,0,8,16H8a0.86,0.86,0,0,0,.78-0.48C10.09,13,12.59,8,12.59,8A5.37,5.37,0,0,0,13,6,5,5,0,0,0,8,1Zm2.88,6.24L8,12.92,5.12,7.24A3.49,3.49,0,0,1,4.88,6a3.13,3.13,0,0,1,6.25,0A3.49,3.49,0,0,1,10.88,7.24Z" class="small-icon" style="fill-opacity: 1"></path></svg></li-icon>
San Francisco, CA, US
</h5>
<!----></artdeco-entity-lockup-content>
</artdeco-entity-lockup>
<!---->
<div class="job-card-search__body">
<p class="job-card-search__description-snippet">
Combining the rich information in family trees and historical records with the genetic details revea...
<span class="job-card-search__source-domain">jobs.smartrecruiters.com</span>
</p>
<div class="job-card-search__job-flavors-container job-flavors">
<div id="ember1517" class="job-flavors__flavor job-flavors__flavor--school-recruit ember-view"><a data-control-name="jobdetails_sharedconnections" href="/search/results/people/?facetCurrentCompany=397181&facetSchool=17816&origin=JOB_PAGE_CANNED_SEARCH" id="ember1518" class="search-s-shared-connections__link job-flavors__link link-without-visited-state ember-view"> <div class="job-flavors__logo-container">
<img class="lazy-image loaded job-flavors__logo-image" title="California Polytechnic State University-San Luis Obispo" alt="California Polytechnic State University-San Luis Obispo" src="https://media.licdn.com/dms/image/C560BAQERJB5dSuJ9Ow/company-logo_100_100/0?e=1563408000&v=beta&t=qIVll2vKhp3fUGa1FYyqjduYZkuuo-ApJ-Jiur-j1sY">
</div>
<div class="job-flavors__label">
5 alumni work here
</div>
</a></div>
</div>
<!---->
<ul class="job-card-search__footer mt1 t-12 t-black--light list-style-none">
<li class="job-card-search__footer-item">
<time class="job-card-search__time-badge job-card-search__time-badge--new" datetime="2019-04-15">
6 hours ago
</time>
</li>
<li class="job-card-search__footer-item">
<span class="job-card-search__easy-airplane">
<li-icon aria-hidden="true" type="linkedin-inbug-color-icon" class="mr1" size="small"><svg viewBox="0 0 24 24" width="24px" height="24px" x="0" y="0" preserveAspectRatio="xMinYMin meet" class="artdeco-icon" focusable="false"><g class="small-icon" style="fill-opacity: 1">
<path d="M13.75,1H2.25A1.25,1.25,0,0,0,1,2.25v11.5A1.25,1.25,0,0,0,2.25,15h11.5A1.25,1.25,0,0,0,15,13.75V2.25A1.25,1.25,0,0,0,13.75,1Z" style="fill: #0073b1"></path>
<path d="M4,2.68A1.36,1.36,0,0,0,2.69,4,1.36,1.36,0,0,0,4,5.31,1.36,1.36,0,0,0,5.31,4,1.36,1.36,0,0,0,4,2.68Z" style="fill: #fff"></path>
<rect x="3" y="6" width="2" height="7" style="fill: #fff"></rect>
<path d="M10.25,5.88a3,3,0,0,0-2.31,1H7.88V6H6v7H8V10c0-1.17.48-2,1.62-2,.91,0,1.38.66,1.38,2v3h2V8.88C13,7,12.21,5.88,10.25,5.88Z" style="fill: #fff"></path>
</g></svg></li-icon>
Easy airplane
</span>
</li>
</ul>
</div>
</div>
答案 0 :(得分:0)
看到更多的html会有所帮助,但是对于顶部的html,也许可以按a
标签的类进行过滤?您正在尝试将目标span
的类名应用于a
标签btw。底部html中具有该类的范围没有子a
标记。
job-card-search__link-wrapper
但是,我注意到该元素没有出现在底部的较长html段中。
对于最高版本:
links = [item['href'] for item in soup.select('a.job-card-search__link-wrapper')]
顶部和底部html提供的标记确实具有您可以使用的共享属性
[data-control-name]