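I'm trying to get the link from the "a" tag inside the h2 tag, but the problem is that there are two "a" tags, each inside a separate parent tag.
The page I'm looking at is: https://emerging-europe.com/tag/poland/
Here is the code I have so far.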
from bs4 import BeautifulSoup
import requests
url='https://emerging-europe.com/tag/poland/'
response=requests.get(url)
soup=BeautifulSoup(response.content,'lxml')
for item in soup.select('.col-lg-6'):
    try:
        headline = item.find('h2', {'class':'entry-title'}).get_text()
        link = item.find('h2', {'class':'entry-title'})['href']
    except:
        continue
The HTML I'm referring to is below.
<div class="col-lg-6 col-md-6 col-sm-7">
<div class="entry-header">
<span class="meta-category"><a href="https://emerging-europe.com/category/news/" class="herald-cat-210">News & Analysis</a></span>
<h2 class="entry-title h3"><a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/">Montenegro leads CEE on ILGA-Europe’s new Rainbow Map</a></h2>
<div class="entry-meta"><div class="meta-item herald-date"><span class="updated">May 17, 2021</span></div><div class="meta-item herald-author"><span class="vcard author"><span class="fn"><a href="https://emerging-europe.com/author/marekgrzegorczyk/">Marek Grzegorczyk</a></span></span></div></div>
</div>
<div class="entry-content">
<p>Montenegro is Central and Eastern Europe’s best performer on the latest edition of the ILGA-Europe Rainbow Europe Map and Index, which monitors LGBTI rights across...</p>
</div>
<a class="herald-read-more" href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" title="Montenegro leads CEE on ILGA-Europe’s new Rainbow Map">Read More</a>
</div>
I want to get the "https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/" link, but what I get is the "https://emerging-europe.com/category/news/" one. How do I reference the second one?
Thanks for your help!
Answer 0 (score: 1)
Try this to get all the article URLs:
import requests
from bs4 import BeautifulSoup
url = "https://emerging-europe.com/tag/poland/"
css = ".entry-header .entry-title, .entry-header .entry-title a, .post-author-list .categoriesarticle .title a"
soup = BeautifulSoup(requests.get(url).text, "lxml").select(css)
article_links = [a.find("a")["href"] for a in soup if a.find("a") is not None]
print("\n".join(article_links))
Output:
https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
https://emerging-europe.com/news/polish-government-shifts-left-on-economy/
https://emerging-europe.com/news/georgias-modern-parliament-building-faces-uncertain-future-elsewhere-in-emerging-europe/
https://emerging-europe.com/after-hours/mixed-feelings-as-libeskind-reimagines-lodz/
https://emerging-europe.com/news/hungarys-united-opposition-emerging-europe-this-week/
https://emerging-europe.com/business/small-local-market-think-international-from-the-start/
https://emerging-europe.com/business/new-esg-guidelines-can-strengthen-polish-capital-market/
https://emerging-europe.com/news/why-is-the-left-propping-up-polands-right-wing-government/
https://emerging-europe.com/news/cee-should-redouble-efforts-to-end-violence-against-women/
https://emerging-europe.com/after-hours/a-century-on-the-silesian-uprisings-remains-complicated/
https://emerging-europe.com/voices/the-zangezur-corridor-is-a-geo-economic-revolution/
https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/
https://emerging-europe.com/business/made-in-emerging-europe-vinted-up-catalyst-propergate/
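To stay closer to the loop in the question, one option is to scope the lookup to the h2 itself: the href lives on the "a" nested inside h2.entry-title, not on the h2, and a selector such as "h2.entry-title a[href]" skips the category link in span.meta-category. The sketch below runs against a trimmed copy of the markup quoted in the question, repeated twice on purpose to mimic the repeated URLs in the output above. The function name extract_articles, the HTML constant, and the use of the built-in "html.parser" are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the card quoted in the question (hypothetical test fixture).
CARD = """
<div class="col-lg-6 col-md-6 col-sm-7">
  <div class="entry-header">
    <span class="meta-category"><a href="https://emerging-europe.com/category/news/" class="herald-cat-210">News &amp; Analysis</a></span>
    <h2 class="entry-title h3"><a href="https://emerging-europe.com/news/montenegro-leads-cee-in-ilga-europes-new-rainbow-map/">Montenegro leads CEE on ILGA-Europe's new Rainbow Map</a></h2>
  </div>
</div>
"""

# The same card twice, to show that repeated links are dropped.
HTML = CARD + CARD


def extract_articles(html):
    """Return unique (headline, link) pairs from the <a> inside each h2.entry-title."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    articles = []
    for item in soup.select(".col-lg-6"):
        # Only match the anchor nested in the headline, not the category link.
        anchor = item.select_one("h2.entry-title a[href]")
        if anchor is None:
            continue
        link = anchor["href"]
        if link in seen:
            continue  # skip links already collected
        seen.add(link)
        articles.append((anchor.get_text(strip=True), link))
    return articles


articles = extract_articles(HTML)
for headline, link in articles:
    print(headline, link)
```

To run it on the live page, pass requests.get(url).text to extract_articles instead of the HTML constant; whether every card on that page sits inside a .col-lg-6 container has not been checked here.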