Unable to create a proper selector to scrape certain specific links

Date: 2018-01-14 20:53:55

Tags: python python-3.x web-scraping beautifulsoup css-selectors

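I have written a Python script using BeautifulSoup to find some specific URLs in the left sidebar of a web page, under the section titled VIDEOS BY YEAR. The problem is that I can only parse those specific URLs if I use a hardcoded number in my script, as shown below. My goal, however, is to get exactly those URLs without any hardcoded number. In fact, I am after any CSS selector that does the same thing. I hope someone can lend a hand with this.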

This is what I have tried so far:

import requests
from bs4 import BeautifulSoup

URL = "https://www.wiseowl.co.uk/videos/"
response = requests.get(URL)
soup = BeautifulSoup(response.text,"html5lib")
for item in soup.select(".woMenuList .woMenuItem a")[-7:]:
    print(item['href'])

It produces the following output:

/videos/year/2011.htm
/videos/year/2012.htm
/videos/year/2013.htm
/videos/year/2014.htm
/videos/year/2015.htm
/videos/year/2016.htm
/videos/year/2017.htm

The HTML elements that contain the URLs:

<ul class="woMenuList">

    <li class="woMenuItem"><a href="/videos/year/2011.htm">2011 (19)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2012.htm">2012 (45)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2013.htm">2013 (29)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2014.htm">2014 (62)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2015.htm">2015 (25)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2016.htm">2016 (46)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2017.htm">2017 (24)</a></li>

</ul>

By the way, all the categories and links share the same classes and tag types, which is why I am stuck. Thanks in advance for taking a look.

1 answer:

Answer 0: (score: 1)

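You can use the *= operator (the CSS attribute substring selector) to select only the links whose href contains the string '/videos/year'.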

import requests
from bs4 import BeautifulSoup

URL = "https://www.wiseowl.co.uk/videos/"
response = requests.get(URL)
soup = BeautifulSoup(response.text,"html5lib")
for item in soup.select(".woMenuList .woMenuItem a[href*='/videos/year']"):
    print(item['href'])
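
As a quick offline check of the selector, here is a minimal sketch that runs against a trimmed copy of the markup from the question instead of the live page. The first menu item (/videos/other/example.htm) is a made-up link added only for illustration, and the built-in html.parser is used in place of html5lib so that no extra parser is needed. It also tries ^=, which should match only hrefs that start with the given prefix and may be a slightly stricter alternative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the menu markup from the question, plus one made-up
# non-year link (hypothetical) to show that it gets filtered out.
HTML = """
<ul class="woMenuList">
    <li class="woMenuItem"><a href="/videos/other/example.htm">Example (hypothetical)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2011.htm">2011 (19)</a></li>
    <li class="woMenuItem"><a href="/videos/year/2012.htm">2012 (45)</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# *= keeps links whose href contains the substring anywhere.
contains = [a["href"] for a in soup.select(".woMenuItem a[href*='/videos/year']")]

# ^= keeps links whose href starts with the given prefix.
starts_with = [a["href"] for a in soup.select(".woMenuItem a[href^='/videos/year/']")]

print(contains)
print(starts_with)
```

Neither selector depends on how many year links the page has, so no slice such as [-7:] is needed.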