Extracting URLs from a given website's menu with BeautifulSoup

Date: 2019-07-23 14:38:06

Tags: python beautifulsoup

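Hello, I'm new to BeautifulSoup and I'm trying to write a function that extracts the second-level URLs from a given website.

For example, given the website URL https://edition.cnn.com/, my function should return: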

https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel

First, I tried this code to retrieve all links that start with the URL string:

import re

import requests
from bs4 import BeautifulSoup


def getLinks(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
    links = []
    # Match hrefs that start with the given URL; re.escape keeps '.' and '/' literal.
    for link in soup.find_all('a', href=re.compile('^' + re.escape(url))):
        links.append(link.get('href'))
    return links

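However, the actual output gives me every link, including article links I'm not looking for. Is there a way to get only the links I want, using a regular expression or some other approach?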

1 answer:

Answer 0 (score: 1)

The links are inside the <nav> tag, so the CSS selector nav a[href] selects only the links inside <nav>:

import requests
from bs4 import BeautifulSoup

url = 'https://edition.cnn.com'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

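# 'nav a[href]' matches every <a> that has an href attribute inside a <nav> element.
for a in soup.select('nav a[href]'):
    # Skip hrefs with more than one '/' (deeper paths, absolute URLs) and in-page anchors.
    if a['href'].count('/') > 1 or '#' in a['href']:
        continue
    print(url + a['href'])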

Prints:

https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk

...and so on.
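
The output above contains repeats (for example, /world appears twice), and the question asked for a function rather than a script. A minimal sketch of one way to wrap the same nav a[href] selection in a function is shown here. The name get_menu_links and its parameters are hypothetical, not part of the original answer; it takes the HTML as a string so it can be tried without a network request, uses the built-in 'html.parser' instead of 'lxml', and assumes that a "second-level URL" means a same-host link with exactly one path segment.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def get_menu_links(html, base_url):
    """Return unique same-host, single-segment URLs linked from <nav> elements.

    Hypothetical helper: the name and the filtering rules are assumptions.
    """
    soup = BeautifulSoup(html, 'html.parser')
    base_host = urlparse(base_url).netloc
    links = []
    for a in soup.select('nav a[href]'):
        # Resolve relative hrefs such as '/world' against the base URL.
        parts = urlparse(urljoin(base_url, a['href']))
        # Ignore links to other hosts and in-page anchors.
        if parts.netloc != base_host or parts.fragment:
            continue
        # Keep only paths with exactly one segment, e.g. '/world' or '/world/'.
        segments = [s for s in parts.path.split('/') if s]
        if len(segments) != 1:
            continue
        clean = f"{parts.scheme}://{parts.netloc}/{segments[0]}"
        # Preserve the menu order while dropping repeats.
        if clean not in links:
            links.append(clean)
    return links
```

To use it against the live site, pass requests.get(url).text as the html argument and the same url as base_url. The result depends on the markup the site serves at the time, so it may differ from the list printed above.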