Question

I am using Python 2 and Beautiful Soup to parse HTML retrieved with the requests module:

import requests
from bs4 import BeautifulSoup

site = requests.get("http://www.stackoverflow.com/")
HTML = site.text
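# Name the parser explicitly so the result does not depend on which parsers are installed
links = BeautifulSoup(HTML, "html.parser").find_all('a')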

This returns a list whose items look like <a href="hereorthere.com">Navigate</a>.

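The content of each anchor tag's href attribute can take several forms: it could be a JavaScript call on the page, a relative address to a page on the same domain, or a specific web address (http://www.stackoverflow.com/).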

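Using BeautifulSoup, is it possible to return both the relative addresses and the specific web addresses in one list, excluding all JavaScript calls and the like, so that only navigable links remain?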

Answer 1

From the BS docs:

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
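
The values printed this way are raw attribute strings (or None for an anchor with no href), so relative addresses still have to be resolved against the page they came from. A minimal sketch of that step using the standard library's urljoin follows; the raw_hrefs list is a hypothetical stand-in for the link.get('href') results, and base_url is assumed to be the URL that was requested.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Assumed to be the page the links were scraped from
base_url = "http://www.stackoverflow.com/"

# Hypothetical href values standing in for link.get('href') results
raw_hrefs = ["/questions", "tags?tab=name", "http://www.stackoverflow.com/"]

# urljoin resolves relative addresses and leaves absolute ones unchanged
absolute = [urljoin(base_url, href) for href in raw_hrefs]

for url in absolute:
    print(url)
```

An href that already carries its own scheme passes through urljoin untouched, so the same call is safe for both kinds of address.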

Answer 2

You can filter out cases like href="javascript:whatever()" like this:

hrefs = []
for link in soup.find_all('a'):
    # has_attr replaces the deprecated has_key in Beautiful Soup 4
    if link.has_attr('href') and not link['href'].lower().startswith('javascript:'):
        hrefs.append(link['href'])
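
The prefix check above only excludes javascript: links, so other non-navigable forms such as mailto: would still get through. One possible variation, sketched here under assumptions, is to parse each href and keep only an allow-list of schemes. The helper name navigable_hrefs, the ALLOWED_SCHEMES tuple, the choice to skip fragment-only links, and the sample HTML are all hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Assumption: '' (no scheme) marks a relative address; http/https are navigable
ALLOWED_SCHEMES = ('', 'http', 'https')


def navigable_hrefs(html):
    """Hypothetical helper: return hrefs that are relative or http(s) addresses."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    # href=True matches only anchors that actually have an href attribute
    for link in soup.find_all('a', href=True):
        href = link['href'].strip()
        # Assumption: empty and fragment-only links ("#top") are not wanted
        if not href or href.startswith('#'):
            continue
        if urlparse(href).scheme in ALLOWED_SCHEMES:
            hrefs.append(href)
    return hrefs


# Hypothetical sample markup covering the forms described in the question
sample = (
    '<a href="/questions">relative</a>'
    '<a href="javascript:void(0)">script</a>'
    '<a href="mailto:someone@example.com">mail</a>'
    '<a href="http://www.stackoverflow.com/">absolute</a>'
    '<a name="anchor-only">no href</a>'
    '<a href="#top">fragment</a>'
)
result = navigable_hrefs(sample)
print(result)
```

An allow-list fails closed: a scheme nobody anticipated is dropped rather than kept, which is usually the safer default when the goal is a list of links to follow.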

How to return destinations from HTML anchor tags using Beautiful Soup

2 answers: