Question

My code, for reference:

import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")

# Collect the href of every <a> tag that has one.
urls = []
for tag in soup.find_all('a', href=True):
    urls.append(tag['href'])

# Fetch each collected link.
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except Exception:
        # Skip any URL that fails to fetch.
        pass

My idea is to fetch the page's payload and then scrape the hyperlinks from it. One of the links is yahoo.com, and the other is 'http://csb.stanford.edu/class/public/index.html'.

However, the result I get from BeautifulSoup is:

>>> urls
['http://www.yahoo.com/', '../../index.html']

This is a problem, because the second part of the script cannot run on the second, shortened URL. Is there a way to make BeautifulSoup retrieve the full URL?

Answer 1

That is because the link on the page really is written in that form. The HTML in the page is:

<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>

This is called a relative link.

To convert it to an absolute link, you can use urljoin from the standard library.

from urllib.parse import urljoin  # Python3

urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
        '../../index.html')
# returns http://csb.stanford.edu/class/public/index.html
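
To apply this to every link scraped in the question, one option is to resolve each href against the URL of the page it came from. The following is only a minimal sketch: the names base_url, hrefs and to_absolute are made up for illustration, and the href list is hard-coded from the question's output so the example runs without a network connection.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (taken from the question).
base_url = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'

# The href values exactly as BeautifulSoup returned them in the question.
hrefs = ['http://www.yahoo.com/', '../../index.html']


def to_absolute(base, links):
    """Hypothetical helper: resolve each link against the page it came from."""
    # urljoin leaves links that are already absolute unchanged.
    return [urljoin(base, link) for link in links]


resolved = to_absolute(base_url, hrefs)
print(resolved)
```

In the question's loop, this would amount to appending urljoin(base_url, tag['href']) instead of tag['href'], so that the later h.request(url) calls only ever see absolute URLs.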
