Question

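I'm working on a web scraping project and have a list of URLs. Some of the URLs are relative, and I need to prepend the root URL ('https://www.census.gov') to any returned value that starts with '/'. Here is my for loop: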

links = soup.find_all('a', href=True)
records = []
for results in links:
    url = results['href']
    records.append(url)

I think I have the start of the if statement:

if url.startswith('/'):

but I'm not sure how to complete it. Any tips or guidance appreciated!

Thanks, Garrett

Answer 1

Rather than rolling your own, try urljoin from the standard library. It handles both relative and absolute URLs.

>>> from urllib.parse import urljoin
>>> base = 'https://www.census.gov/'
>>> relative = '/here/is/some/path'
>>> urljoin(base, relative)
'https://www.census.gov/here/is/some/path'
>>> not_relative = 'http://www.census.gov/here/is/another/path'
>>> urljoin(base, not_relative)
'http://www.census.gov/here/is/another/path'

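Note that urljoin returns an absolute URL unchanged, even when it points to a different domain, so such links are not rewritten. If you prefer to join only the root-relative links explicitly, you can do this: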

if url.startswith('/'):
    url = urljoin(base, url)
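
Applied to the loop from the question, this could look roughly like the sketch below. The helper name absolute_links and the sample href values are made up for illustration; in the real loop the values would come from results['href'].

```python
from urllib.parse import urljoin

BASE = 'https://www.census.gov/'

def absolute_links(hrefs, base=BASE):
    """Resolve every href against base; absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical href values standing in for results['href'] in the loop.
hrefs = [
    '/data.html',
    'https://www.census.gov/about.html',
    'https://example.com/other',
]
records = absolute_links(hrefs)
print(records)
```

Because urljoin leaves absolute URLs alone, no startswith check is needed in this version.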

Answer 2

If I understand correctly, you could try something like this:

import requests
from bs4 import BeautifulSoup

ROOT_URL = 'https://www.census.gov'

def scrape():
    r = requests.get(ROOT_URL)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(r.text, 'html.parser')
    links = soup.find_all('a', href=True)
    records = []
    for results in links:
        url = results['href']
        print('URL: ', url)
        if url.startswith('#'):
            # Skip in-page anchors.
            continue
        elif url.startswith('/'):
            # Root-relative link: prepend the site root.
            url = ROOT_URL + url
            print('PROPER URL: ', url)
        # Absolute links are kept as they are.
        records.append(url)
    return records

if __name__ == '__main__':
    scrape()

It prepends ROOT_URL to all relative links.
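
One thing to watch for with a plain startswith('/') check is that protocol-relative links such as '//host/path' also start with a slash, and relative paths without a leading slash are not touched at all. The sketch below compares string concatenation with urljoin on a few made-up href values (cdn.example.com is a placeholder host, and concat is a hypothetical helper mirroring the check above).

```python
from urllib.parse import urljoin

ROOT_URL = 'https://www.census.gov'

def concat(href):
    # Plain concatenation, mirroring the startswith('/') check.
    return ROOT_URL + href if href.startswith('/') else href

def joined(href):
    # urljoin resolves the href against the site root instead.
    return urljoin(ROOT_URL + '/', href)

# Made-up sample values: root-relative, protocol-relative, bare relative path.
samples = ['/topics', '//cdn.example.com/lib.js', 'about.html']
for href in samples:
    print(repr(href), '->', concat(href), '|', joined(href))
```

For the first sample both approaches agree; for the other two only urljoin produces a usable absolute URL.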

Answer 3

You can use urlparse together with the _replace method of its result. This works for both of your cases. Note that netloc holds only the host name, so the scheme has to be set separately.

>>> from urllib.parse import urlparse

>>> base = urlparse('https://www.census.gov')
>>> urlparse('https://www.census.gov/path/to/text')._replace(scheme=base.scheme, netloc=base.netloc)

This gives you the following result:

ParseResult(scheme='https', netloc='www.census.gov', path='/path/to/text', params='', query='', fragment='')

To handle a URL that lacks the base, use the same approach:

>>> urlparse('/path/to/text')._replace(scheme=base.scheme, netloc=base.netloc)

ParseResult(scheme='https', netloc='www.census.gov', path='/path/to/text', params='', query='', fragment='')

To get the combined URL as a string, call geturl():

>>> url_comp = urlparse('/path/to/text')._replace(scheme=base.scheme, netloc=base.netloc)

>>> url_comp.geturl()
'https://www.census.gov/path/to/text'
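
One caveat with _replace is that it also overwrites the host of links that point to other sites. A minimal sketch that fills in the base only when the URL has no scheme or host of its own might look like this (fill_in_base is a hypothetical helper name, and example.com is a placeholder):

```python
from urllib.parse import urlparse

def fill_in_base(url, base='https://www.census.gov'):
    """Add the base's scheme and host only when url lacks its own."""
    parts = urlparse(url)
    if parts.scheme or parts.netloc:
        # Already absolute (or mailto:, protocol-relative, ...): leave it alone.
        return url
    root = urlparse(base)
    return parts._replace(scheme=root.scheme, netloc=root.netloc).geturl()

print(fill_in_base('/path/to/text'))
print(fill_in_base('https://example.com/elsewhere'))
```

The first call returns the census.gov URL, while the second returns its input unchanged.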

Answer 4

Got it with this:

# Only root-relative links need the prefix; other URLs stay as they are.
if url.startswith('/'):
    url = 'https://www.census.gov' + url

Thanks, Garrett

Python: adding the root URL to relative links
