Question

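I can't figure out how to get the full address of a link: I get, for example, "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I can't simply prepend the page URL to the link, because that gives "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. I want this to work for any website, so I'm looking for a general solution.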

Here is the code:

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])

Here is part of what it returns:

>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0

Answer 1

Maybe something like this will work for you:

for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('//'):
        # Protocol-relative link: only the scheme is missing
        print("Found the URL:", 'https:' + href)
    elif href.startswith('/'):
        # Root-relative link: prepend the site's base URL
        print("Found the URL:", 'https://en.wikipedia.org' + href)
    else:
        print("Found the URL:", href)

Answer 2

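When you read a link from an element's href attribute, you will almost always get a relative link such as /wiki/Main_Page.

Because the base URL is always the same, 'https://en.wikipedia.org', all you need to do is: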

from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org'
search_url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(search_url)
data = r.content
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
    # Skip empty and bare-fragment links before prepending the base URL
    if link['href'] != '#' and link['href'].strip() != '':
        final_url = base_url + link['href']
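
The hard-coded base_url above only works for Wikipedia. One way to make it site-independent, offered here as an assumption and not as part of the original answer, is to derive the scheme and host from the page URL with urllib.parse.urlsplit. The helper name base_of is hypothetical, and the sketch still only handles hrefs that begin with a single slash.

```python
from urllib.parse import urlsplit


def base_of(page_url):
    """Return the scheme and host of page_url, e.g. 'https://example.com'."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


search_url = "https://en.wikipedia.org/wiki/WKIK"
base_url = base_of(search_url)

# Same concatenation as above, but without a site-specific constant
final_url = base_url + "/wiki/Main_Page"
print("Found the URL:", final_url)
```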

Answer 3

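The other answers here may run into problems with certain relative URLs, such as ones containing dots (../page).

Python's requests library exposes a function called urljoin (re-exported in requests.compat from the standard library's urllib.parse) that builds the full URL: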

requests.compat.urljoin(currentPage, link)

So if you are on https://en.wikipedia.org/wiki/WKIK and the page contains a link whose href is /wiki/Main_Page, the function returns https://en.wikipedia.org/wiki/Main_Page.
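
A minimal sketch of this approach, using urllib.parse.urljoin from the standard library (the function that requests.compat.urljoin refers to on Python 3). The helper name resolve_links and the sample hrefs are made up for illustration. In the question's code the hrefs would come from link['href'] inside the soup.find_all loop.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://en.wikipedia.org/wiki/WKIK"
hrefs = [
    "/wiki/Main_Page",                 # root-relative
    "//en.wikipedia.org/w/index.php",  # protocol-relative
    "../page",                         # parent-relative
    "https://example.com/other",       # already absolute
]

for full_url in resolve_links(page_url, hrefs):
    print("Found the URL:", full_url)
```

Because urljoin resolves every href against the page it came from, the same loop should work on any site without a hard-coded base URL.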

How to get the full URL using BeautifulSoup
