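This is my code. It builds a list of specific news links from an HTML page, but each link contains only the resource path and parameters. I want to include the main domain so that the links are usable.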
import requests
from bs4 import BeautifulSoup

def get_cric_info_articles():
    cricinfo_article_link = "http://www.espncricinfo.com/ci/content/story/news.html"
    r = requests.get(cricinfo_article_link)
    cricinfo_article_html = r.text
    soup = BeautifulSoup(cricinfo_article_html, "html.parser")
    # print(soup.prettify())
    cric_info_items = soup.find_all("h2",
                                    {"class": "story-title"})
    cricinfo_article_dict = {}
    for div in cric_info_items:
        cricinfo_article_dict[div.find('a').string] = div.find('a')['href']
    return cricinfo_article_dict

print(get_cric_info_articles())
What I get:
{'Bell-Drummond leads MCC in curtain-raiser': '/ci/content/story/1135157.html', 'Scotland pick Brad Wheal, Chris Sole for World Cup qualifiers': '/scotland/content/story/1135152.html', 'Newlands working to be water independent': '/southafrica/content/story/1135120.html'}
I am trying to append '/ci/content/story/1135157.html' to http://www.espncricinfo.com/ so that the final link is http://www.espncricinfo.com/ci/content/story/1135157.html. How can I do that? Sorry for the long post.
The change I made:
from urllib.parse import urljoin  # needed at the top of the module

for div in cric_info_items:
    a = div.find('a')['href']
    b = 'http://www.espncricinfo.com/'
    c = urljoin(b, a)
    cricinfo_article_dict[div.find('a').string] = c
Answer 0 (score: 1)
You can use the urllib.parse module:
from urllib.parse import urljoin
urljoin('http://www.espncricinfo.com/', '/ci/content/story/1135157.html')
Hope it helps.
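As a minimal sketch of how this could be applied to the whole dictionary rather than a single link: the helper name `absolutize_links` and the constant `BASE_URL` are hypothetical, and the sample data is copied from the output shown in the question so that the snippet runs without a network request.

```python
from urllib.parse import urljoin

# Site root taken from the question; an assumption for any other site.
BASE_URL = "http://www.espncricinfo.com/"


def absolutize_links(links, base=BASE_URL):
    """Return a copy of `links` with every href resolved against `base`."""
    return {title: urljoin(base, href) for title, href in links.items()}


# Sample data copied from the output shown in the question.
scraped = {
    "Bell-Drummond leads MCC in curtain-raiser": "/ci/content/story/1135157.html",
    "Newlands working to be water independent": "/southafrica/content/story/1135120.html",
}

for title, url in absolutize_links(scraped).items():
    print(title, "->", url)
```

Because `urljoin` resolves the path against the base instead of concatenating strings, the leading slash in the href and the trailing slash in the base do not produce a doubled `//`.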
Answer 1 (score: 1)
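...
# if protocol is not specified in the link, assume it's relative
# (note: for plain concatenation to give a valid link, cricinfo_article_link
# must hold the site root, e.g. "http://www.espncricinfo.com", not the full news.html URL)
for div in cric_info_items:
    url = div.find('a')['href']
    if "://" not in url:
        url = cricinfo_article_link + url
    cricinfo_article_dict[div.find('a').string] = url
...
Or alternatively, using a dict comprehension:
return {
    div.find('a').string: ("" if "://" in div.find('a')['href'] else cricinfo_article_link) + div.find('a')['href']
    for div in soup.find_all("h2", {"class": "story-title"})
}
Update: a potential edge case is links that start with //, for example. This type of link is sometimes used for resources (CSS and JavaScript) and is not commonly used for external links. However, you may want to handle it as well.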
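One possible way to cover the `//` case without a hand-written check is to let `urljoin` resolve each href against the page URL, roughly as a browser would. This is only a sketch: the helper name `resolve_href` is hypothetical, and the `example.com` / `example.org` hosts are placeholders, not links from the actual page.

```python
from urllib.parse import urljoin

# Page URL from the question, used as the base for resolving hrefs.
PAGE_URL = "http://www.espncricinfo.com/ci/content/story/news.html"


def resolve_href(href, page_url=PAGE_URL):
    """Resolve an href as if it appeared in a link on `page_url`."""
    return urljoin(page_url, href)


examples = [
    "/ci/content/story/1135157.html",    # root-relative: keeps scheme and host
    "1135157.html",                      # relative to the page's directory
    "//static.example.com/app.js",       # protocol-relative: inherits "http:"
    "https://example.org/article.html",  # already absolute: left unchanged
]

for href in examples:
    print(href, "->", resolve_href(href))
```

With this approach the `"://" not in url` test is no longer needed, since an already absolute URL passes through `urljoin` unchanged.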