How do I prepend the main domain to relative links from an HTML page so that the links work?

Asked: 2018-02-02 14:25:33

Tags: python html python-3.x parsing

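This is my code. It returns a list of specific news links from an HTML page, but each link contains only the resource path and parameters. I want to include the main domain so that the links are usable.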

import requests
from bs4 import BeautifulSoup


def get_cric_info_articles():

    cricinfo_article_link = "http://www.espncricinfo.com/ci/content/story/news.html"

    r = requests.get(cricinfo_article_link)
    cricinfo_article_html = r.text

    soup = BeautifulSoup(cricinfo_article_html, "html.parser")
    # print(soup.prettify())

    cric_info_items = soup.find_all("h2",
                                    {"class": "story-title"})
    cricinfo_article_dict = {}

    for div in cric_info_items:
        cricinfo_article_dict[div.find('a').string] = div.find('a')['href']

    return cricinfo_article_dict


print(get_cric_info_articles())

What I get:

{'Bell-Drummond leads MCC in curtain-raiser': '/ci/content/story/1135157.html', 'Scotland pick Brad Wheal, Chris Sole for World Cup qualifiers': '/scotland/content/story/1135152.html', 'Newlands working to be water independent': '/southafrica/content/story/1135120.html'}

I am trying to append '/ci/content/story/1135157.html' to http://www.espncricinfo.com/ so that the final link is http://www.espncricinfo.com/ci/content/story/1135157.html. How can I do that? Sorry for the long post.

The change I made:

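from urllib.parse import urljoin  # required for urljoin below

for div in cric_info_items:
    link = div.find('a')
    base = 'http://www.espncricinfo.com/'
    # resolve the relative href against the site root
    cricinfo_article_dict[link.string] = urljoin(base, link['href'])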

2 Answers:

Answer 0: (score: 1)

You can use the urllib.parse module:

from urllib.parse import urljoin
urljoin('http://www.espncricinfo.com/', '/ci/content/story/1135157.html')

Hope it helps.
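
As a minimal sketch of how this could be applied to the whole dictionary from the question: the helper name absolutize and the constant BASE_URL are hypothetical, introduced only for illustration, and the sample entries are copied from the output shown in the question.

```python
from urllib.parse import urljoin

# Site root taken from the question; the constant name is illustrative.
BASE_URL = "http://www.espncricinfo.com/"


def absolutize(links, base=BASE_URL):
    """Return a copy of `links` with every href resolved against `base`."""
    return {title: urljoin(base, href) for title, href in links.items()}


# Two entries copied from the output shown in the question.
sample = {
    "Bell-Drummond leads MCC in curtain-raiser": "/ci/content/story/1135157.html",
    "Newlands working to be water independent": "/southafrica/content/story/1135120.html",
}

absolute = absolutize(sample)
for title, url in absolute.items():
    print(title, "->", url)
```

Because the hrefs start with a slash, urljoin replaces the path of the base, so passing the full page URL (news.html) as the base should give the same result as passing the site root.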

Answer 1: (score: 1)

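...
# if protocol is not specified in the link, assume it's relative
# NOTE: plain concatenation only gives a valid URL if cricinfo_article_link
# holds just the site root (e.g. "http://www.espncricinfo.com"),
# not the full page URL used in the question
for div in cric_info_items:
    url = div.find('a')['href']
    if "://" not in url:
        url = cricinfo_article_link + url
    cricinfo_article_dict[div.find('a').string] = url
...

Or, using a dict comprehension:

return {
    div.find('a').string:
        ("" if "://" in div.find('a')['href'] else cricinfo_article_link) + div.find('a')['href']
    for div in soup.find_all("h2", {"class": "story-title"})
}

Update: a potential edge case is links that start with //, i.e. protocol-relative links. This type of link is sometimes used for resources (CSS and JavaScript) and is not commonly used for external links. However, you may want to handle it as well.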
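
One possible way to cover the // case is to let urljoin do the resolution instead of testing for "://" by hand. The sketch below is illustrative only: the function name resolve, the constant PAGE_URL, and the hosts cdn.example.com and example.org are assumptions, not part of the original code.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
PAGE_URL = "http://www.espncricinfo.com/ci/content/story/news.html"


def resolve(href, page_url=PAGE_URL):
    """Resolve an href found on `page_url` into an absolute URL."""
    return urljoin(page_url, href)


# The example.com / example.org hosts are made up for illustration.
cases = {
    "root-relative": "/ci/content/story/1135157.html",
    "protocol-relative": "//cdn.example.com/site.css",
    "absolute": "https://example.org/page.html",
}

resolved = {kind: resolve(href) for kind, href in cases.items()}
for kind, url in resolved.items():
    print(kind, "->", url)

# A plain '"://" in href' test classifies the protocol-relative link as
# relative, so concatenating a base onto it would produce a broken URL.
naive_is_relative = "://" not in cases["protocol-relative"]
print("naive check says relative:", naive_is_relative)
```

With this approach a protocol-relative link inherits the scheme of the page it was found on, and an already absolute link is returned unchanged.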