Resolving relative and absolute links with Python

Time: 2014-12-24 03:53:59

Tags: python html python-3.x beautifulsoup html-parsing

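I am working on a project that downloads images, audio, video, and so on. On some websites, though, the links are not complete URLs, only relative paths, and I don't know how to resolve those relative links.

My full project is at: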

https://github.com/MuneebKalathil/MaD

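Here is a sample link; I want to download all the images from it. The page shows thumbnails, but I don't want those. Clicking a thumbnail leads to the page for the original image, and those are the images I want to download.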

http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx

Part of the page source is:

<tr>
<td id='pagingCell'>
</td>
</tr>
<tr>
<td align='center'><div id='galdiv' style='float:center;margin-right:3px;;margin-bottom:3px'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx' ><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal" title="Kajal Aggarwal at Dine with Stars Memu Saitham"></a>

So I first want to get the relative link address:

/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx

and then find its absolute path.

2 answers:

Answer 0: (score: 5)

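Define a base URL, find all img tags, and if the value of the src attribute does not start with http, use urljoin() (urllib.parse.urljoin() in Python 3, urlparse.urljoin() in Python 2) to join the base URL and the src.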

Example, using requests and BeautifulSoup:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.ragalahari.com'
url = 'http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx'

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Only consider img tags that actually have a src attribute.
for img in soup.find_all('img', src=True):
    src = img.get('src')
    # Relative paths are resolved against the base URL.
    if not src.startswith('http'):
        src = urljoin(base_url, src)

    print(src)

Prints:

http://icdn.raagalahari.com/images/ragalaharilogo.png
http://www.ragalahari.com/images/helpicon.png
http://www.ragalahari.com/images/rssicon.png
http://www.ragalahari.com/images/twittericon.png
http://www.ragalahari.com/images/facebookicon.png
http://www.ragalahari.com/images/searchicon.png
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham2t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham3t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham4t.jpg
...

Update: the a links can be handled the same way, reading the href attribute instead of src.
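For the a links the question asks about, a minimal sketch might look like the following. It is not the answerer's code. It assumes the HTML fragment quoted in the question (with the img attributes trimmed), and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The names LinkCollector, PAGE_URL and SAMPLE_HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# URL of the gallery page, taken from the question.
PAGE_URL = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')

# Fragment of the page source quoted in the question (img attributes trimmed).
SAMPLE_HTML = """
<td align='center'><div id='galdiv'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx' ><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal"></a>
</div></td>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every a tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


collector = LinkCollector(PAGE_URL)
collector.feed(SAMPLE_HTML)
for link in collector.links:
    print(link)
```

Against a live page, the string passed to feed() would be the decoded response body rather than SAMPLE_HTML.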

Answer 1: (score: 1)

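Use urllib.parse.urljoin. Pass the URL of the page as the first argument. As the second argument, pass the href or any other URL that may be relative. It handles both absolute and relative URLs correctly, resolving them to a final absolute URL.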
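As a small illustration of that behaviour, the sketch below joins the page URL from the question with three kinds of reference. The first two come from the question's HTML; the third, image2.aspx, is a made-up document-relative path included only to show why the page URL, rather than the site root, is the safer base.

```python
from urllib.parse import urljoin

page_url = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')

# Root-relative href from the question: resolved against the host.
root_relative = urljoin(
    page_url,
    '/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx')

# Already absolute src from the question: returned unchanged.
already_absolute = urljoin(
    page_url,
    'http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg')

# Hypothetical document-relative path: replaces the last path segment of the page URL.
document_relative = urljoin(page_url, 'image2.aspx')

print(root_relative)
print(already_absolute)
print(document_relative)  # http://www.ragalahari.com/actress/14035/image2.aspx
```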

If you are still using Python 2, urljoin is in the urlparse module instead.
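If the same script has to run on both versions, one common pattern is a guarded import. This is only a sketch; the URLs in the check at the end are shortened, made-up examples.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# The function behaves the same way under either import.
resolved = urljoin('http://www.ragalahari.com/a/b.aspx', '/c.aspx')
print(resolved)  # http://www.ragalahari.com/c.aspx
```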