How do I get an absolute URL from an XPath result?

Asked: 2017-01-09 20:51:36

Tags: python xpath lxml

I'm using the following code to get the URL of an item:

node.xpath('//td/a[starts-with(text(),"itunes")]')[0].attrib['href']

It gives me something like this:

itunes20170107.tbz

However, I'd like to get the full URL, i.e.:

https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/itunes20170109.tbz

Is there a simple way to get the full URL from lxml without building it myself?

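2 answers:

Answer 0: (score: 2)

lxml.html only returns the href exactly as it is written in the HTML; it does not resolve it. To turn the relative link into an absolute one, use urljoin() with the URL of the page as the base: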

from urllib.parse import urljoin  # Python 3
# from urlparse import urljoin  # Python 2

# Base URL of the page being parsed; the trailing slash keeps "current" in the joined path
url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/"

relative_url = node.xpath('//td/a[starts-with(text(),"itunes")]')[0].attrib['href']
absolute_url = urljoin(url, relative_url)

Demo:

>>> from urllib.parse import urljoin  # Python 3
>>> 
>>> url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/"
>>> 
>>> relative_url = "itunes20170107.tbz"
>>> absolute_url = urljoin(url, relative_url)
>>> absolute_url
'https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/itunes20170107.tbz'
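
One detail of urljoin() is easy to miss: whether the base URL ends in a slash changes the result. The short sketch below reuses the base URL and file name from the question to show both cases; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# Base URL and file name taken from the question.
base = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current"
href = "itunes20170107.tbz"

# Without a trailing slash, the last segment ("current") is treated like a
# file name and is replaced by the relative reference.
without_slash = urljoin(base, href)

# With a trailing slash, "current/" is treated as a directory and is kept.
with_slash = urljoin(base + "/", href)

print(without_slash)
print(with_slash)
```

If the listing page is served from the "current" directory, the form with the trailing slash is most likely the one you want.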

Answer 1: (score: 2)

Another approach is to let lxml rewrite every link in the document as an absolute URL:

import requests
from lxml.html import fromstring

url = 'http://server.com'
response = requests.get(url)
# Parse the page, then rewrite every link in it relative to the page URL
etree = fromstring(response.text)
etree.make_links_absolute(url)
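
To tie this back to the question, here is a minimal sketch that runs without a network request. The inline HTML is a made-up stand-in for the real directory listing, and the base URL is assumed to be the "current" directory from the question. After make_links_absolute(), the same XPath expression returns the full URL.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the downloaded directory listing.
html = (
    "<html><body><table><tr><td>"
    '<a href="itunes20170107.tbz">itunes20170107.tbz</a>'
    "</td></tr></table></body></html>"
)

# Assumed URL of the page the HTML came from (note the trailing slash).
base_url = "https://feeds.itunes.apple.com/feeds/epf/v3/full/20170105/incremental/current/"

doc = fromstring(html)
# Rewrite every link in the parsed document relative to base_url.
doc.make_links_absolute(base_url)

# The XPath from the question now yields an absolute href.
href = doc.xpath('//td/a[starts-with(text(),"itunes")]')[0].attrib["href"]
print(href)
```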