How do I extract a URL from an HTML anchor element using Python 3?

Time: 2014-08-04 14:32:21

Tags: python regex python-3.x python-3.2

I want to extract a URL from a web page's HTML source. For example:

xyz.com source code:
<a rel="nofollow" href="example/hello/get/9f676bac2bb3.zip">Download XYZ</a>

I want to extract:

example/hello/get/9f676bac2bb3.zip

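How can I extract this URL?

I don't know regular expressions. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I got errors when I tried to install these libraries.

Here is what I tried: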

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

But that is just a hard-coded HTML sample. How do I fetch a web page's source and run my code against it?

1 Answer:

Answer 0: (score: 3)

You can use the built-in xml.etree.ElementTree instead:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

This works for this particular example, but xml.etree.ElementTree is not an HTML parser. Consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Or lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'
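
The caveat about xml.etree.ElementTree is easy to see with markup that is ordinary HTML but not well-formed XML. The sketch below is only an illustration: the helper `parses_as_xml` and the `typical_html` sample are made up for this example and do not come from the question.

```python
import xml.etree.ElementTree as ET

# The snippet from the question happens to be well-formed XML.
well_formed = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'

# Hypothetical sample: ordinary HTML with a <br> element that is never closed.
typical_html = '<p>Files:<br><a href="/example/hello/get/9f676bac2bb3.zip">XYZ</a></p>'


def parses_as_xml(markup):
    """Return True if ElementTree can parse the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(parses_as_xml(well_formed))   # True
print(parses_as_xml(typical_html))  # False: the unclosed <br> is a mismatched tag in XML
```

An HTML parser such as BeautifulSoup or lxml.html is built to tolerate this kind of input, which is why it tends to be the safer choice for real pages.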

Personally, I prefer BeautifulSoup: it makes HTML parsing simple, transparent, and fun.


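To follow the link and download the file, you need to build the full URL, including the scheme and domain (urljoin() will help), and then use urlretrieve(). For example: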

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
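
Called with only a URL, urlretrieve() saves the file to a temporary location, so it usually helps to pass a local file name as well. The following is a rough sketch under a few assumptions: `build_download_target` and `download` are hypothetical helper names, the page URL `http://example.com/downloads/index.html` is invented for illustration, and the `'download'` fallback name is an arbitrary choice. It also shows how urljoin() treats an href with and without a leading slash, since the question contains both forms.

```python
import posixpath
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


def build_download_target(page_url, href):
    """Return (absolute_url, local_filename) for a link found on page_url."""
    absolute_url = urljoin(page_url, href)
    # Use the last path segment as the local file name; fall back to a
    # placeholder name (an assumption) if the path ends with a slash.
    filename = posixpath.basename(urlparse(absolute_url).path) or 'download'
    return absolute_url, filename


def download(page_url, href):
    """Download the linked file into the current directory (needs network access)."""
    absolute_url, filename = build_download_target(page_url, href)
    saved_path, _headers = urlretrieve(absolute_url, filename)
    return saved_path


page = 'http://example.com/downloads/index.html'  # hypothetical page URL

# A root-relative href resolves against the site root.
print(build_download_target(page, '/example/hello/get/9f676bac2bb3.zip'))
# ('http://example.com/example/hello/get/9f676bac2bb3.zip', '9f676bac2bb3.zip')

# An href without a leading slash resolves against the page's directory.
print(build_download_target(page, 'example/hello/get/9f676bac2bb3.zip'))
# ('http://example.com/downloads/example/hello/get/9f676bac2bb3.zip', '9f676bac2bb3.zip')
```

Passing the URL of the page the link was found on, rather than just the domain, is what lets urljoin() resolve both kinds of href correctly.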

UPD (for the different HTML posted in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> BeautifulSoup(data, 'html.parser').find('a', string='XYZ').get('href')
'/example/hello/get/9f676bac2bb3.zip'
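
The question also asks how to fetch a page's source and run the extraction against it, and mentions trouble installing third-party libraries. As one possible approach, here is a minimal sketch that relies only on the standard library (html.parser and urllib.request). The names `LinkCollector`, `extract_hrefs`, and `fetch_html` are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values of <a> tags in the given HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def fetch_html(page_url):
    """Download a page and decode it to text (needs network access)."""
    with urlopen(page_url) as response:
        # Assumption: use UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


data = ('<html> <head> <body><example><example2> '
        '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> '
        '</example2></example></body></head></html>')
print(extract_hrefs(data))  # ['/example/hello/get/9f676bac2bb3.zip']
```

For a live page, something like `extract_hrefs(fetch_html('https://xyz.com'))` would return every link on it (xyz.com being the placeholder site from the question), and the list can then be filtered, for instance by keeping only entries that end in `.zip`.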