Question

I want to extract a URL from a web page's HTML source. For example:

xyz.com source code:
<a rel="nofollow" href="example/hello/get/9f676bac2bb3.zip">Download XYZ</a>

I want to extract:

example/hello/get/9f676bac2bb3.zip

How can I extract this URL?

I don't know regular expressions. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I get errors when I try to install these libraries.

I tried:

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

But this is just a hard-coded HTML sample. How do I fetch a web page's source and run my code against it?

Answer 1

You can use the built-in xml.etree.ElementTree instead:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

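This works for this particular example, but xml.etree.ElementTree is not an HTML parser and will fail on markup that is not well-formed XML. Consider using BeautifulSoup: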

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Alternatively, lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup: it makes HTML parsing easy, transparent, and fun.
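If neither BeautifulSoup nor lxml can be installed (as the question mentions), the standard library's html.parser module may be enough for simple cases. The following is only a minimal sketch, not a replacement for a full parsing library; the names HrefCollector and extract_hrefs are made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; HTMLParser lowercases
        # tag and attribute names before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchor tags in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


html = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
print(extract_hrefs(html))
# ['/example/hello/get/9f676bac2bb3.zip']
```

This needs nothing beyond a stock Python 3 install, at the cost of writing the tag handling by hand.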

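To follow the link and download the file, you need to build the full URL, including the scheme and domain (urljoin() helps with that), and then use urlretrieve(). For example: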

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
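The question also asks how to get the page source in the first place. Here is a rough sketch using only urllib, assuming the page is served over plain HTTP(S) and that UTF-8 is an acceptable fallback encoding. The helper names fetch_html and resolve_links and the PAGE_URL value are hypothetical, and however you extracted the href values, urljoin() treats root-relative and page-relative ones differently, which the example prints.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_html(page_url):
    """Download a page and decode it to text."""
    with urlopen(page_url) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def resolve_links(page_url, hrefs):
    """Turn href values found on page_url into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page address, used only to show how urljoin() resolves links.
PAGE_URL = "http://example.com/downloads/index.html"

links = resolve_links(PAGE_URL, [
    "/example/hello/get/9f676bac2bb3.zip",  # root-relative href
    "example/hello/get/9f676bac2bb3.zip",   # href relative to the page
])
for link in links:
    print(link)
# http://example.com/example/hello/get/9f676bac2bb3.zip
# http://example.com/downloads/example/hello/get/9f676bac2bb3.zip
```

In real use you would call fetch_html() with the page's address, pass the result to whichever parser you chose, and hand each resolved link to urlretrieve().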

UPD (for the different HTML posted in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data, 'html.parser').find('a', text='XYZ').get('href')
>>> href
'/example/hello/get/9f676bac2bb3.zip'

How do I extract a URL from an HTML anchor element using Python 3?