On this page, there are two kinds of URLs under the "Dependencies" list. One kind points to the official package site ("https://archlinux.org/packages/"), the other to the user package site ("https://aur.archlinux.org/packages/"). I want to extract them into two separate lists. Based on {{3}}, this is what I have come up with so far:
import re
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
soup = bs.BeautifulSoup(sauce, 'lxml')

official_dependencies = []
aur_dependencies = []

# Locate the "Dependencies" heading, then collect every later link to the official site.
for h3 in soup.find_all('h3'):
    if "Dependencies" in h3.text:
        for url in h3.find_all_next('a', attrs={'href': re.compile("^https://www.archlinux.org/packages/")}):
            official_dependencies.append(url.get('href'))
This works well for my first goal. But I am not sure how I should extract the AUR dependencies, because their hrefs are relative, like /packages/package_name/, instead of absolute, like https://aur.archlinux.org/packages/package_name/. On top of that, some AUR dependencies appear in parentheses next to an official package name, for example alsa-utils (alsa-utils-transparent), and I want to avoid scraping those alternative AUR packages.
I am relatively new to bs4 and don't know regular expressions, so I am a bit confused about how I should approach this. I would be very glad if someone could show me a way to solve it.
Thanks
Answer 0 (score: 2)
If you are not necessarily set on bs4, you can try lxml.html.
Solution
import urllib.request
from lxml import html

response = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
source = html.fromstring(response)

# Every dependency link lives inside the <ul id="pkgdepslist"> element.
all_links = source.xpath('//ul[@id="pkgdepslist"]/li/a/@href')

# Official links are absolute; AUR links are relative, so prepend the domain.
simple_links = [link for link in all_links if link.startswith('https')]
aur_links = ['https://aur.archlinux.org' + link for link in all_links if not link.startswith('https')]
print(simple_links)
['https://www.archlinux.org/packages/?q=alsa-utils', 'https://www.archlinux.org/packages/?q=gst-python', 'https://www.archlinux.org/packages/?q=pygtk', 'https://www.archlinux.org/packages/?q=python-dbus', 'https://www.archlinux.org/packages/?q=python-docopt', 'https://www.archlinux.org/packages/?q=wmctrl', 'https://www.archlinux.org/packages/?q=python-setuptools', 'https://www.archlinux.org/packages/?q=pulseaudio']
print(aur_links)
['https://aur.archlinux.org/packages/spotify/']
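If you would rather stay with BeautifulSoup, the same filtering can be done there too. The sketch below is a minimal, untested adaptation of the lxml solution; it assumes the same page structure, namely a ul element with id "pkgdepslist" whose li items begin with a link to the dependency itself, with any parenthesized alternatives appearing as later links inside the same li:

import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
soup = bs.BeautifulSoup(sauce, 'lxml')

official_dependencies = []
aur_dependencies = []

dep_list = soup.find('ul', id='pkgdepslist')
if dep_list is not None:
    for li in dep_list.find_all('li'):
        link = li.find('a')  # first link only, so parenthesized alternatives are skipped
        if link is None:
            continue
        href = link.get('href', '')
        if href.startswith('https'):
            official_dependencies.append(href)
        else:
            # AUR hrefs are relative ("/packages/<name>/"), so prepend the domain
            aur_dependencies.append('https://aur.archlinux.org' + href)

Taking only the first a tag per li is the assumption that lets you skip the alternatives without any regex; if the page nests its links differently, that selector would need adjusting.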