Scraping two types of URLs

Date: 2018-11-08 12:45:20

Tags: python web-scraping beautifulsoup

On this page, there are two types of URLs under the "Dependencies" list. One comes from the official package site ("https://archlinux.org/packages/") and the other from the user package site ("https://aur.archlinux.org/packages/"). I want to extract them into two separate lists. Following {{3}}, what I have come up with so far looks like this:

import re
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
soup = bs.BeautifulSoup(sauce, 'lxml')
official_dependencies = []
aur_dependencies = []

# Under the "Dependencies" heading, collect every link that points at the
# official package site.
for h3 in soup.find_all('h3'):
    if "Dependencies" in h3.text:
        for url in h3.find_all_next('a', attrs={'href': re.compile("^https://www.archlinux.org/packages/")}):
            official_dependencies.append(url.get('href'))

This works well for my first goal. But I am not sure how I should extract the AUR dependencies, since their hrefs are relative, like /packages/package_name/ instead of https://aur.archlinux.org/packages/package_name/. On top of that, some AUR alternatives appear in parentheses next to the official package names, for example alsa-utils (alsa-utils-transparent). I want to avoid scraping those alternative AUR packages.

I am relatively new to bs4 and do not know regex, so I am a bit confused about how to approach this. I would be very glad if someone could show me a way to solve it.

Thanks
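A minimal sketch of one way this could be done with bs4 alone, without regex — assuming (as the answer below also does) that the dependencies sit in a <ul id="pkgdepslist"> and that each parenthesized alternative appears as an extra <a> inside the same <li>, so taking only the first link of each entry skips them:

import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
soup = bs.BeautifulSoup(sauce, 'lxml')

official_dependencies = []
aur_dependencies = []

# Assumption: the dependency list is <ul id="pkgdepslist"> and the first
# <a> of each <li> is the dependency itself; parenthesized alternatives
# are later <a> elements in the same <li> and are therefore skipped.
deps = soup.find('ul', attrs={'id': 'pkgdepslist'})
if deps is not None:
    for li in deps.find_all('li'):
        a = li.find('a')  # first link only
        if a is None:
            continue
        href = a.get('href')
        if href.startswith('https'):
            official_dependencies.append(href)
        else:
            # AUR hrefs are relative, e.g. /packages/spotify/
            aur_dependencies.append('https://aur.archlinux.org' + href)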

1 answer:

Answer 0: (score: 2)

If you do not necessarily have to stick with bs4, you can try an lxml.html solution:

import urllib.request
from lxml import html

response = urllib.request.urlopen("https://aur.archlinux.org/packages/blockify/").read()
source = html.fromstring(response)

# All dependency links live inside <ul id="pkgdepslist">.
all_links = source.xpath('//ul[@id="pkgdepslist"]/li/a/@href')

# Official links are absolute; AUR links are relative and need the site prefix.
simple_links = [link for link in all_links if link.startswith('https')]
aur_links = ['https://aur.archlinux.org' + link for link in all_links if not link.startswith('https')]

print(simple_links)
['https://www.archlinux.org/packages/?q=alsa-utils', 'https://www.archlinux.org/packages/?q=gst-python', 'https://www.archlinux.org/packages/?q=pygtk', 'https://www.archlinux.org/packages/?q=python-dbus', 'https://www.archlinux.org/packages/?q=python-docopt', 'https://www.archlinux.org/packages/?q=wmctrl', 'https://www.archlinux.org/packages/?q=python-setuptools', 'https://www.archlinux.org/packages/?q=pulseaudio']

print(aur_links)
['https://aur.archlinux.org/packages/spotify/']
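If the parenthesized alternatives from the question also show up as extra links inside a dependency entry, a variation on the snippet above (an assumption about the page markup, not something the original answer covers) is to restrict the XPath to the first <a> of each <li>:

# Keep only the first link of each dependency entry, so any parenthesized
# alternatives (extra links in the same <li>) are skipped.
first_links = source.xpath('//ul[@id="pkgdepslist"]/li/a[1]/@href')

simple_links = [link for link in first_links if link.startswith('https')]
aur_links = ['https://aur.archlinux.org' + link for link in first_links if not link.startswith('https')]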