Parse all sub-pages of a link and download the specific files they contain

Asked: 2019-05-28 19:39:19

Tags: python regex urllib

I have written some code to download the free PDFs from a page so that I don't have to do it manually, as I did for one specific category/subset (e.g. https:// www ... ....... / MagPi01). Now I want to extend the code to follow every link reachable from the main page, not only the ones matching "MagPi{}".format(1, 2, 3, ...) but anything matching a regular expression such as (.*). I am trying it this way (with regex) for educational purposes. A rough sketch of what I have in mind for the regex part is shown after my current code below.

import urllib.request
import os

# Create the target directory for the downloaded PDFs.
path = "C:/Users/kosmas/Desktop/MagPi"
try:
    os.mkdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print("Successfully created the directory %s" % path)

# Issues 1-9 are zero-padded (MagPi01.pdf ... MagPi09.pdf).
try:
    i = 1
    while i < 10:
        url = "https://www.raspberrypi.org/magpi-issues/MagPi0{}.pdf".format(i)
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(url, "C:/Users/kosmas/Desktop/MagPi/MagPi0{}.pdf".format(i))
        print("MagPi0{}.pdf created successfully".format(i))
        i = i + 1
except:
    print('Something went wrong')

# Issues 10-81 are not zero-padded (MagPi10.pdf ... MagPi81.pdf).
try:
    i = 10
    while i < 82:
        url = "https://www.raspberrypi.org/magpi-issues/MagPi{}.pdf".format(i)
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(url, "C:/Users/kosmas/Desktop/MagPi/MagPi{}.pdf".format(i))
        print("MagPi{}.pdf created successfully".format(i))
        i = i + 1
except:
    print('Something went wrong')
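This is roughly the regex-based direction I am considering: fetch the index page once, collect every href that ends in .pdf with a regular expression, and download each match instead of constructing the file names by hand. It is only an untested sketch; the index URL and the exact link pattern in the HTML are assumptions on my part.

import re
import urllib.request
from urllib.parse import urljoin

# Assumed index page that lists the issue PDFs as plain href links.
index_url = "https://www.raspberrypi.org/magpi-issues/"

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

# Fetch the index page HTML.
html = urllib.request.urlopen(index_url).read().decode('utf-8', errors='ignore')

# Match any href ending in .pdf, not just MagPi{}.pdf.
pdf_names = re.findall(r'href="([^"]*\.pdf)"', html)

for name in pdf_names:
    pdf_url = urljoin(index_url, name)  # handle relative links
    print(pdf_url)  # later: urllib.request.urlretrieve(pdf_url, local_path)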

0 Answers:

No answers yet.