Question

I need a way to download all the PDF files found at a given URL. I found a script that supposedly does this (I have not tested it):

import urlparse
import urllib2
import os
from bs4 import BeautifulSoup

url = "https://...."
download_path = "."  # not defined in the original script; current directory assumed

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0"}

i = 0

# Fetch the page with a browser-like User-Agent
request = urllib2.Request(url, None, headers)
html = urllib2.urlopen(request)
soup = BeautifulSoup(html.read())

for tag in soup.findAll("a", href=True):
    # Resolve relative links against the page URL
    tag["href"] = urlparse.urljoin(url, tag["href"])
    if os.path.splitext(os.path.basename(tag["href"]))[1] == ".pdf":
        current = urllib2.urlopen(tag["href"])
        print("\n[*] Downloading: %s" % os.path.basename(tag["href"]))
        f = open(os.path.join(download_path, os.path.basename(tag["href"])), "wb")
        f.write(current.read())
        f.close()
        i += 1

print("\n[*] Downloaded %d files" % i)
raw_input("[+] Press any key to exit ... ")

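The problem is that I have Python 3.3 installed, and this script does not run under Python 3.3. For example, urllib2 is not available in Python 3.3.

Can you tell me how to modify the script so that it is compatible with Python 3.3?

I would really appreciate your help.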

Answer 1

Why not do it as a 3-line shell script that needs only one Perl module?

mech-dump --links http://domain.tld/path |
grep -i '\.pdf$' |
xargs -n1 wget

On Debian and its derivatives, the package is libwww-mechanize-perl.

Answer 2

Why not a bash one-liner: wget -r -l1 -A.pdf http://www.example.com/page-with-pdfs.htm

Answer 3

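For Python 3, you should use import urllib.request instead of urllib2. It is also important to first inspect the HTML source of the URL you want to parse. For example, some pages may have an og_url attribute while others may not. Depending on this, the way you extract the PDF links may differ.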

A quick solution, with a detailed explanation of downloading the PDFs, is available here:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
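
As a rough sketch of the port the question asks for (this is not taken from the linked article), the script's urllib2 and urlparse calls map onto urllib.request and urllib.parse in Python 3. To keep the sketch free of third-party dependencies, the standard library's html.parser is swapped in for BeautifulSoup; the names LinkCollector, find_pdf_links, and download_pdfs, and the default download_path of ".", are assumptions made for illustration.

```python
import os
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# User-Agent header copied from the script in the question.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) "
                         "Gecko/20100101 Firefox/38.0"}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_pdf_links(html, base_url):
    """Return absolute URLs of all links whose path ends in .pdf."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL.
        absolute = urllib.parse.urljoin(base_url, href)
        # Look only at the path so a query string does not hide the extension.
        path = urllib.parse.urlparse(absolute).path
        if os.path.splitext(path)[1].lower() == ".pdf":
            links.append(absolute)
    return links


def download_pdfs(url, download_path="."):
    """Download every PDF linked from url; return the number of files saved."""
    request = urllib.request.Request(url, None, HEADERS)
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    count = 0
    for link in find_pdf_links(html, url):
        name = os.path.basename(urllib.parse.urlparse(link).path)
        print("[*] Downloading: %s" % name)
        pdf_request = urllib.request.Request(link, None, HEADERS)
        with urllib.request.urlopen(pdf_request) as pdf, \
                open(os.path.join(download_path, name), "wb") as f:
            f.write(pdf.read())
        count += 1
    return count
```

Calling download_pdfs with the page address and a target directory performs the downloads and returns the file count. Link extraction lives in its own function so it can be checked without network access, and if the page marks up its PDF links differently, that is the only part that should need to change.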

Answer 4

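As someone pointed out, a shell script may be a better way to achieve your goal.

However, if you are set on doing this in Python, you can leave your Python 3.3 environment as it is and set up a so-called "virtual environment". Inside a virtual environment you can have whatever Python version and libraries you need, and it will not interfere with your current Python installation.

There is a good tutorial here for getting started with virtual environments.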

Download all PDF files from a URL using Python
