Question

I have a set of links to PDF files, for example:

https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf

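Some of them are restricted, meaning I won't be able to access the PDF file, while others go directly to the PDF itself, like the link above.

I'm currently using the requests package (Python) to access the files, but there are far too many files for me to download, and I don't want the files in PDF format anyway.

What I'd like to do is go to each link, check whether it points to a PDF file, download the file (if necessary), convert it to a txt file, and then delete the original PDF.

I have a very good shell script for converting PDF to txt, but is it possible to run a shell script from Python?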

Answer 1

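Yes! It's entirely possible to run shell scripts from Python. Take a look at the subprocess module, which lets you spawn a process much as you would from a shell: https://docs.python.org/2/library/subprocess.html

For example: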

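import subprocess

# Run an external command and capture its standard output.
process = subprocess.Popen(["echo", "message"], stdout=subprocess.PIPE)

# communicate() waits for the process to exit and returns (stdout, stderr).
print(process.communicate())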

There are many tutorials, for example: http://www.bogotobogo.com/python/python_subprocess_module.php
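
Applied to the conversion step in the question, a minimal sketch might look like the following. It assumes Python 3.5 or later (for subprocess.run) and a converter that takes the input PDF path and the output txt path as its two arguments. The function name convert_and_delete and the script name pdf2txt.sh used below are hypothetical, since the question does not show the actual script.

```python
import os
import subprocess


def convert_and_delete(pdf_path, converter_cmd):
    """Convert pdf_path to a .txt file next to it, then delete the PDF.

    converter_cmd is the command (as a list) that performs the conversion.
    It is assumed to accept the input path and the output path as its
    last two arguments; adjust this to match the real script.
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # check=True raises CalledProcessError if the converter exits non-zero,
    # so the PDF is only removed after a successful conversion.
    subprocess.run(list(converter_cmd) + [pdf_path, txt_path], check=True)
    os.remove(pdf_path)
    return txt_path
```

A call such as convert_and_delete("oppgave-2003-10-30.pdf", ["sh", "pdf2txt.sh"]) would then run the script through sh. Passing the command as a list avoids shell quoting problems with file names that contain spaces.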

Answer 2

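Kieran Bristow has answered the part of the question about how to run an external program from Python.

The other part of the question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers an alternate representation of its documents (e.g. a text version), you will need to download the document. To avoid downloading non-PDF documents, you can send an initial HEAD request and look at the response headers to determine the content-type, like this: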

import os.path
import requests

session = requests.session()

for url in [
    'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
    'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
    try:
        # HEAD fetches only the headers, so nothing is downloaded yet.
        resp = session.head(url, allow_redirects=True)
        resp.raise_for_status()
        # The header may be missing or carry parameters such as a charset.
        content_type = resp.headers.get('content-type', '').lower()
        if content_type.startswith('application/pdf'):
            resp = session.get(url)
            if resp.ok:
                filename = os.path.basename(url)
                with open(filename, 'wb') as outfile:
                    outfile.write(resp.content)
                print("Saved {} to file {}".format(url, filename))
            else:
                print('GET request for URL {} failed with HTTP status "{} {}"'.format(url, resp.status_code, resp.reason))
    except requests.HTTPError as exc:
        print("HEAD failed for URL {} : {}".format(url, exc))
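
Since the question mentions a large number of files, it may also help to avoid holding each PDF fully in memory, as resp.content does. The sketch below is one possible variation, not part of the original answer: it writes a streamed response to disk in chunks using the iter_content method of a requests response. The helper name save_response_body and the chunk size are assumptions chosen for illustration.

```python
def save_response_body(resp, path, chunk_size=8192):
    """Write the body of a streamed response to path, chunk by chunk.

    resp is expected to be a requests response obtained with stream=True
    (any object with a compatible iter_content method also works).
    """
    with open(path, "wb") as outfile:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            outfile.write(chunk)
    return path
```

To use it in the loop above, the GET request would be issued as session.get(url, stream=True) and the response passed to save_response_body(resp, os.path.basename(url)) in place of the outfile.write(resp.content) call.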

PDF to txt from an HTTP request
