Question

I want to download about 1,000 PDF files from a web page, but I ran into an awkward PDF URL format. Neither requests.get() nor urllib.request.urlretrieve() works for me.

A typical PDF URL looks like this:

https://webpage.com/this_file.pdf

But this URL looks like this:

https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01

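So there is no .pdf in the URL, yet clicking the link in a browser downloads the file. With Python's urllib, however, I get a corrupted file.

At first I thought the request was being redirected to another URL, so I tried requests.get(url, allow_redirects=True), but the result was the same as before.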

import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'

# Save the response body to filename.
urllib.request.urlretrieve(url, filename)

This code downloads a corrupted PDF file.
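
A file that looks "corrupted" after a download like this is sometimes not a damaged PDF at all but a different kind of response (for example an HTML error page) saved under a .pdf name. As a rough way to check, the sketch below reads the first bytes of the saved file; a well-formed PDF normally begins with the marker %PDF-. The function name describe_download is made up for illustration, and the HTML check is only a heuristic.

```python
def describe_download(path: str) -> str:
    """Guess what a downloaded file contains from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(1024)
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Heuristic: an HTML page usually starts with a doctype or an <html> tag.
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    if not head:
        return "empty"
    return "unknown"
```

Calling describe_download('./novel/pdf1.pdf') on the file saved above would show whether the server actually sent PDF bytes.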

Answer 1

I solved it by writing out the content field of the response object returned by requests.get().


import requests

filename = './novel1/pdf1.pdf'
url = . . .

# Fetch the file and write the raw response bytes to disk.
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)

See this Q&A: Download and save PDF file with Python requests module
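
Since the question is about roughly 1,000 files, the same idea can be wrapped in a loop. The following is only a sketch under assumptions that the thread does not confirm: that the files differ only in the wrtSn query parameter, and that a valid response body starts with %PDF-. The names build_url and download_all are hypothetical. The session argument is any object with a requests-style get() method, such as requests.Session().

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do"


def build_url(wrt_sn: int) -> str:
    """Build a download URL; assumes only wrtSn changes between files."""
    query = urlencode({"wrtSn": wrt_sn, "fileSn": 1, "wrtFileTy": "01"})
    return f"{BASE_URL}?{query}"


def download_all(session, wrt_sns, out_dir):
    """Save each file as <out_dir>/<wrtSn>.pdf and return the IDs that failed.

    session is any object with a requests-style get(), e.g. requests.Session().
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for wrt_sn in wrt_sns:
        response = session.get(build_url(wrt_sn), timeout=30)
        # Skip HTTP errors and bodies that do not start with the PDF marker.
        if response.status_code != 200 or not response.content.startswith(b"%PDF-"):
            failed.append(wrt_sn)
            continue
        with open(os.path.join(out_dir, f"{wrt_sn}.pdf"), "wb") as f:
            f.write(response.content)
    return failed
```

With requests installed, download_all(requests.Session(), [9031938], './novels') would attempt the one file from the question; the returned list shows which IDs need a second look.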

urllib.request.urlretrieve returns a corrupted file (how to handle this kind of URL?)
