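I am trying to download a large number of PDFs online (4000+) using this code. The code works fine for some files, but for others (almost half) the downloaded file is corrupted and I get the error: "File type HTML document (text/html) is not supported". Please suggest what changes I should make.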
import time

import pandas as pd
import requests

lis = pd.read_csv("/home/harshit/geography/equitylist.csv")  # list of all equities on BSE
for i in lis["Security Code"]:
    link = "https://www.bseindia.com/bseplus/AnnualReport/" + str(i) + "/" + str(i) + "0318.pdf"
    r = requests.get(link)  # getting and saving annual report
    row = lis.loc[lis["Security Code"] == i]
    name = row.iloc[0]["Security Id"]
    with open("reports2018incog/" + name + ".pdf", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
    time.sleep(2)  # pause between requests
Answer 0 (score: 1)
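Some of the links are probably broken, or they return a redirect page or a 404 error page. As the error message suggests, you are requesting a PDF file but not actually receiving one, so I suggest checking whether the response really is a PDF before saving it.
1) Check the headers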
In [19]: page = requests.get("https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf")
In [20]: page.headers
Out[20]: {'Content-Type': 'application/pdf', 'Content-Length': '88226', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=86400', 'Last-Modified': 'Wed, 05 Jan 2005 19:56:38 GMT', 'Accept-Ranges': 'bytes', 'X-Adobe-Loc': 'uw2', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'max-age=21590', 'Expires': 'Wed, 23 Jan 2019 04:53:53 GMT', 'Date': 'Tue, 22 Jan 2019 22:54:03 GMT', 'Connection': 'keep-alive'}
In [21]: page.headers['Content-Type']
Out[21]: 'application/pdf'
So a simple if condition before saving the file would be a good start! Here is the revised code for your specific problem.
import time

import pandas as pd
import requests

lis = pd.read_csv("/home/harshit/geography/equitylist.csv")  # list of all equities on BSE
for i in lis["Security Code"]:
    link = "https://www.bseindia.com/bseplus/AnnualReport/" + str(i) + "/" + str(i) + "0318.pdf"
    r = requests.get(link)  # getting and saving annual report
    # Save only when the server says the body is a PDF (header names are case-insensitive)
    if r.headers.get("Content-Type") == "application/pdf":
        row = lis.loc[lis["Security Code"] == i]
        name = row.iloc[0]["Security Id"]
        with open("reports2018incog/" + name + ".pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        time.sleep(2)  # pause between downloads
    else:
        print(f"Oops! Unable to process {link}")
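If the header check alone turns out to be too loose or too strict (for example, a server that appends parameters to the Content-Type, or one that labels an error page as a PDF), a slightly stricter test might also look at the status code and the first bytes of the body. The following is only a sketch under those assumptions: `looks_like_pdf` is a hypothetical helper name, not part of requests, and it relies on PDF files normally starting with the signature `%PDF-`.

```python
def looks_like_pdf(status_code, content_type, first_bytes):
    """Return True if a response plausibly carries a PDF (hypothetical helper)."""
    # Anything other than 200 OK is treated as a failed download.
    if status_code != 200:
        return False
    # Content-Type may carry parameters, e.g. "application/pdf; charset=binary",
    # so compare only the media type, ignoring case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/pdf":
        return False
    # PDF files normally begin with the signature "%PDF-".
    return first_bytes.startswith(b"%PDF-")


if __name__ == "__main__":
    print(looks_like_pdf(200, "application/pdf", b"%PDF-1.4\n"))
    print(looks_like_pdf(200, "text/html; charset=utf-8", b"<!DOC"))
```

Inside the loop above it could replace the plain header comparison, called as `looks_like_pdf(r.status_code, r.headers.get("Content-Type", ""), r.content[:5])`.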