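How to change the name that corresponds to a PDF link:

Time: 2017-05-19 10:06:17

Tags: python csv web-scraping python-requests

How can I replace the name of a downloaded PDF file with the name that precedes its link?

I want to save it as elkinson.pdf instead of Elkinson%20Jeffrey.pdf

The CSV file looks like this: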

  

elkinson https://www.adndrc.org/diymodule/doc_panellist/Elkinson%20Jeffrey.pdf

     

papers_report http://www.parliament.bm/uploadedFiles/Content/House_Business/Presentation_of_Papers_and_of_Reports/PCA%20Report%209262014.pdf

Code:

import os
import csv
import requests

write_path = 'C:\\Users\\hgdht\\Desktop\\Downloader_Automation'  # ASSUMING THAT FOLDER EXISTS!

with open('Links.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for link in spamreader:
        if not link:
            continue
        print('-'*72)
        pdf_file = link[0].split('/')[-1]
        with open(os.path.join(write_path, pdf_file), 'wb') as pdf:
            try:
                # Try to request PDF from URL
                print('TRYING {}...'.format(link[0]))
                a = requests.get(link[0], stream=True)
                for block in a.iter_content(512):
                    if not block:
                        break

                    pdf.write(block)
                print('OK.')
            except requests.exceptions.RequestException as e:  # This will catch ONLY Requests exceptions
                print('REQUESTS ERROR:')
                print(e)  # This should tell you more details about the error
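
The loop above takes pdf_file from the last segment of the URL, which is why the saved file ends up as Elkinson%20Jeffrey.pdf. Below is a minimal sketch of building the destination from the name column instead. It assumes csv.reader yields two fields per row, [name, url]; if Links.csv is whitespace-separated as the sample suggests, passing delimiter=' ' to csv.reader may be needed. The helper name target_path is hypothetical and not part of the original code.

```python
import os


def target_path(row, write_path):
    """Return (url, destination path) for one CSV row.

    Assumption: row is [name, url], e.g. ['elkinson', 'https://...pdf'].
    """
    name, url = row[0].strip(), row[1].strip()
    # Reuse the extension of the last URL segment; fall back to .pdf if it has none.
    ext = os.path.splitext(url.split('/')[-1])[1] or '.pdf'
    return url, os.path.join(write_path, name + ext)


url, path = target_path(
    ['elkinson', 'https://www.adndrc.org/diymodule/doc_panellist/Elkinson%20Jeffrey.pdf'],
    'Downloader_Automation',
)
print(os.path.basename(path))  # elkinson.pdf
```

Inside the loop, the returned url would go to requests.get and the returned path to open(..., 'wb'), in place of link[0] and os.path.join(write_path, pdf_file).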

1 Answer:

Answer 0: (score: 0)

In your code, the variable pdf_file holds the file name taken from the end of the URL (for example PCA%20Report%209262014.pdf), so you can use a Python regex to replace the special %20 sequence with a space:

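pdf_file = re.sub(r'%20', ' ', pdf_file).lower()

Match the literal %20 here. A pattern such as %[\d]+ would also swallow any digits that follow the escape (9262014 in the example below).

Example:

import re
pdf_file = "Presentation_of_Papers_and_of_Reports/PCA%20Report%209262014.pdf"
pdf_file = re.sub(r'%20', ' ', pdf_file).lower()

Output: 'presentation_of_papers_and_of_reports/pca report 9262014.pdf'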
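
Sequences such as %20 are URL percent-encoding, so as an alternative to a regex, the standard library's urllib.parse.unquote can decode all of them in one call. The sketch below assumes Python 3; the helper name decoded_name is hypothetical.

```python
from urllib.parse import unquote


def decoded_name(url):
    """Return the last URL segment with percent-escapes decoded, lowercased."""
    return unquote(url.split('/')[-1]).lower()


url = ("http://www.parliament.bm/uploadedFiles/Content/House_Business/"
       "Presentation_of_Papers_and_of_Reports/PCA%20Report%209262014.pdf")
print(decoded_name(url))  # pca report 9262014.pdf
```

Because unquote decodes exactly two hex digits after each %, the digits that follow %20 in the file name are left intact.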