Unable to store downloaded files in their respective folders

Date: 2019-02-11 08:27:01

Tags: python python-3.x selenium selenium-webdriver web-scraping

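I've written a script in Python, in combination with Selenium, to download a few document files (ending in .doc) from a web page. The reason I don't want to use the requests or urllib modules to download the files is that the website I'm actually working with doesn't attach a real URL to each file; the links are JavaScript-encrypted. For this script, however, I've picked a link that mimics the same situation.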

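What my script does at the moment:

  1. Creates a master folder on the desktop
  2. Creates subfolders inside the master folder, each named after a file to be downloaded
  3. Downloads the files by clicking on their links and puts them in the master folder (this is what I need rectified)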
  

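How can I modify my script so that it downloads the files by clicking on their links and stores each one in its respective folder?

This is my attempt so far: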

import os
import time
from selenium import webdriver

link ='https://www.online-convert.com/file-format/doc' 

dirf = os.path.expanduser('~')
desk_location = dirf + r'\Desktop\file_folder'
if not os.path.exists(desk_location):os.mkdir(desk_location)

def download_files():
    driver.get(link)
    for item in driver.find_elements_by_css_selector("a[href$='.doc']")[:2]:
        filename = item.get_attribute("href").split("/")[-1]
        #create a new folder named after the file, so the download can be stored in its own folder
        folder_name = item.get_attribute("href").split("/")[-1].split(".")[0]
        #set the new location of the folders to be created
        new_location = os.path.join(desk_location,folder_name)
        if not os.path.exists(new_location):os.mkdir(new_location)
        #set the location of the folders the downloaded files will be within
        file_location = os.path.join(new_location,filename)
        item.click()

        time_to_wait = 10
        time_counter = 0
        try:
            while not os.path.exists(file_location):
                time.sleep(1)
                time_counter += 1
                if time_counter > time_to_wait:break
        except Exception:pass

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory' : desk_location,
            'profile.default_content_setting_values.automatic_downloads': 1
        }
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_files()

The image below shows how the downloaded files are currently stored (the files are outside of their respective folders):

[image not available]

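3 Answers:

Answer 0 (score: 6)

I just added a rename of the file to move it. It works the same way as your version, but once a file has been downloaded, it is moved to the correct path: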

os.rename(desk_location + '\\' + filename, file_location)

Full code:

import os
import time
from selenium import webdriver

link ='https://www.online-convert.com/file-format/doc' 

dirf = os.path.expanduser('~')
desk_location = dirf + r'\Desktop\file_folder'
if not os.path.exists(desk_location):
    os.mkdir(desk_location)

def download_files():
    driver.get(link)
    for item in driver.find_elements_by_css_selector("a[href$='.doc']")[:2]:
        filename = item.get_attribute("href").split("/")[-1]
        #create a new folder named after the file, so the download can be stored in its own folder
        folder_name = item.get_attribute("href").split("/")[-1].split(".")[0]
        #set the new location of the folders to be created
        new_location = os.path.join(desk_location,folder_name)
        if not os.path.exists(new_location):
            os.mkdir(new_location)
        #set the location of the folders the downloaded files will be within
        file_location = os.path.join(new_location,filename)
        item.click()

        time_to_wait = 10
        time_counter = 0

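        try:
            #wait for the file to appear in the master folder (where the browser saves it),
            #not in its final location, which cannot exist before the move
            while not os.path.exists(desk_location + '\\' + filename):
                time.sleep(1)
                time_counter += 1
                if time_counter > time_to_wait:break
            os.rename(desk_location + '\\' + filename, file_location)
        except Exception:pass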

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory' : desk_location,
            'profile.default_content_setting_values.automatic_downloads': 1
        }
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_files()
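
The wait-then-move step can also be pulled out into a small helper. The following is only a sketch under stated assumptions: the name `wait_and_move`, its parameters, and the use of `shutil.move` instead of `os.rename` are illustrative choices and not part of the answer above. It polls the download folder for the finished file and returns `None` on timeout instead of silently swallowing errors.

```python
import os
import shutil
import time


def wait_and_move(download_dir, filename, target_dir, timeout=10, poll=0.5):
    """Wait for `filename` to appear in `download_dir`, then move it to `target_dir`.

    Hypothetical helper: returns the new path, or None if the file
    did not show up within `timeout` seconds.
    """
    source = os.path.join(download_dir, filename)
    deadline = time.monotonic() + timeout
    # Assumption: the browser writes to a temporary name while downloading
    # (Chrome uses a .crdownload suffix), so the final name only appears
    # once the download is complete.
    while not os.path.exists(source):
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
    os.makedirs(target_dir, exist_ok=True)
    return shutil.move(source, os.path.join(target_dir, filename))
```

In the loop above, a call such as `wait_and_move(desk_location, filename, new_location)` could stand in for the `try` block.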

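Answer 1 (score: 0)

Use this code when declaring the driver object (this is for Java; Python has a similar way to do it). It downloads files to the specified location every time.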

    // Create the preferences object
    HashMap<String, Object> chromePrefs = new HashMap<String, Object>();
    // Set the download path
    chromePrefs.put("download.default_directory", "C:\\Reports\\AutomaionDownloads");
    chromePrefs.put("download.directory_upgrade", true);
    ChromeOptions options = new ChromeOptions();
    options.setExperimentalOption("prefs", chromePrefs);
    // Start the Chrome driver
    WebDriver driver = new ChromeDriver(options);
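
For comparison, a rough Python counterpart of the Java snippet might look like the sketch below. The helper name `build_download_prefs` is made up for illustration; the preference keys are the ones already used in this answer and in the question. As far as I can tell, these preferences fix one download folder for the whole browser session, so files would still have to be moved afterwards to end up in per-file folders.

```python
import os


def build_download_prefs(download_dir):
    """Return Chrome preferences that send downloads to `download_dir`.

    Hypothetical helper; the keys mirror the Java snippet above.
    """
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.directory_upgrade": True,
    }


# Assumed usage with Selenium (not run here):
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", build_download_prefs("downloads"))
#   driver = webdriver.Chrome(options=options)
```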

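Answer 2 (score: -1)

Use the pathlib library in Python 3, or the pathlib2 library in Python 2, to handle paths. It provides an object-oriented way to work with files and directories. It also has the PurePath object, which can work with paths without touching the file system.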
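
As a rough illustration of this suggestion, the sketch below uses pathlib to derive the per-file folder from a file name and move a downloaded file into it. The function name `move_into_own_folder` and the sample paths are assumptions made for the example; they are not part of the original answer.

```python
from pathlib import Path, PureWindowsPath


def move_into_own_folder(base_dir, filename):
    """Move base_dir/filename into base_dir/<stem>/filename and return the new path."""
    base = Path(base_dir)
    source = base / filename
    # Path.stem drops the final suffix: "example.doc" -> "example"
    target_dir = base / source.stem
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    source.rename(target)
    return target


# PurePath objects only manipulate the path text; nothing is read from disk.
pure = PureWindowsPath(r"C:\Users\me\Desktop\file_folder") / "example" / "example.doc"
print(pure.parent.name)  # example
print(pure.suffix)       # .doc
```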