Question

Platform:
  Python version: 3.7.3
  Selenium version: 3.141.0
  OS: Win7

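Problem:
I have a list of URLs in a text file, one URL per line. The URLs are download links. I want to iterate over all the URLs and download the file linked to each one into a specific folder.

The code I tried is a nested for-while loop. The first iteration works without any problem, but the second iteration gets stuck in a while loop.

There is obviously a better way to do what I am trying to do. I am just a beginner in Python, so I am trying to learn as much of the language as I can.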

My URL list:

https://mega.nz/#!bOgBWKiB!AWs3JSksW0mpZ8Eob0-Qpr5ZAG0N1zhoFBFVstNJfXs
https://mega.nz/#!qPxGAAYJ!BX-hv7jgE4qvBs_uhHPVpsLRm1Yl4JkZ17nI1-U6hvk
https://mega.nz/#!GPoiHaaT!TAKT4sOhIiMUSFFSmlvPOidMcscXzHH_8HgK27LyTRM

The code I tried:

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from pathlib import Path
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
binary = FirefoxBinary('C:\\Program Files\\Mozilla Firefox\\firefox.exe')
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "H:\\downloads")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip")
driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=fp, executable_path=r'C:\\Program Files\\Python\\Python37\\Lib\\site-packages\\selenium\\webdriver\\firefox\\geckodriver.exe')
driver.set_window_size(1600, 1050)
with open("H:\\downloads\\my_url_list.txt", "r") as f:
    for url in f:
        driver.get(url.strip())
        sleep(5)
        while True:
            # checks whether the element is available on the page; used 'while' instead of
            # a wait because I couldn't figure out the wait time.
            try:
                content = driver.find_element_by_css_selector('div.buttons-block:nth-child(1) > div:nth-child(2)')
                break
            except NoSuchElementException:
                continue
        # used 'execute_script' instead of 'click()' due to "scroll into view error"
        driver.execute_script("arguments[0].click();", content)
        sleep(5)
        while True:
            # checks whether 'filename' element is available on the page, the page shows multiple elements depending on interaction.
            if driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"):
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]").text
                break
            elif driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"):
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]").text
                break
            else:
                sleep(5)
        print(filename)
        dirname = 'H:\\downloads'
        suffix = '.zip'
        file_path = Path(dirname, filename).with_suffix(suffix)
        while True:
            # checks whether the file has downloaded into the folder.
            if os.path.isfile(file_path):
                break

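What happens

The first iteration works: the file linked to the URL is downloaded into the H:\downloads folder and the filename is printed.

In the second iteration, the file is downloaded into the folder, but the filename is not printed, and the second while loop runs indefinitely.

No iteration takes place after the second run, because the filename cannot be retrieved in the second iteration and the loop never exits.

The second while loop from the code above: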

while True:  
            # checks whether 'filename' element is available on the page, the page shows multiple elements depending on interaction.  
            if driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"):  
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]").text  
                break  
            elif driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"):  
                filename = driver.find_element_by_xpath("/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]").text  
                break  
            else:  
                sleep(5)

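Images attached for the filename XPath options (the reason for choosing two different XPaths for the filename):

while loop, first option

while loop, second option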

Answer 1

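What you are looking for is an explicit wait. I suggest you visit this page from the Selenium-python documentation. I quote from the page:

An explicit wait is code you define to wait for a certain condition to occur before proceeding further in the code. The extreme case of this is time.sleep(), which sets the condition to an exact time period to wait. There are some convenience methods provided that help you write code that will wait only as long as required. WebDriverWait in combination with ExpectedCondition is one way this can be accomplished.

If you want to learn more about ExpectedCondition, you can visit this documentation link.

For your case I suggest the code below, which uses a lambda function because you are waiting for at least one of two elements.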

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
xpath1 = "/html/body/div[6]/div[3]/div/div[1]/div[4]/div[1]/div/span[1]"
xpath2 = "/html/body/div[6]/div[3]/div/div[1]/div[5]/div/div/div[1]/div[1]/div[2]/div[3]/div[1]/span[1]"
time_limit = 15  # seconds; you really need to set a timeout
# find_elements returns a list (empty when nothing matches), so `or` falls
# through to the second XPath. Note the constant is By.XPATH, not By.xpath.
elements = WebDriverWait(driver, time_limit).until(
    lambda d: d.find_elements(By.XPATH, xpath1) or d.find_elements(By.XPATH, xpath2)
)
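
To see why the lambda works, here is a minimal sketch with plain lists standing in for Selenium results (no_match and one_match are made-up helpers, not Selenium calls). find_elements returns a list, an empty list is falsy, so `or` yields the first non-empty result, and a falsy overall result tells the wait to keep polling.

```python
# Stand-ins for driver.find_elements(...): each returns a list of matches.
def no_match():
    return []


def one_match():
    return ["<span>"]


first_only = one_match() or no_match()   # first XPath matched
second_only = no_match() or one_match()  # only the second XPath matched
neither = no_match() or no_match()       # nothing matched yet

print(first_only)     # ['<span>']
print(second_only)    # ['<span>']
print(bool(neither))  # False, so the wait would poll again
```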

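This waits up to 15 seconds and then throws a TimeoutException, unless it finds one of the elements you are waiting for by XPath first. By default, WebDriverWait calls the ExpectedCondition every 500 milliseconds until it returns successfully, so you don't need to handle the logic and loops yourself the way you were trying to.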
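
To make the polling behaviour concrete, here is a rough pure-Python sketch of the same idea. It is not Selenium's actual implementation; wait_until and its parameters are names I made up for illustration.

```python
import time


def wait_until(condition, timeout=15.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause between checks instead of spinning the CPU
```

The same helper could probably stand in for the busy `while True` loop that checks for the downloaded file, for example `wait_until(file_path.is_file, timeout=600, poll=1)`, where the 600-second limit is an arbitrary assumption on my part.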

To handle the TimeoutException, you can for example refresh the page.
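
One possible way to do the refresh-and-retry, as a sketch under assumptions: retry_with_refresh, wait_for and attempts are hypothetical names, and the import fallback is only there so the snippet can run without Selenium installed.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Stand-in (assumption) so the sketch can be run without Selenium installed.
    class TimeoutException(Exception):
        pass


def retry_with_refresh(driver, wait_for, attempts=3):
    """Run wait_for(driver); on TimeoutException refresh the page and try again.

    Re-raises the TimeoutException if the last attempt also times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return wait_for(driver)
        except TimeoutException:
            if attempt == attempts:
                raise
            driver.refresh()  # reload the page before the next attempt
```

Here wait_for would be a function that wraps the explicit wait, e.g. `lambda d: WebDriverWait(d, 15).until(lambda dd: dd.find_elements(By.XPATH, xpath1) or dd.find_elements(By.XPATH, xpath2))`. If every attempt times out, the exception propagates, so inside the for loop over URLs you could catch it and move on to the next link.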

Nested for-while loop stops iterating after the first run
