Question

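I know this question has been asked a billion times, and that what I'm trying to do isn't what Selenium is intended for, but I don't know of any other way to achieve it. I've read the answers and a lot of documentation to the best of my ability, but I could use some pointers.

I'm trying to download a large number of files from CDC Compressed Mortality, which requires you to 1) press 'I Agree', 2) navigate a bunch of menus, checkboxes and drop-down boxes, and 3) press 'Send' and wait for the file to start downloading automatically.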

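The web page has some very troublesome limitations, which led me to look for a way to automate this.

Datasets exported with the 'Send' button are inconsistent with some of the settings and omit data points, i.e. in some cases the generated file does not reflect the settings for suppressed/omitted values
The page limits the number of data rows

I found that by exporting the data state by state, the two points above are no longer a problem, but this is super labor-intensive and not much fun. I should note that I have no experience with Python (or with real programming), but the documentation seemed good enough for me to get something working. Here's what I want to do: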

Navigate to the page, press 'I Agree'
Select a state
Fill in some options
Click 'Send'
Wait for the file to finish downloading

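Because the Firefox profile is set up to skip the download dialog, the file starts downloading automatically. I can tell whether a file has finished downloading by finding the newest file and waiting until its .part extension disappears.

The code runs until it tries to select 12 Florida; then everything stops. Firefox freezes and no file starts downloading. When I repeat the steps manually, there is no problem.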

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os, unittest, time, re

basedir = os.getcwd()
savedir = os.path.join(basedir, 'download')

# Check download status
def checkdownload():
    os.chdir(savedir)
    files = filter(os.path.isfile, os.listdir(os.getcwd()))
    files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
    files.sort(key=lambda x: os.path.getmtime(x))
    if not files :
        newest_file = "no"
    else :
        newest_file = files[-1]
    os.chdir(basedir)
    return newest_file



# Set user profile
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",basedir+'\\download')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/plain")

# Before anything downloads
previousnew = checkdownload()

# Create a new instance of the Firefox driver
b = webdriver.Firefox(firefox_profile=fp)
b.get("http://wonder.cdc.gov/cmf-icd9.html")
b.implicitly_wait(1)

### Find states
b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'

# print [o.text for o in Select(b.find_element_by_id("SD16.V1")).options]

# Make a list of all the states available
options = Select(b.find_element_by_id("codes-D16.V9")).options
optionsList = []

for option in options: 
    optionsList.append(option.get_attribute("value"))
    if option.get_attribute("value") == "*All*":
        optionsList.remove(option.get_attribute("value")) # Remove the *All* option


# Loop over states individually
for optionValue in optionsList:
    print "\nRunning on %s" % optionValue

    b.get("http://wonder.cdc.gov/cmf-icd9.html")
    b.implicitly_wait(1)

    b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'

    print "Add Selections"

    # 1. Table layout, id = SB_1 ... SB_5
    Select(b.find_element_by_id("SB_1")).select_by_visible_text("Age Group")
    Select(b.find_element_by_id("SB_2")).select_by_visible_text("Race")
    Select(b.find_element_by_id("SB_3")).select_by_visible_text("Gender")
    Select(b.find_element_by_id("SB_4")).select_by_visible_text("County")
    Select(b.find_element_by_id("SB_5")).select_by_visible_text("Year")

    # 2. Location, id = codes-D16.V9
    Select(b.find_element_by_id("codes-D16.V9")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("codes-D16.V9")).select_by_value(optionValue) # selection

    # Age Group, id = SD16.V5
    Select(b.find_element_by_id("SD16.V5")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("SD16.V5")).select_by_value('20-24')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('25-34')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('35-44')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('45-54')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('55-64')

    # Gender, id = SD16.V7
    # Race, id = SD16.V8
    # Hisp, Does not exist in this file

    # Year, id = SD16.V1
    yr = 1997, 1998
    Select(b.find_element_by_id("SD16.V1")).deselect_by_index(0) # remove *All* option
    select = Select(b.find_element_by_id("SD16.V1"))
    for o in yr:
        select.select_by_value("%s" % o)

    # ICD-9 Codes, id = codes-D16.V2
    # Rate per, id = SO_rate_per

    # Other options
    b.find_element_by_id("export-option").click()
    b.find_element_by_id("CO_show_totals").click()
    b.find_element_by_id("CO_show_zeros").click()
    b.find_element_by_id("CO_show_suppressed").click()

    # Submit
    print "Submit"
    b.find_element_by_xpath("/html/body/div/form/table/tbody/tr/td/div[2]/div[2]/center/input[1]").click()

    # Check if file has begun downloading
    print "Waiting for new file"
    new = checkdownload()
    while previousnew == new:
        print "... waiting"
        new = checkdownload()
        continue

    print "Waiting for download to finish"
    # New file found, wait until it doesn't have .part extension
    new = checkdownload()
    while os.path.splitext(new)[1] == ".part":
        print "... downloading"
        new = checkdownload()
        continue

    print "Downloaded"

    continue


b.quit()

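I can't figure out why this happens, since no error is produced. Any ideas about what I'm doing wrong?

P.S. I realize my code is horrible, and that an honest answer might be 'everything you're doing is wrong'. Still, I really don't understand why this simple script behaves like this.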

Answer 1

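I ran your code. The first time, it failed because of the '\\' used as a hardcoded path separator, but I assume you're on Windows.

With that fixed, it failed the second time because of a race condition, which is probably your actual problem. Look at these lines: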
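A portable alternative to the hardcoded separator, as a minimal sketch: `os.path.join` picks the separator for the current platform, so the same value works on Windows and elsewhere. The helper name `download_dir` is made up for illustration; `browser.download.dir` is the preference already set in the question.

```python
import os


def download_dir(base):
    """Return the absolute 'download' folder under base, using the platform separator."""
    return os.path.join(os.path.abspath(base), "download")


savedir = download_dir(os.getcwd())
# In the question's script this value would replace basedir + '\\download':
# fp.set_preference("browser.download.dir", savedir)
```

The same `savedir` can then be reused by the code that polls the folder, so the browser and the script always agree on the path.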

os.chdir(savedir)
files = filter(os.path.isfile, os.listdir(os.getcwd()))
files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))

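You're working with files that you know will "disappear" (the .part files you're checking for). If that happens between listdir and getmtime, getmtime raises an exception because the file no longer exists, and the script exits without closing Firefox (so it appears to "hang"). This is probably more likely because the files are small and download quickly.

When you perform an operation on a file that can fail if the file has been removed, you need a try/except block. Checking for existence first does not help, because the file can vanish or be renamed right after the check. However, this may force you to use a loop instead of a nice list comprehension and sort.

Here's a possible implementation of the function: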
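One way to keep an unexpected exception from leaving Firefox open is to wrap the browser session in try/finally. This is only a sketch: `run_with_cleanup`, `make_driver` and `work` are hypothetical names, not part of Selenium. The only thing assumed about the driver is that it has a `quit()` method, as the `b.quit()` call in the question shows.

```python
def run_with_cleanup(make_driver, work):
    """Create a driver, run work(driver), and always call driver.quit()."""
    driver = make_driver()
    try:
        return work(driver)
    finally:
        # Runs on success and on any exception, so the browser never lingers.
        driver.quit()


# Hypothetical usage with the question's profile `fp` and a function
# `scrape_all_states(b)` holding the per-state loop:
# run_with_cleanup(lambda: webdriver.Firefox(firefox_profile=fp), scrape_all_states)
```

With this in place the exception still propagates and prints its traceback, but the browser window is closed first instead of being left behind.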

def checkdownload():
    max_mtime = 0
    newest_file = ""
    for filename in os.listdir(savedir):
        path = os.path.join(savedir, filename)
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            continue  # File was probably just moved or deleted
        # Test the full path: a bare name would be resolved against the cwd
        if os.path.isfile(path) and mtime > max_mtime:
            newest_file = path
            max_mtime = mtime
    return newest_file

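chdir is neither necessary nor a good idea; just refer to the directory you're working with.
Since you only want the most recently modified file, there is no need to sort the whole list.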
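The waiting loops in the question poll as fast as they can and never give up. A gentler variant, sketched here under assumptions, compares the folder contents before and after clicking 'Send', sleeps between polls, and stops after a timeout. The names `list_finished` and `wait_for_download` are made up for this example, and it assumes that Firefox marks in-progress downloads with a `.part` suffix, as described in the question.

```python
import os
import time


def list_finished(directory):
    """Return names in directory that are not .part files and have no .part sibling."""
    try:
        names = set(os.listdir(directory))
    except OSError:
        return set()  # Directory missing or unreadable: treat as empty
    return set(n for n in names
               if not n.endswith(".part") and (n + ".part") not in names)


def wait_for_download(directory, before, timeout=60.0, poll=0.5):
    """Wait for a finished file that is not in `before`; return its path or None."""
    deadline = time.time() + timeout
    while True:
        new_names = list_finished(directory) - before
        if new_names:
            return os.path.join(directory, sorted(new_names)[0])
        if time.time() >= deadline:
            return None  # Give up instead of looping forever
        time.sleep(poll)
```

In the question's loop this would mean calling `before = list_finished(savedir)` just before the submit click and `wait_for_download(savedir, before)` right after it. A `None` result signals that nothing arrived in time, which the script can report instead of hanging.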

Selenium: navigating a form and waiting for a file to finish downloading
