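I know this question has been asked a billion times, and that what I am trying to do is not what Selenium is intended for, but I don't know of any other way to accomplish it. I have read the answers and a lot of documentation to the best of my ability, but I could use some pointers.
I am trying to download a large number of files from CDC Compressed Mortality, which requires me to 1) press 'I Agree', 2) navigate a bunch of menus, checkboxes and drop-downs, and 3) press 'Send' and wait for the file to start downloading automatically.
The web page has some very troublesome limitations, which is what led me to look for a way to automate this.
I found that exporting the data one state at a time gets around these limitations, but doing so by hand is extremely labor-intensive and not much fun. I should note that I have no experience with Python (or actual programming), but the documentation seemed clear enough for me to get something partly working. This is what I am trying to do: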
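Because the Firefox profile is set up to skip the download dialog, the file starts downloading automatically. I can tell whether a file has finished downloading by finding the newest file and waiting until its .part extension disappears.
The code runs until it tries to select 12 Florida, and then everything stops. Firefox freezes and no file starts downloading. Repeating the same steps manually works without any problem.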
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os, unittest, time, re
basedir = os.getcwd()
savedir = os.path.join(basedir, 'download')
# Check download status
def checkdownload():
    os.chdir(savedir)
    files = filter(os.path.isfile, os.listdir(os.getcwd()))
    files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
    files.sort(key=lambda x: os.path.getmtime(x))
    if not files :
        newest_file = "no"
    else :
        newest_file = files[-1]
    os.chdir(basedir)
    return newest_file
# Set user profile
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",basedir+'\\download')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/plain")
# Before anything downloads
previousnew = checkdownload()
# Create a new instance of the Firefox driver
b = webdriver.Firefox(firefox_profile=fp)
b.get("http://wonder.cdc.gov/cmf-icd9.html")
b.implicitly_wait(1)
### Find states
b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'
# print [o.text for o in Select(b.find_element_by_id("SD16.V1")).options]
# Make a list of all the states available
options = Select(b.find_element_by_id("codes-D16.V9")).options
optionsList = []
for option in options:
    optionsList.append(option.get_attribute("value"))
    if option.get_attribute("value") == "*All*":
        optionsList.remove(option.get_attribute("value")) # Remove the *All* option
# Loop over states individually
for optionValue in optionsList:
    print "\nRunning on %s" % optionValue
    b.get("http://wonder.cdc.gov/cmf-icd9.html")
    b.implicitly_wait(1)
    b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'
    print "Add Selections"
    # 1. Table layout, id = SB_1 ... SB_5
    Select(b.find_element_by_id("SB_1")).select_by_visible_text("Age Group")
    Select(b.find_element_by_id("SB_2")).select_by_visible_text("Race")
    Select(b.find_element_by_id("SB_3")).select_by_visible_text("Gender")
    Select(b.find_element_by_id("SB_4")).select_by_visible_text("County")
    Select(b.find_element_by_id("SB_5")).select_by_visible_text("Year")
    # 2. Location, id = codes-D16.V9
    Select(b.find_element_by_id("codes-D16.V9")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("codes-D16.V9")).select_by_value(optionValue) # selection
    # Age Group, id = SD16.V5
    Select(b.find_element_by_id("SD16.V5")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("SD16.V5")).select_by_value('20-24')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('25-34')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('35-44')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('45-54')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('55-64')
    # Gender, id = SD16.V7
    # Race, id = SD16.V8
    # Hisp, Does not exist in this file
    # Year, id = SD16.V1
    yr = 1997, 1998
    Select(b.find_element_by_id("SD16.V1")).deselect_by_index(0) # remove *All* option
    select = Select(b.find_element_by_id("SD16.V1"))
    for o in yr:
        select.select_by_value("%s" % o)
    # ICD-9 Codes, id = codes-D16.V2
    # Rate per, id = SO_rate_per
    # Other options
    b.find_element_by_id("export-option").click()
    b.find_element_by_id("CO_show_totals").click()
    b.find_element_by_id("CO_show_zeros").click()
    b.find_element_by_id("CO_show_suppressed").click()
    # Submit
    print "Submit"
    b.find_element_by_xpath("/html/body/div/form/table/tbody/tr/td/div[2]/div[2]/center/input[1]").click()
    # Check if file has begun downloading
    print "Waiting for new file"
    new = checkdownload()
    while previousnew == new:
        print "... waiting"
        new = checkdownload()
        continue
    print "Waiting for download to finish"
    # New file found, wait until it doesn't have .part extension
    new = checkdownload()
    while os.path.splitext(new)[1] == ".part":
        print "... downloading"
        new = checkdownload()
        continue
    print "Downloaded"
    continue
b.quit()
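I can't work out why this happens, since no error is produced. Any ideas about what I am doing wrong?
PS. I realize my code is awful, and that an honest answer might be "what aren't you doing wrong". But I really don't see why this simple script behaves like this.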
Answer 0 (score: 0)
I ran your code. The first time, it failed because '\\' is used as a hard-coded path separator, but I assume you are on Windows.
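As a minimal sketch of one way to avoid the hard-coded separator, the download directory can be built with os.path.join, which the script already does for savedir. The helper name download_dir is made up for illustration and is not part of the original code.

```python
import os

def download_dir(base):
    """Return the 'download' folder under base, using the platform's separator."""
    return os.path.join(base, "download")

basedir = os.getcwd()
savedir = download_dir(basedir)

# In the original script this value would take the place of basedir+'\\download':
# fp.set_preference("browser.download.dir", savedir)
print(savedir)
```

Passing the same savedir value to both the Firefox profile and checkdownload keeps the two from disagreeing about where the files land.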
With that fixed, it failed a second time because of a race condition, which is probably your actual problem. Look at these lines:
os.chdir(savedir)
files = filter(os.path.isfile, os.listdir(os.getcwd()))
files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
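You are working with files that you know are going to "disappear" (the .part files you are checking for). If that happens between listdir and getmtime, getmtime raises an exception because the file no longer exists, and the script exits without closing Firefox (so it appears to "hang"). This is probably more likely when the files are small and download quickly.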
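The sequence can be reproduced without a browser. The sketch below assumes the browser renames or removes the .part file when the download completes, and simulates that step with os.rename in a temporary folder; the file names are invented for the demonstration.

```python
import os
import tempfile

# Simulate a download folder containing one in-progress file.
folder = tempfile.mkdtemp()
part = os.path.join(folder, "data.txt.part")
open(part, "w").close()

# Step 1: the script lists the folder and sees the .part file.
listed = [os.path.join(folder, f) for f in os.listdir(folder)]

# Step 2: meanwhile the download finishes and the file is renamed.
os.rename(part, os.path.join(folder, "data.txt"))

# Step 3: the script asks for the mtime of a path that no longer exists.
try:
    os.path.getmtime(listed[0])
    raised = False
except OSError as exc:
    raised = True
    print("getmtime failed: %s" % exc)
```

In the real script nothing catches this exception, so the interpreter stops while the Firefox window stays open.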
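When you operate on files that may be deleted out from under you, you need a try/except block: checking for existence first does not help, because the file can still disappear or be renamed immediately after the check. This may, however, mean using a loop instead of the nice list comprehension and sort.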
Here is a possible implementation of the function:
def checkdownload():
    max_mtime = 0
    newest_file = ""
    for filename in os.listdir(savedir):
        path = os.path.join(savedir, filename)
        try:
            # Test the full path: a bare filename would be resolved
            # against the current directory rather than savedir
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            if mtime > max_mtime:
                newest_file = path
                max_mtime = mtime
        except OSError:
            pass # File probably just moved/deleted
    return newest_file
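chdir is neither necessary nor a good idea; just refer to the directory you are working with.
Since you only want the most recently updated file, there is no need to sort the whole list.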
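These two points can also be combined in a more compact variant. The following is only a sketch: safe_mtime and newest_file are illustrative names, not part of the original script. It uses max with a key function to pick the newest file in one pass, without chdir and without sorting.

```python
import os

def safe_mtime(path):
    """Return the modification time, or -1 if path is not a regular file or has vanished."""
    try:
        if not os.path.isfile(path):
            return -1
        return os.path.getmtime(path)
    except OSError:
        return -1  # File was moved or deleted between the two calls

def newest_file(directory):
    """Return the most recently modified regular file in directory, or "" if there is none."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    if not paths:
        return ""
    best = max(paths, key=safe_mtime)
    # If even the best candidate has vanished (or is a directory), report nothing.
    return best if safe_mtime(best) >= 0 else ""
```

The function takes the directory as an argument, so it could be called as newest_file(savedir) wherever the script currently calls checkdownload().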