Selenium automatic PDF download not working

Time: 2015-05-26 07:39:11

Tags: python selenium selenium-webdriver web-scraping web-crawler

I am new to Selenium and I am writing a scraper to automatically download PDF files from a given site.

Here is my code:

from selenium import webdriver

fp = webdriver.FirefoxProfile()

fp.set_preference("browser.download.folderList",2);
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf");
webobj = browser.find_element_by_id("download").click();

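I followed the steps mentioned in the Selenium documentation and in this link. I am not sure why the download dialog is still shown every time.

Is there any way to fix this? Failing that, is there a way to specify something like "application/all" so that every file type gets downloaded (as a workaround)?

3 answers:

Answer 0: (score: 7)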

Disable the built-in pdfjs plugin and navigate to the URL. The PDF file will then be downloaded automatically. The code:

from selenium import webdriver

fp = webdriver.FirefoxProfile()

fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")

fp.set_preference("pdfjs.disabled", True)  # < KEY PART HERE (boolean, not the string "true")

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/PUBLICATIONS/DM/MADHURAI/2015/05/26/PagePrint//26_05_2015_001_b2b69fda315301809dda359a6d3d9689.pdf")

Update (the complete code that worked for me):

from selenium import webdriver

mime_types = "application/pdf,application/vnd.adobe.xfdf,application/vnd.fdf,application/vnd.adobe.xdp+xml"

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/aafanasiev/Downloads")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", mime_types)
fp.set_preference("plugin.disable_full_page_plugin_for_types", mime_types)
fp.set_preference("pdfjs.disabled", True)

browser = webdriver.Firefox(firefox_profile=fp)
browser.get("http://epaper.dinamalar.com/")

webobj_get_link = browser.find_element_by_id("liSavePdf")
webobj_get_object = webobj_get_link.find_element_by_tag_name("a")
webobj_get_object.click()
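
The preferences above can also be collected in a plain dictionary, which makes it easier to extend the MIME type list. As far as I know, `browser.helperApps.neverAsk.saveToDisk` takes a comma-separated list of explicit MIME types and has no "application/all" wildcard, so every type has to be listed. The following is only a sketch: the helper names `build_download_prefs` and `make_driver` are made up for this illustration, and `make_driver` assumes a newer Selenium release (4.x), where preferences are set through `Options` instead of `FirefoxProfile`.

```python
def build_download_prefs(download_dir, mime_types):
    """Return the Firefox download preferences from the answer as a dict."""
    joined = ",".join(mime_types)
    return {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.manager.showWhenStarting": False,
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": joined,
        "plugin.disable_full_page_plugin_for_types": joined,
        "pdfjs.disabled": True,  # must be a boolean, not the string "true"
    }


def make_driver(prefs):
    """Start Firefox with the given preferences (assumes Selenium 4.x)."""
    # Imported here so the dict helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in prefs.items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Under these assumptions, `make_driver(build_download_prefs("/home/jill/Downloads/Dinamalar", ["application/pdf", "application/x-pdf"]))` would take the place of `webdriver.Firefox(firefox_profile=fp)`. In Selenium 4 the element lookup would be written `browser.find_element(By.ID, "liSavePdf")`, with `By` imported from `selenium.webdriver.common.by`.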

Answer 1: (score: 0)

Since no HTML code is available, my guess is that this line

webobj = browser.find_element_by_id("download").click();

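does trigger the onclick event, but you are not handling it properly. In other words, what you are missing is the location where this .pdf file should be stored. I have very little experience with Python programming, but one solution might be to use an HTTP web client library, which would let you download the file automatically. Something like C#'s WebClient.DownloadFile Method (String, String). Used properly, it lets you skip the Selenium commands for this step entirely.

Something like this post might be a good start.

Answer 2: (score: 0)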
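
In Python, the idea of fetching the file directly without a browser could look roughly like the sketch below. It uses only the standard library (`urllib.request` and `shutil`). The function name `download_file` is made up for this illustration, and the sketch assumes the PDF URL is reachable without cookies or a logged-in session, which may not hold for the site in the question.

```python
import shutil
import urllib.request


def download_file(url, destination):
    """Stream the resource at `url` into the local file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so large PDFs are not held in memory at once.
        shutil.copyfileobj(response, out)
    return destination
```

A call would then pass the direct PDF URL and a target path, for example `download_file(pdf_url, "/home/jill/Downloads/Dinamalar/page_001.pdf")`, where `pdf_url` and the file name are placeholders.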

I tested the following code, and it successfully downloaded your PDF on Windows 7:

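from selenium import webdriver

# Placeholder: the answer does not show how download_location was defined.
download_location = "/path/to/downloads"

fp = webdriver.FirefoxProfile()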
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", download_location)
fp.set_preference("plugin.disable_full_page_plugin_for_types", "application/pdf")
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

driver = webdriver.Firefox(fp)
driver.implicitly_wait(10)
driver.maximize_window()
driver.get("http://epaper.dinamalar.com/")
element = driver.find_element_by_css_selector("li#liSavePdf>a>img")
element.click()
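
Note that `implicitly_wait` only applies to element lookups, not to the download itself. To check that the file actually arrived, one option is to poll the download directory after the click. The sketch below is not part of the tested code above. The function name `wait_for_download` is made up, and it assumes that Firefox keeps a temporary `.part` file next to the target while a download is still running.

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the PDF paths in `directory` once no `.part` file remains.

    Raises TimeoutError if no finished PDF shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        # Assumption: an unfinished Firefox download leaves a *.part file.
        in_progress = [n for n in names if n.endswith(".part")]
        finished = sorted(n for n in names if n.lower().endswith(".pdf"))
        if finished and not in_progress:
            return [os.path.join(directory, n) for n in finished]
        if time.monotonic() >= deadline:
            raise TimeoutError("no finished PDF in %s after %s s" % (directory, timeout))
        time.sleep(poll_interval)
```

It would be called right after `element.click()`, for example `wait_for_download(download_location)`, before the browser is closed.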