How do I wait out a waiting page and then download a PDF with Python?

Asked: 2016-06-21 19:38:44

Tags: python http selenium pdf download

Question

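I'm trying to download PDF files from a website run on a crotchety old host, which has implemented a waiting page to cope with traffic to the site. The waiting page renders, you spend a few seconds looking at it instead of the PDF you want, and then it goes away and you land where you wanted to go.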

This is my scenario:

  1. I go to the page.
  2. Perhaps 33% of the time, I get the waiting page. This is the waiting page's code:

    <div id="wrapper">
        <p><hr /></p>
        </p>
            <div id="waiting-main">
                <p style="text-align: center; margin: 6px 0 15px 0;"><img src="/ns_images2/doblogo_1.jpg" border="0" />
                </p>
                <p style="text-align: center; font-size: 30px; line-height: 34px;">Just a moment</p>
                <p style="text-align: left; color: #525252; font-size: 20px; line-height: 22px;">
                Your request is being processed.</br></br>
    
                Due to the high demand it may take a little longer. You will be directed to the page shortly. Please do not leave this page. Refreshing the page will delay the response time. We apologize for the delay.</br></br>
    
                ...[snipped for brevity]...
    
                </p>
    
            </div>
    
        </div>
    
    </body></html>
    
  3. The waiting page goes away and I am served the following HTML:

    <html><body marginwidth="0" marginheight="0" style="background-color: rgb(38,38,38)"><embed width="100%" height="100%" name="plugin" src="http://a810-bisweb.nyc.gov/bisweb/CofoDocumentContentServlet?passjobnumber=null&amp;cofomatadata1=cofo&amp;cofomatadata2=M&amp;cofomatadata3=000&amp;cofomatadata4=092000&amp;cofomatadata5=M000092531.PDF&amp;requestid=5" type="application/pdf"><div id="annotationContainer"><style>#annotationContainer {    overflow: hidden;     position: absolute;     pointer-events: none;     top: 0;     left: 0;     right: 0;     bottom: 0;     display: -webkit-box;     -webkit-box-align: center;     -webkit-box-pack: center; } .annotation {     position: absolute;     pointer-events: auto; } textarea.annotation {     resize: none; } input.annotation[type='password'] {     position: static;     width: 200px;     margin-top: 100px; } </style></div></body></html>

  4. I download the PDF document locally. The end!

My attempted solution

Not knowing that selenium doesn't really support PDFs (or does it?), this was my approach:

        _driver = webdriver.PhantomJS()
        
        ... 
        req_str = ...[a very long URL]...
        _driver.get(req_str)
        ...
        
        try:
            WebDriverWait(_driver, 10).until(
                # Cannot use:
                # lambda a: not a.presence_of_element_located((By.ID, "waiting-main"))
                # Because:
                # https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/
                # Which suggests this working alternative.
                lambda s: len(s.find_elements(By.ID, "waiting-main")) == 0
            )
        finally:
            _driver.save_screenshot("test.png") # Maybe?
            # How do I get the actual PDF code? :/
        

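Question

I don't see a way to do this with selenium. So my question is:

How do I load a page, wait out a waiting page, and then download the PDF file that subsequently appears, using Python (2.7)?

Alternatively, if it is possible to do this with selenium, how do I go about it?

Example

The link on this page exemplifies my problem.

Workaround

For now I am using: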

        r = requests.get(req_str)
        while "waiting-main" in r.text:
            time.sleep(5)
            r = requests.get(req_str)
        

No word yet on how well it works...
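
The loop above retries forever if the waiting page never clears. A minimal sketch of a bounded version follows; the function name poll_past_waiting_page and its parameters are hypothetical, and fetch stands in for something like lambda: requests.get(req_str).text. Note that the waiting page itself warns that refreshing delays the response, so repeated requests may count against you.

```python
import time


def poll_past_waiting_page(fetch, marker="waiting-main", delay=5.0,
                           max_attempts=12, sleep=time.sleep):
    """Call fetch() until the returned text no longer contains marker.

    fetch is any zero-argument callable that returns the page text.
    Raises RuntimeError if the marker is still present after
    max_attempts tries.
    """
    for attempt in range(max_attempts):
        text = fetch()
        if marker not in text:
            return text
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(delay)
    raise RuntimeError(
        "still on the waiting page after %d attempts" % max_attempts
    )
```

The sleep parameter is injectable only so the loop can be exercised without real delays; in normal use the default time.sleep applies.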


3 answers:

Answer 0 (score: 1)

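I would ignore the waiting page. Find a specific element that exists on the download page but not on the waiting page, and wait for that. Just make sure you wait long enough that the waiting page will definitely be gone (maybe 30 seconds or more? You may need to experiment to see what works).

From the HTML you provided, it looks like you could wait for the EMBED element. I'd suggest using WebDriverWait with the CSS selector "embed[name='plugin']".

You can find more information on Selenium waits in Python here: http://selenium-python.readthedocs.io/waits.html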
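
The suggestion above might be sketched as follows. This is an illustration rather than tested code for this site: the helper name wait_for_pdf_src is hypothetical, the 30-second default comes from the estimate above, and it assumes the driver actually renders the EMBED element (which depends on how the browser handles PDFs).

```python
EMBED_SELECTOR = "embed[name='plugin']"


def wait_for_pdf_src(driver, timeout=30):
    """Block until the EMBED element exists, then return its src URL.

    WebDriverWait raises TimeoutException if the element does not
    appear within timeout seconds.
    """
    # Imported here so the sketch can be loaded without selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    embed = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, EMBED_SELECTOR))
    )
    return embed.get_attribute("src")
```

The returned URL could then be fetched separately, for example with requests, to obtain the PDF bytes.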

Answer 1 (score: 1)

I was able to get the page source consistently using requests. This gets the pdf link and saves the file:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

# gets the page when you click the pdf link in your browser
post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"
r = requests.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

soup = BeautifulSoup(r.content)
# parse the form key/value pairs
form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}
# post the form data
nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content)

# get the link to the pdf to download
pdf = urljoin(base, soup.select_one("iframe")["src"])

# save pdf to file.
with open("out.pdf","wb") as out:
    out.write(requests.get(pdf).content)

If you do run into the waiting page, you can use selenium to wait until the form is displayed and then pass the page source to bs4:

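from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC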


def wait(dr, x, t):
    element = WebDriverWait(dr, t).until(
        EC.presence_of_all_elements_located((By.XPATH, x))
    )
    return element

dr = webdriver.PhantomJS()
dr.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

wait(dr, "//form[@action='CofoJobDocumentServlet']", 30)

post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"

soup = BeautifulSoup(dr.page_source)

form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}

nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content)

pdf = urljoin(base, soup.select_one("iframe")["src"])

with open("out.pdf","wb") as out:
    out.write(requests.get(pdf).content)
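
Neither snippet checks that the bytes written are really a PDF, so a waiting page served at the last step would be saved as out.pdf without complaint. A minimal guard, assuming the server sends a standard PDF whose header sits at the very start of the body (some files carry a few junk bytes first, which this would reject); save_pdf is a hypothetical helper name:

```python
PDF_MAGIC = b"%PDF-"


def save_pdf(data, path):
    """Write data to path only if it starts with the PDF signature."""
    if not data.startswith(PDF_MAGIC):
        raise ValueError("response does not look like a PDF")
    with open(path, "wb") as out:
        out.write(data)
    return path
```

In the code above this would replace the final write, for example save_pdf(requests.get(pdf).content, "out.pdf").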

Answer 2 (score: 0)

You need to set a download path for PDFs and add the option to always open PDFs externally:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver_path = "path_from_chromedriver"
download_path = "./PdfFolder"
optionsSelenium = Options()
optionsSelenium.add_experimental_option('prefs', {
    "download.default_directory": download_path,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True
})
# executable_path and chrome_options are the older (Selenium 3 era) keyword
# arguments; newer releases use a Service object and options= instead.
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=optionsSelenium)

With these settings, any page that would normally display a PDF just downloads the file and closes the new tab.
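
Because the download happens in the background, the script still has to know when the file is complete. A rough sketch that polls the download folder follows. It assumes the folder already exists and starts out empty, and that Chrome marks in-progress downloads with a .crdownload suffix; wait_for_pdf is a hypothetical helper name.

```python
import os
import time


def wait_for_pdf(directory, timeout=60.0, poll=0.5):
    """Return the path of a finished .pdf in directory.

    Raises RuntimeError if no finished PDF shows up within timeout seconds.
    """
    deadline = time.time() + timeout
    while True:
        names = os.listdir(directory)
        # Files still being written by Chrome carry this suffix.
        partial = [n for n in names if n.endswith(".crdownload")]
        done = sorted(n for n in names if n.lower().endswith(".pdf"))
        if done and not partial:
            return os.path.join(directory, done[0])
        if time.time() >= deadline:
            raise RuntimeError(
                "no finished PDF in %r after %s seconds" % (directory, timeout)
            )
        time.sleep(poll)
```

After navigating to the PDF URL with the driver, wait_for_pdf(download_path) would return the saved file's path.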