Question

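I'm trying to save some PDFs from links via PhantomJS (Selenium). I was working from a script that renders web pages to PDF, and when I run that exact code it works fine.

Now I have a PDF that I want to save from a direct URL. I tried the same script and it didn't work: it only saves a PDF containing a single blank page, nothing else.

My code: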

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args' : args })

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')

try:
    WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except Exception:  # on timeout the wait raises selenium's TimeoutException
    print("I waited for far too long and still couldn't find the view.")

# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])

# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])

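I'm not sure what is going on or why this happens. I'd appreciate some help.

Edit: This is just a test PDF that I was trying to fetch through Selenium. I need to fetch some other PDFs, and that website checks God knows what to decide whether the client is a human or a bot. So Selenium is the only way.

Edit 2: So, this is the website I have been practicing on: this code that turns webpages to pdfs

Select "Cr Rev - Criminal Revision" from the "Case Type" dropdown and enter any number for the case number and the year. Click "Go".

This shows a small table. Click "View" and it should display a table on a full page.

Scroll down to the "Orders" table and you should see "Copy of order". That is the PDF I'm trying to get. I also tried requests, but it didn't work.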

Answer 1

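If you just want to download a PDF that isn't protected by JavaScript or anything else (basically the simple case), I suggest using the requests library instead.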

import requests
url = 'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)

with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
    f.write(r.content)

# If large file
with requests.get(url, stream=True) as r:
    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
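
When a site refuses the request, the response body is often an HTML error page rather than a PDF, and writing it to disk produces a file that will not open. A minimal sketch of a sanity check before saving follows; the helper names `looks_like_pdf` and `save_pdf_bytes` are made up for illustration and are not part of requests.

```python
from pathlib import Path


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes carry a PDF header near the start."""
    # PDF files begin with "%PDF-"; scan the first 1 KiB to tolerate
    # a few stray bytes before the header.
    return b"%PDF-" in data[:1024]


def save_pdf_bytes(data: bytes, path: str) -> bool:
    """Write data to path only if it looks like a PDF.

    Returns True when the file was written, False when it was rejected.
    """
    if not looks_like_pdf(data):
        return False
    Path(path).write_bytes(data)
    return True
```

With the first snippet in this answer, this would be called as `save_pdf_bytes(r.content, 'The_Scarlet_Letter_T.pdf')`. A `False` result suggests the server sent something other than the PDF.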

Answer 2

I suggest you take a look at the pdfkit library.

import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')

Downloading PDFs with Python is very simple. You also need to download this for the library to work.

You can also try the code from this link, shown below:

#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# use Firefox to get a page with JavaScript-generated content
with closing(Firefox()) as browser:
    browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
    # 'button' is a placeholder name; replace it with the real element's name
    button = browser.find_element(By.NAME, 'button')
    button.click()
    # wait for the new page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element(By.ID, 'someId_that_must_be_on_new_page'))
    # store the page source in a string variable
    page_source = browser.page_source
print(page_source)

You will need to edit it to make it work for your PDF.
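
If the site only serves the PDF to a session that was set up in a real browser, one option is to let Selenium load the page and then reuse its cookies for a plain HTTP download. The sketch below assumes cookies alone are what the site checks, which may not hold; `cookies_for_requests` is a hypothetical helper name, and the sample cookie values are invented.

```python
def cookies_for_requests(selenium_cookies):
    """Convert Selenium's cookie dicts into a plain {name: value} mapping.

    driver.get_cookies() returns a list of dicts that each contain at
    least the keys 'name' and 'value'; the other keys are ignored here.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Invented sample data in the shape returned by driver.get_cookies()
sample = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": "example.com", "path": "/"},
]
cookie_map = cookies_for_requests(sample)
```

The resulting mapping can be passed to requests, for example `requests.get(pdf_url, cookies=cookies_for_requests(driver.get_cookies()))`, where `pdf_url` stands for the address of the file. Sites that also inspect headers or run JavaScript challenges would need more than this.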

Answer 3

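Currently, PhantomJS and headless Chrome do not support downloading files. If you are fine with using the Chrome browser, see the example below. It finds the a elements and adds a download attribute to each one. Finally, it clicks each link to download the file into the default download folder.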

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
pdfLinks = driver.find_elements(By.CSS_SELECTOR, ".entry-content ul > li > a")
for pdfLink in pdfLinks:
    # add a download attribute so Chrome saves the file instead of opening it
    script = "arguments[0].setAttribute('download',arguments[1]);"
    driver.execute_script(script, pdfLink, pdfLink.text)
    time.sleep(1)
    pdfLink.click()
    time.sleep(3)  # crude wait for the download to finish

driver.quit()
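
The example above saves into Chrome's default download folder. A minimal sketch of pointing Chrome at a specific folder through its preferences follows. The function names `build_download_prefs` and `make_chrome` are hypothetical, and the preference keys are Chrome settings whose behaviour may differ between Chrome versions, so treat this as a starting point rather than a guaranteed recipe.

```python
import os


def build_download_prefs(download_dir):
    """Return Chrome preferences that save PDFs into download_dir."""
    return {
        # Chrome expects an absolute path here
        "download.default_directory": os.path.abspath(download_dir),
        # save without showing a "Save as" dialog
        "download.prompt_for_download": False,
        # download PDFs instead of opening them in the built-in viewer
        "plugins.always_open_pdf_externally": True,
    }


def make_chrome(download_dir):
    """Start Chrome with the preferences above applied."""
    # Imported here so build_download_prefs works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```

Calling `driver = make_chrome("pdfs")` in place of `webdriver.Chrome()` would then send downloaded files to a `pdfs` folder under the current working directory.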

PhantomJS (Selenium) cannot load a PDF from a direct URL
