How do I convert a PDF from a URL to images in Python using pdf2image?

Time: 2019-10-29 08:49:35

Tags: python scrape poppler

I can convert a PDF file on my local drive to images with pdf2image's convert_from_path, but when I try the same thing with the PDF at 'https://example.com/abc.pdf', I end up with several errors.

Code:

from urllib.request import urlopen
import pdf2image

url = 'https://example.com/abc.pdf'
scrape = urlopen(url)  # for external files
pil_images = pdf2image.convert_from_bytes(
    scrape.read(), dpi=200,
    output_folder=None, first_page=None, last_page=None,
    thread_count=1, userpw=None, use_cropbox=False, strict=False,
    poppler_path=r"C:\poppler-0.68.0_x86\poppler-0.68.0\bin",
)

Error:

   Unable to get page count. Syntax Error: Document stream is empty

I also followed the linked question below, but had no luck:

Python3: Download PDF to memory and convert first page to image

Screenshot of the authentication prompt: (image not included)

1 Answer:

Answer 0: (score: 1)

First, download the PDF from the URL as described in this blog post: https://dzone.com/articles/simple-examples-of-downloading-files-using-python
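The download step might look like the following minimal sketch, which uses only the standard library. The helper names `looks_like_pdf` and `download_pdf` are hypothetical, and the `%PDF-` check is only a quick sanity test, not full validation. It is included because an empty or non-PDF response body (for example an HTML login page) is one plausible cause of the "Document stream is empty" error in the question.

```python
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    return b"%PDF-" in data[:1024]


def download_pdf(url: str, dest_path: str) -> str:
    """Download url to dest_path, refusing to save a body that is not a PDF."""
    with urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        # An empty body or an HTML login page ends up here.
        raise ValueError("response body does not look like a PDF")
    with open(dest_path, "wb") as out:
        out.write(data)
    return dest_path
```

Calling `download_pdf('https://example.com/abc.pdf', 'abc.pdf')` would then either write the file or fail early with a clear message, rather than handing an unusable stream to the converter.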

Then, if the PDF has multiple pages, use this to convert it to images (or to any other output format Ghostscript supports).

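import ghostscript

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    # For a multi-page PDF, put a page-number pattern such as %03d in
    # jpeg_output_path so that each page is written to its own file.
    args = ["pdf2jpeg",  # actual value doesn't matter
            "-dNOPAUSE",
            "-dBATCH",  # exit after the last page instead of waiting for input
            "-sDEVICE=jpeg",
            "-r144",  # resolution in DPI
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]
    ghostscript.Ghostscript(*args)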

Reference: Converting a PDF to a series of images with Python

For authentication, try this:

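import os
import requests

from urllib.parse import urlparse  # Python 3; in Python 2 this was "from urlparse import urlparse"

username = 'foo'
password = 'sekret'

url = 'http://example.com/blueberry/download/somefile.jpg'
filename = os.path.basename(urlparse(url).path)

# Send HTTP Basic credentials along with the request
r = requests.get(url, auth=(username, password))

if r.status_code == 200:
    with open(filename, 'wb') as out:
        # Write the body in chunks rather than one byte at a time
        for bits in r.iter_content(chunk_size=8192):
            out.write(bits)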

Reference: Download a file providing username and password using Python
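To stay with pdf2image rather than Ghostscript, the authenticated download and the conversion can be combined in memory. This is a rough sketch under two assumptions: that the server uses HTTP Basic authentication, and that poppler is on the PATH. The helper names `basic_auth_header`, `fetch_pdf_bytes` and `pdf_url_to_images` are hypothetical; only `convert_from_bytes` comes from pdf2image.

```python
import base64
from urllib.request import Request, urlopen


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def fetch_pdf_bytes(url: str, username: str, password: str) -> bytes:
    """Download the PDF into memory, sending Basic credentials with the request."""
    request = Request(url, headers={"Authorization": basic_auth_header(username, password)})
    with urlopen(request) as response:
        data = response.read()
    if not data:
        raise ValueError("server returned an empty body")
    return data


def pdf_url_to_images(url: str, username: str, password: str, dpi: int = 200):
    """Fetch a protected PDF and render each page as a PIL image."""
    # Imported here so the download helpers stay usable without pdf2image installed.
    from pdf2image import convert_from_bytes

    return convert_from_bytes(fetch_pdf_bytes(url, username, password), dpi=dpi)
```

On Windows, where poppler is usually not on the PATH, `convert_from_bytes` would also need the `poppler_path` argument used in the question. If the site uses a login form or cookies instead of Basic authentication, this approach will not work as written.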