Question

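I have a PDF file of about 20 to 25 pages. The goal of this tool is to split the PDF into pages (using PyPDF2), save each page to a directory (using PyPDF2), convert the pages to images (using ImageMagick), and then run OCR on them with tesseract (using PIL and PyOCR) to extract data. The tool will eventually get a tkinter GUI, so that users can repeat the same operation several times by clicking a button. During heavy testing I noticed that if the whole process is repeated about 6 to 7 times, the tool/Python script crashes and Windows shows it as "Not Responding". I did some debugging, but unfortunately no error is thrown. Memory and CPU usage both look fine, so that does not seem to be the cause. By observation I narrowed the problem down: the failure happens while PyPDF2 and ImageMagick are running together, before the tesseract part is reached. I was able to reproduce the problem by reducing it to the following Python code: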

from wand.image import Image as Img
from PIL import Image as PIL
import pyocr
import pyocr.builders
import io, sys, os 
from PyPDF2 import PdfFileWriter, PdfFileReader


def splitPDF (pdfPath):
    #Read the PDF file that needs to be parsed.
    pdfNumPages =0
    with open(pdfPath, "rb") as pdfFile:
        inputpdf = PdfFileReader(pdfFile)

        #Iterate on every page of the PDF.
        for i in range(inputpdf.numPages):
            #Create the PDF Writer Object
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("tempPdf%s.pdf" %i, "wb") as outputStream:
                output.write(outputStream)

        #Get the number of pages that have been split.
        pdfNumPages = inputpdf.numPages

    return pdfNumPages

pdfPath = "Test.pdf"
for i in range(1,20):
    print ("Run %s\n--------" %i)
    #Split the PDF into Pages & Get PDF number of pages.
    pdfNumPages = splitPDF (pdfPath)
    print(pdfNumPages)
    for page in range(pdfNumPages):
        #Convert the split pdf page to image to run tesseract on it.
        #(Loop variable renamed so it no longer shadows the outer run counter i.)
        with Img(filename="tempPdf%s.pdf" %page, resolution=300) as pdfImg:
            print("Processing Page %s" %page)

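I used with statements so that files are opened and closed properly, so there should be no memory leak there. I tried running the splitting part and the image conversion part separately, and each works fine on its own. However, when the two are combined, the script fails after 5 to 6 iterations. I wrapped the code in try/except blocks, but no error is caught. I am also using the latest versions of all the libraries. Any help or guidance is appreciated.

Thanks.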

Answer 1

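For future reference: the problem was caused by the 32-bit version of ImageMagick, as mentioned in one of the comments (thanks, emcconville). Uninstalling the 32-bit versions of Python and ImageMagick and installing the 64-bit versions of both fixed the issue. Hope this helps.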
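
If you are not sure which build you are running, a quick check along these lines may help. This is only a minimal sketch using the standard library; the helper names `python_bitness` and `is_64bit_python` are made up for illustration and are not part of PyPDF2 or Wand.

```python
import platform
import struct
import sys


def python_bitness():
    """Return the pointer width of the running interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes:
    # 4 on a 32-bit build, 8 on a 64-bit build.
    return struct.calcsize("P") * 8


def is_64bit_python():
    """Return True if the interpreter is a 64-bit build."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2**32


if __name__ == "__main__":
    print("Python build: %d-bit" % python_bitness())
    print("Reported architecture: %s" % platform.architecture()[0])
    if not is_64bit_python():
        print("This is a 32-bit interpreter; consider installing the "
              "64-bit builds of both Python and ImageMagick.")
```

This only reports on the Python interpreter. For ImageMagick, the version string that Wand exposes as `wand.version.MAGICK_VERSION` may indicate the architecture on Windows builds, but I have not verified that for every release, so treat it as a hint and check the installer you used.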

Using ImageMagick and PyPDF2 together causes Python to crash
