Merge all PDF files present in a directory

Question

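Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs (my report generation always creates an extra blank page)?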

Answer 1

The newer PyPDF2 library has a PdfFileMerger class (named PdfMerger in later releases), which can be used like this.

Example:

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

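The append method seems to require a lazy file object. That is, it does not read the file right away; it seems to wait until the write method is called. If you open the inputs in a scoped block (i.e. with), blank pages are appended to the resulting file, because the input files are already closed by the time write runs.

If file handle lifetime is an issue, the easiest way to avoid it is to pass file name strings to append and let it manage the file lifetime.

i.e.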

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
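
If you do need to pass file objects rather than names, one option is to keep every input open until the write has finished. The following is only a sketch under assumptions: append_all_then_write is a hypothetical helper name, and merger is assumed to be any object with append(fileobj) and write(fileobj) methods, such as the PdfFileMerger above. It uses contextlib.ExitStack from the standard library.

```python
from contextlib import ExitStack


def append_all_then_write(merger, pdf_paths, out_path):
    """Append every input as an open file object, write, then close them all.

    `merger` is assumed to expose `append(fileobj)` and `write(fileobj)`,
    like the PdfFileMerger used in this answer (hypothetical helper).
    """
    with ExitStack() as stack:
        for path in pdf_paths:
            # enter_context keeps each handle open until the outer `with` exits
            merger.append(stack.enter_context(open(path, "rb")))
        with open(out_path, "wb") as fout:
            merger.write(fout)  # the inputs are still open at this point
    # every input handle has been closed here, after the write finished
```

Usage would look like append_all_then_write(PdfFileMerger(), pdfs, 'result.pdf').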

You may also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.

Answer 2

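Use pyPdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:
  * splitting documents page by page,
  * merging documents page by page,

(and much more)

Here's a sample program that works with both versions.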

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
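    # On Python 3, bytes must go to sys.stdout.buffer; Python 2 has no .buffer
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))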

Answer 3

Is it possible, using Python, to merge separate PDF files?

Yes.

The following example merges all files in one folder into a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

    for pdffile in glob(path + os.sep + '*.pdf'):
        # compare absolute paths: glob returns e.g. "./merged.pdf"
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

Answer 4

Merge all pdf files that are present in a dir

Put the pdf files in a dir. Launch the program. You get one pdf with all the pdfs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)
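
Two things to watch with the listing step: os.listdir returns names in arbitrary order, and on a second run result.pdf itself ends in ".pdf" and would be picked up as an input. A minimal sketch of a more defensive listing, assuming you want a case-insensitive alphabetical order (list_pdfs and its exclude parameter are hypothetical names, not part of PyPDF2):

```python
import os


def list_pdfs(directory=".", exclude=("result.pdf",)):
    """Return PDF file names in `directory`, sorted, skipping `exclude` names."""
    skip = {name.lower() for name in exclude}
    names = [
        name for name in os.listdir(directory)
        if name.lower().endswith(".pdf")       # also matches ".PDF"
        and name.lower() not in skip           # don't merge a previous output
        and os.path.isfile(os.path.join(directory, name))
    ]
    return sorted(names, key=str.lower)        # deterministic, case-insensitive
```

The result can take the place of the list x above.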

Answer 5

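The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.

The relevant part of the concatenation script, assuming inputs is a list of input filenames and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])

Disclaimer: I am the primary pdfrw author.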

Answer 6

A solution is given here: http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/

Similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # copy every page of the reader into the writer
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()

# open() replaces the Python 2-only file() builtin
append_pdf(PdfFileReader(open("C:\\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open("c:\\sample3.pdf", "rb")), output)

with open("c:\\combined.pdf", "wb") as f:
    output.write(f)

Answer 7


Git repo: https://github.com/mahaguru24/Python_Merge_PDF.git

Answer 8

A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

Answer 9

You can also use Aspose.PDF Cloud SDK for Python. Here is a quick example:

# upload PDF files to Aspose cloud storage
storageApi.PutCreate(file1, None, None, path1)
storageApi.PutCreate(file2, None, None, path2)

# merge files into one PDF
pdfApi.PutMergeDocuments(name, None, None, mergeDocumentsBody)

# download merged PDF from storage server
storageApi.GetDownload(name)

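The biggest benefit is that the API offers many other possibilities for managing your PDFs. You can modify, convert, and encrypt files, and work with pages, text, shapes, and other elements.

Note: I am a developer evangelist at Aspose.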

Answer 10

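I used pdfunite from the Linux terminal via subprocess (this assumes one.pdf and two.pdf exist in the directory); the aim is to merge them into three.pdf:

import subprocess
# pass the arguments as a list so that no shell is needed
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])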
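
Note that subprocess.call only returns pdfunite's exit status, so a failed merge can go unnoticed unless you check it. A minimal sketch that builds the argument list separately and raises on failure; build_pdfunite_cmd and run_pdfunite are hypothetical helper names, and pdfunite is assumed to be installed and on PATH:

```python
import shutil
import subprocess


def build_pdfunite_cmd(inputs, output):
    """Return the pdfunite argument list: input files first, output file last."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["pdfunite", *inputs, output]


def run_pdfunite(inputs, output):
    """Run pdfunite; raise if it is missing or exits with an error status."""
    if shutil.which("pdfunite") is None:
        raise FileNotFoundError("pdfunite not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError
    subprocess.run(build_pdfunite_cmd(inputs, output), check=True)
```

Passing the arguments as a list also means file names containing spaces need no quoting.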

Answer 11

Giovanni G. PY's answer, in an easy-to-use form (at least for me):

import os
from PyPDF2 import PdfFileMerger

def merge_pdfs(export_dir, input_dir, folder):
    current_dir = os.path.join(input_dir, folder)
    pdfs = os.listdir(current_dir)
    
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(open(os.path.join(current_dir, pdf), 'rb'))

    with open(os.path.join(export_dir, folder + ".pdf"), "wb") as fout:
        merger.write(fout)

export_dir = r"E:\Output"
input_dir = r"E:\Input"
# merge every entry of input_dir; each one is assumed to be a folder of PDFs
for folder in os.listdir(input_dir):
    merge_pdfs(export_dir, input_dir, folder)
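
The code above assumes every entry in input_dir is a folder and every file inside it is a PDF. If that is not guaranteed, the per-folder work can be planned first. This is only a sketch: plan_folder_merges is a hypothetical helper, and it only builds the list of jobs, leaving the merging itself to PdfFileMerger as above.

```python
import os


def plan_folder_merges(input_dir, export_dir):
    """Return (output_path, [input_pdf_paths]) for each subfolder holding PDFs."""
    jobs = []
    for folder in sorted(os.listdir(input_dir)):
        current_dir = os.path.join(input_dir, folder)
        if not os.path.isdir(current_dir):
            continue  # skip loose files sitting next to the folders
        pdfs = sorted(
            os.path.join(current_dir, name)
            for name in os.listdir(current_dir)
            if name.lower().endswith(".pdf")  # ignore non-PDF files
        )
        if pdfs:  # folders without PDFs produce no job
            jobs.append((os.path.join(export_dir, folder + ".pdf"), pdfs))
    return jobs
```

Each job's input paths can then be appended to a fresh merger and written to the job's output path.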

Answer 12

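Here is a time comparison of the most common answers for my specific use case: combining a list of 5 large single-page pdf files. I ran each test twice.

(Disclaimer: I ran this function within Flask, so your mileage may vary)

TL;DR

pdfrw is the fastest library for combining pdfs out of the 3 I tested.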

PyPDF2

start = time.time()
merger = PdfFileMerger()
for pdf in all_pdf_obj:
    merger.append(os.path.join(os.getcwd(), pdf.filename))  # full path
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
merge_file = os.path.join(os.getcwd(), formatted_name)
merger.write(merge_file)
merger.close()
end = time.time()
print(end - start) #1 66.50084733963013 #2 68.2995400428772

PyMuPDF

start = time.time()
result = fitz.open()

for pdf in all_pdf_obj:
    with fitz.open(os.path.join(os.getcwd(), pdf.filename)) as mfile:
        result.insertPDF(mfile)  # named insert_pdf in newer PyMuPDF releases
formatted_name = f'Summary_Invoice_{date.today()}.pdf'

result.save(formatted_name)
end = time.time()
print(end - start) #1 2.7166640758514404 #2 1.694727897644043

pdfrw

start = time.time()
writer = PdfWriter()
for pdf in all_pdf_obj:
    writer.addpages(PdfReader(os.path.join(os.getcwd(), pdf.filename)).pages)

formatted_name = f'Summary_Invoice_{date.today()}.pdf'
writer.write(formatted_name)
end = time.time()
print(end - start) #1 0.6040127277374268 #2 0.9576816558837891
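
The three snippets repeat the same start/end bookkeeping. A small timing helper could wrap any of them; this is a sketch only, time_runs is a hypothetical name, and it swaps time.time() for time.perf_counter(), which is the standard-library clock intended for measuring short intervals.

```python
import time


def time_runs(func, repeats=2):
    """Call `func` `repeats` times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # clock intended for interval measurement
        func()
        timings.append(time.perf_counter() - start)
    return timings
```

Each merge variant would be wrapped in a function taking no arguments and passed in as func.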
