Question

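I need to split a large scanned PDF, an image-only file containing many documents of different lengths, into separate PDF files.

One way I know to do this is to insert a separator page between the documents before scanning the whole batch. Typically the separator page carries a barcode; when the barcode is detected during reading, a new PDF file is started.

I would prefer to do this in .NET, but I am open to other suggestions. I have looked at a couple of the popular libraries mentioned on this site, iTextSharp and PDFsharp. I could not find any example of a PDF being split into PDFs with varying page counts, only examples with a fixed length.

I am not sure whether these libraries can do it. Does anyone have ideas for alternatives, or know whether this is possible at all?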

Answer 1

I was in the same situation and found a solution offered by ByteScout.

After downloading BarCodeReader.dll, the sample code looks like this:

using System;
using System.IO;
using System.Linq;
using System.Text;
using Bytescout.BarCodeReader;

namespace SplitByBarcode
{
    class Program
    {
        static void Main(string[] args)
        {
            string inputFile = @"abc.pdf";

            Console.WriteLine("Processing file " + inputFile);

            using (Reader reader = new Reader())
            {
                reader.RegistrationName = "demo";
                reader.RegistrationKey = "demo";

                reader.BarcodeTypesToFind.Code128 = true; // EAN-128 is the same as Code 128
                reader.PDFRenderingResolution = 96;

                FoundBarcode[] barcodes = reader.ReadFrom(inputFile);
                Console.WriteLine("Found " + barcodes.Length + " barcodes");

                if (barcodes.Length > 0)
                {
                    StringBuilder pageRanges = new StringBuilder();

                    // Create string containing page ranges to extract in the form "1-4,6-8,10-11,12-"
                    for (int i = 0; i < barcodes.Length; i++)
                    {
                        FoundBarcode barcode = barcodes[i];
                        pageRanges.Append(barcode.Page + 2); // +1 because we skip the page with barcode and another +1 because need 1-based page numbers
                        pageRanges.Append("-");
                        if (i < barcodes.Length - 1)
                        {
                            pageRanges.Append(barcodes[i + 1].Page);
                            pageRanges.Append(",");
                        }
                    }

                    Console.WriteLine("Extracting page ranges " + pageRanges);

                    // Split document 
                    string[] splittedParts = reader.SplitDocument(inputFile, pageRanges.ToString());

                    // Rename parts according to barcode values
                    for (int i = 0; i < splittedParts.Length; i++)
                    {
                        string fileName = barcodes[i].Value + ".pdf";

                        File.Delete(fileName);
                        File.Move(splittedParts[i], fileName);

                        Console.WriteLine("Saved file " + fileName);
                    }
                }
            }

            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}
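The range-building loop above can also be isolated as a small pure function, which makes the edge cases easier to reason about. The following is only a sketch under stated assumptions: the class and method names are hypothetical, separator pages are passed as 1-based page numbers in ascending order (so a 0-based `barcode.Page` would need `+ 1` first), and C# 7 value tuples are available. Like the sample above, it ignores any pages before the first separator, and it skips empty ranges that would arise from two consecutive separator pages.

```csharp
using System;
using System.Collections.Generic;

public static class SeparatorRanges
{
    // separatorPages: 1-based page numbers of the separator sheets, ascending (assumption).
    // totalPages: total number of pages in the scanned PDF.
    // Returns inclusive, 1-based (Start, End) ranges, one per document,
    // with the separator pages themselves excluded.
    public static List<(int Start, int End)> Compute(IReadOnlyList<int> separatorPages, int totalPages)
    {
        var ranges = new List<(int Start, int End)>();
        for (int i = 0; i < separatorPages.Count; i++)
        {
            // The document starts on the page right after the separator.
            int start = separatorPages[i] + 1;
            // It ends right before the next separator, or at the last page of the file.
            int end = i + 1 < separatorPages.Count ? separatorPages[i + 1] - 1 : totalPages;
            // Consecutive separators (or a separator on the last page) give an empty range; skip it.
            if (start <= end)
            {
                ranges.Add((start, end));
            }
        }
        return ranges;
    }
}
```

For example, `SeparatorRanges.Compute(new[] { 1, 5, 6, 9 }, 11)` yields the ranges 2-4, 7-8 and 10-11. Note that when a range is skipped, the resulting parts no longer line up one-to-one with the barcode array, so any renaming by barcode value would need to account for that.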

Hope this helps.

Answer 2

It isn't clear exactly what you want to do, but here is one way to read a file src, select pages 1-10, and create a file dest containing only those pages:

PdfReader reader = new PdfReader(src);
reader.SelectPages("1-10");
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
stamper.Close();

Another approach is to use PdfCopy. Again, create a reader object:

PdfReader reader = new PdfReader(src);

Now you can use this reader object to create different files, where start and end are the page numbers you want to start and end with.

FileStream fs = new FileStream(dest, FileMode.Create);
using (Document document = new Document()) {
    using (PdfCopy copy = new PdfCopy(document, fs)) {
        document.Open();
        // start and end are 1-based page numbers, both inclusive
        for (int i = start; i <= end; i++) {
            copy.AddPage(copy.GetImportedPage(reader, i));
        }
    }
}

All of this is documented in my book, more specifically in chapter 6 (free download).

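Since you can select page ranges, you can split a document of X pages into Y documents with different page counts. Obviously, you have to define the page count of each individual document yourself. Libraries such as iTextSharp and PDFsharp treat each scanned page as an image and do not interpret what is on that page, so introducing a page with a barcode does not make much sense. However, if you add an annotation to the first page of each document (annotations are interactive objects in a PDF, not part of the content you add to a page), iText can split the document based on where it finds such annotations.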
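To make the "X pages into Y documents" idea concrete, here is a minimal sketch that applies the PdfCopy approach once per page range. It assumes iTextSharp 5.x and C# 7 value tuples; the class name, the method name, and the `outputPrefix` naming scheme are hypothetical and not part of iTextSharp. The ranges are assumed to be non-empty, 1-based, inclusive, and within the page count of the source file.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class PdfRangeSplitter
{
    // Writes one output file per inclusive, 1-based page range.
    // outputPrefix is a hypothetical naming scheme: "part" yields part1.pdf, part2.pdf, ...
    public static List<string> Split(string src, IList<(int Start, int End)> ranges, string outputPrefix)
    {
        var written = new List<string>();
        PdfReader reader = new PdfReader(src);
        try
        {
            for (int n = 0; n < ranges.Count; n++)
            {
                string dest = outputPrefix + (n + 1) + ".pdf";
                using (FileStream fs = new FileStream(dest, FileMode.Create))
                using (Document document = new Document())
                using (PdfCopy copy = new PdfCopy(document, fs))
                {
                    document.Open();
                    // Copy every page of this range into the current output file.
                    for (int page = ranges[n].Start; page <= ranges[n].End; page++)
                    {
                        copy.AddPage(copy.GetImportedPage(reader, page));
                    }
                }
                written.Add(dest);
            }
        }
        finally
        {
            // The same reader is reused for all output files and closed once at the end.
            reader.Close();
        }
        return written;
    }
}
```

A call such as `PdfRangeSplitter.Split("scan.pdf", new List<(int Start, int End)> { (1, 4), (5, 5), (6, 12) }, "part")` would then write three files of 4, 1 and 7 pages. Where the ranges come from (fixed knowledge, detected separator pages, or annotations) is left to the caller.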

Splitting a PDF into documents of varying page counts
