Question

如何使用PDFSharp从PDF文档中提取FlateDecoded（如PNG）的图像？

我在PDFSharp示例中找到了评论：

// TODO: You can put the code here that converts vom PDF internal image format to a
// Windows bitmap
// and use GDI+ to save it in PNG format.
// [...]
// Take a look at the file
// PdfSharp.Pdf.Advanced/PdfImage.cs to see how we create the PDF image formats.

有没有人能解决这个问题？

感谢您的回复。

编辑：因为我无法在8小时内回答我自己的问题，所以我就这样做了：

感谢您的快速回复。

我在方法“ExportAsPngImage”中添加了一些代码，但是我没有得到想要的结果。它只是提取了一些图像（png），它们没有正确的颜色而且是扭曲的。

这是我的实际代码：

PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
        byte[] decodedBytes = flate.Decode(bytes);

        System.Drawing.Imaging.PixelFormat pixelFormat;

        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = PixelFormat.Format8bppIndexed;
                break;
            case 24:
                pixelFormat = PixelFormat.Format24bppRgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }

        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        var bmpData = bmp.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.WriteOnly, pixelFormat);
        int length = (int)Math.Ceiling(width * bitsPerComponent / 8.0);
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        using (FileStream fs = new FileStream(@"C:\Export\PdfSharp\" + String.Format("Image{0}.png", count), FileMode.Create, FileAccess.Write))
        {
            bmp.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
        }

这是正确的方法吗？或者我应该选择另一种方式？非常感谢！

Answer 1

要获得Windows BMP，您只需创建一个Bitmap标头，然后将图像数据复制到位图中。 PDF图像是字节对齐的（每个新行从字节边界开始），而Windows BMP是DWORD对齐的（每个新行都在DWORD边界上开始（由于历史原因，DWORD是4个字节））。您可以在过滤器参数中找到Bitmap标头所需的所有信息，也可以进行计算。

调色板是PDF中的另一个FlateEncoded对象。您也将其复制到BMP中。

必须针对多种格式（每像素1位，8 bpp，24 bpp，32 bpp）进行此操作。

Answer 2

我知道这个答案可能要花几年的时间，但也许会对其他人有所帮助。

在我的情况下，由于image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent)似乎未返回正确的值，因此发生了失真。正如Vive la déraison在您的问题下指出的那样，您将获得使用Marshal.Copy的BGR格式。因此在执行Marshal.Copy之后反转字节并旋转位图就可以了。

结果代码如下：

private static void ExportAsPngImage(PdfDictionary image, ref int count)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);

        var canUnfilter = image.Stream.TryUnfilter();
        byte[] decodedBytes;

        if (canUnfilter)
        {
            decodedBytes = image.Stream.Value;
        }
        else
        {
            PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
            decodedBytes = flate.Decode(image.Stream.Value);
        }

        int bitsPerComponent = 0;
        while (decodedBytes.Length - ((width * height) * bitsPerComponent / 8) != 0)
        {
            bitsPerComponent++;
        }

        System.Drawing.Imaging.PixelFormat pixelFormat;
        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
                break;
            case 16:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format16bppArgb1555;
                break;
            case 24:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                break;
            case 32:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppArgb;
                break;
            case 64:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format64bppArgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }

        decodedBytes = decodedBytes.Reverse().ToArray();

        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        int length = (int)Math.Ceiling(width * (bitsPerComponent / 8.0));
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        bmp.RotateFlip(RotateFlipType.Rotate180FlipNone);
        bmp.Save(String.Format("exported_Images\\Image{0}.png", count++), System.Drawing.Imaging.ImageFormat.Png);
    }

代码可能需要优化，但是在我看来，它确实可以正确导出FlateDecoded图像。

Answer 3

PDF可能包含带有蒙版和不同色彩空间选项的图像，这就是为什么简单地解码图像对象在某些情况下可能无法正常工作的原因。

因此，代码还需要在PDF中检查图像蒙版（/ ImageMask）和图像对象的其他属性（以查看图像是否也应使用反转颜色或使用索引颜色）来重新创建与其显示方式类似的图像用PDF格式。请参阅官方PDF Reference中的Image对象，/ ImageMask和/ Decode词典。

不确定PDFSharp是否能够在PDF中查找图像蒙版对象，但iTextSharp能够访问图像蒙版对象（请参阅PdfName.MASK对象类型）。

像PDF Extractor SDK这样的商业工具能够以原始形式提取图像，并且在渲染时可以提取图像。形成。

我为PDF Extractor SDK的制造商ByteScout工作

Answer 4

这是我执行此操作的完整代码。

我正在从PDF中提取UPS运输标签，所以我事先知道格式。如果您提取的图像是未知类型，那么您需要检查bitsPerComponent并相应地处理它。我也只在第一页处理第一张图片。

注意：我正在使用TryUnfilter来'deflate'，它使用了应用的任何过滤器，并为我就地解码数据。无需明确调用'Deflate'。

    var file = @"c:\temp\PackageLabels.pdf";

    var doc = PdfReader.Open(file);
    var page = doc.Pages[0];

    {
        // Get resources dictionary
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        if (resources != null)
        {
            // Get external objects dictionary
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects != null)
            {
                ICollection<PdfItem> items = xObjects.Elements.Values;

                // Iterate references to external objects
                foreach (PdfItem item in items)
                {
                    PdfReference reference = item as PdfReference;
                    if (reference != null)
                    {
                        PdfDictionary xObject = reference.Value as PdfDictionary;
                        // Is external object an image?
                        if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                        {
                            // do something with your image here 
                            // only the first image is handled here
                            var bitmap = ExportImage(xObject);
                            bmp.Save(@"c:\temp\exported.png", System.Drawing.Imaging.ImageFormat.Bmp);
                        }
                    }
                }
            }
        }
    }

使用这些辅助函数

    private static Bitmap ExportImage(PdfDictionary image)
    {
        string filter = image.Elements.GetName("/Filter");
        switch (filter)
        {
            case "/FlateDecode":
                return ExportAsPngImage(image);

            default:
                throw new ApplicationException(filter + " filter not implemented");
        }
    }

    private static Bitmap ExportAsPngImage(PdfDictionary image)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);
        int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);   

        var canUnfilter = image.Stream.TryUnfilter();
        var decoded = image.Stream.Value;

        Bitmap bmp = new Bitmap(width, height, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        Marshal.Copy(decoded, 0, bmpData.Scan0, decoded.Length);
        bmp.UnlockBits(bmpData);

        return bmp;
    }

Answer 5

也许没有直接回答这个问题，但从PDF中提取图像的另一个选择是使用FreeSpire.PDF，它可以轻松地从pdf中提取图像。它以Nuget包https://www.nuget.org/packages/FreeSpire.PDF/的形式提供。它们处理所有图像格式，并可以导出为PNG。他们的示例代码是

using System;
using System.Collections.Generic;
using System.Text;
using System.Drawing;
using Spire.Pdf;

namespace ExtractImagesFromPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            //Instantiate an object of Spire.Pdf.PdfDocument
            PdfDocument doc = new PdfDocument();
            //Load a PDF file 
            doc.LoadFromFile("sample.pdf");
            List<Image> ListImage = new List<Image>();
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Get an object of Spire.Pdf.PdfPageBase
                PdfPageBase page = doc.Pages[i];
                // Extract images from Spire.Pdf.PdfPageBase
                Image[] images = page.ExtractImages();
                if (images != null && images.Length > 0)
                {
                    ListImage.AddRange(images);
                }

            }
            if (ListImage.Count > 0)
            {
                for (int i = 0; i < ListImage.Count; i++)
                {
                    Image image = ListImage[i];
                    image.Save("image" + (i + 1).ToString() + ".png", System.Drawing.Imaging.ImageFormat.Png);
                }
                System.Diagnostics.Process.Start("image1.png");
            }  
        }
    }
}

（代码取自https://www.e-iceblue.com/Tutorials/Spire.PDF/Spire.PDF-Program-Guide/How-to-Extract-Image-From-PDF-in-C.html）

如何使用PDFSharp从PDF中提取FlateDecoded图像

5 个答案: