Question

I am trying to extract images from a PDF file using iTextSharp.

The sample PDF I am using is here.

The code I am using is:

static void Main(string[] args)
    {

        try
        {
            WriteImageFile(); // write image file
            System.Console.WriteLine(AppDomain.CurrentDomain.BaseDirectory);
            System.Console.ReadLine();
        }
        catch (Exception ex)
        {
            System.Console.WriteLine(ex.Message);
        }
    }

    private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
    {
        List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

        iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
        iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
        iTextSharp.text.pdf.PdfObject PDFObj = null;
        iTextSharp.text.pdf.PdfStream PDFStremObj = null;

        try
        {
            RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
            PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
            if (PDFReaderObj.IsOpenedWithFullPermissions)
            {
                Debug.Print("this is a test");
            }

            for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
            {
                PDFObj = PDFReaderObj.GetPdfObject(i);

                if ((PDFObj != null) && PDFObj.IsStream())
                {
                    PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                    iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                    if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                    {
                        byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        if ((bytes != null))
                        {
                            try
                            {
                                System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);

                                MS.Position = 0;
                                System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);

                                ImgList.Add(ImgPDF);

                            }
                            catch (Exception e)
                            {
                                Console.WriteLine  ("Exception in extract: " + e);
                            }
                        }
                    }
                }
            }
            PDFReaderObj.Close();
        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
        return ImgList;
    }


    private static void WriteImageFile()
    {
        try
        {
            System.Console.WriteLine("Wait for extracting image from PDF file....");

            // Get a List of Image
            List<System.Drawing.Image> ListImage = ExtractImages(@"C:\Users\pradyut.bhattacharya\Documents\CEVA PDF\more\CS_75.pdf");

            for (int i = 0; i < ListImage.Count; i++)
            {
                try
                {
                    // Write Image File
                    ListImage[i].Save(@"C:\Users\pradyut.bhattacharya\Documents\CEVA PDF\more\Image" + i + ".jpeg", System.Drawing.Imaging.ImageFormat.Jpeg);
                    System.Console.WriteLine("Image" + i + ".jpeg write sucessfully");
                }
                catch (Exception)
                { }
            }

        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
    }

In some cases I do get the images, but for most PDFs that contain scanned pages I get this error:

    A first chance exception of type 'System.ArgumentException' occurred in System.Drawing.dll
    Exception in extract: System.ArgumentException: Parameter is not valid.
       at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
       at System.Drawing.Image.FromStream(Stream stream)
       at ConsoleApplication1.Program.ExtractImages(String PDFSourcePath) in C:\Users\pradyut.bhattacharya\Documents\Visual Studio 2010\Projects\ConsoleApplication2\ConsoleApplication2\Program.cs:line 67
    A first chance exception of type 'System.ArgumentException' occurred in System.Drawing.dll
    Exception in extract: System.ArgumentException: Parameter is not valid.
       at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
       at System.Drawing.Image.FromStream(Stream stream)
       at ConsoleApplication1.Program.ExtractImages(String PDFSourcePath) in C:\Users\pradyut.bhattacharya\Documents\Visual Studio 2010\Projects\ConsoleApplication2\ConsoleApplication2\Program.cs:line 67

Any help would be appreciated.

Thanks

Answer 1

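Images in a PDF can be stored in several ways. Your code works for every type that the .NET Framework has a decoder for, but it fails for the types it has no decoder for. In particular, your code fails because the images in that PDF are encoded with JBIG2Decode. You can see this by inspecting the /Filter entry of PDFStremObj: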

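// /Filter is optional, so Get() can return null; comparing from the
// constant side keeps the check null-safe.
PdfObject filterType = PDFStremObj.Get(PdfName.FILTER);
if (PdfName.JBIG2DECODE.Equals(filterType)) {
    //...
}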

For the types the framework does not understand, unfortunately you need a library or have to write your own decoder.
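To make the distinction above concrete, here is a minimal sketch that sorts a /Filter name into "can be passed to Image.FromStream as-is" versus "needs something else". The `PdfImageFilters` class and the `ImageStreamHandling` enum are hypothetical names made up for this illustration, not part of iTextSharp; the filter names are the standard ones from the PDF specification. It assumes the name is passed as a slash-prefixed string such as `filterType.ToString()` would give, and it treats only /DCTDecode as directly loadable, because that is the one filter whose raw stream bytes are already a complete image file (a JPEG).

```csharp
using System;

// Hypothetical helper types for illustration only; not an iTextSharp API.
public enum ImageStreamHandling
{
    LoadDirectly,          // raw bytes are a complete JPEG file
    NeedsExternalDecoder,  // System.Drawing has no decoder for the raw bytes
    NeedsPixelRebuild      // bytes are (compressed) raw samples, not an image file
}

public static class PdfImageFilters
{
    // filterName is expected in slash-prefixed form, e.g. "/JBIG2Decode".
    // A null name (no /Filter entry) falls through to NeedsPixelRebuild.
    public static ImageStreamHandling Classify(string filterName)
    {
        switch (filterName)
        {
            case "/DCTDecode":
                return ImageStreamHandling.LoadDirectly;
            case "/JBIG2Decode":
            case "/JPXDecode":
            case "/CCITTFaxDecode":
                return ImageStreamHandling.NeedsExternalDecoder;
            default:
                // e.g. /FlateDecode or /LZWDecode: the stream holds pixel
                // samples that must be decompressed and wrapped in a bitmap.
                return ImageStreamHandling.NeedsPixelRebuild;
        }
    }
}
```

In the extraction loop from the question, this could be used to call `Image.FromStream` only when the result is `LoadDirectly`, and to log the filter name otherwise instead of hitting the ArgumentException.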

See this post for some other libraries that do it. Here's Wikipedia's entry on JBIG2 if you want to try doing it yourself. And here's one more post that shows some encoders that might also support decoding.
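Independently of the /Filter entry, a cheap sanity check before handing bytes to `Image.FromStream` is to look at the first two bytes: a JPEG file always starts with the SOI marker `0xFF 0xD8`. The sketch below is only an illustration of that check; `ImageSignatures` and `LooksLikeJpeg` are hypothetical names, and passing the check does not guarantee that GDI+ can decode the file.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class ImageSignatures
{
    // Returns true when the buffer starts with the JPEG SOI marker (FF D8).
    public static bool LooksLikeJpeg(byte[] bytes)
    {
        return bytes != null
            && bytes.Length >= 2
            && bytes[0] == 0xFF
            && bytes[1] == 0xD8;
    }
}
```

Streams that fail this check (JBIG2 data, for example) can then be routed to a dedicated decoder instead of being passed to System.Drawing.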

Answer 2

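An old question, I know, but I actually found a rather good solution. I also had a hard time extracting images from PDFs that use JBig2 encoding. Newer versions of iTextSharp (after 4.1.6) do support it, but those versions are now under the AGPL license.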

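Using version 1 of this library by JPedal (version 2 is not free), you can convert a JBig2-encoded image to a System.Drawing.Bitmap and save or modify it as needed. Note, however, that this library only decodes data; it cannot encode an image into the JBig2 format.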

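One small, very small caveat is that the library is written in Java. That is not a problem at all for C# users, thanks to IKVM. In case you don't know it, IKVM is a complete Java VM that runs on .NET and comes with a native .NET implementation of the Java class library. It is easy to set up, and I fully tested all of this just 2 hours ago.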

After downloading IKVM and the JBig2 jar from the links above, you can run this command to have IKVM convert the jar into a native .NET dll.

ikvmc -target:library [path to jbig2.jar]

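This outputs a .NET dll named jbig2.dll, either into the directory of the jar or into the directory of the ikvmc executable (I don't remember which). Then, in your project, reference jbig2.dll, IKVM.OpenJDK.Core, IKVM.OpenJDK.Media, IKVM.OpenJDK.SwingAWT, and IKVM.Runtime. I used code similar to the following to extract the images: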

// code to iterate over PDF objects and get bytes of a valid image elided
var imageBytes = GetRawImageBytesFromPdf();

if (filterType.Equals(PdfName.JBIG2DECODE))
{
    var jbg2 = new JBIG2Decoder();

    // Some JBig2 will extract without setting the JBig2Globals
    var decodeParams = stream.GetAsDict(PdfName.DECODEPARMS);
    if(decodeParams != null)
    {
        var globalRef = decodeParams.GetAsIndirectObject(
                                        PdfName.JBIG2GLOBALS);
        if(globalRef != null)
        {
            var globals = PdfReader.GetPdfObject(globalRef);
            var globalStream = globals as PRStream;
            // Skip the globals if the referenced object is not a stream.
            var globalBytes = globalStream != null
                ? PdfReader.GetStreamBytesRaw(globalStream)
                : null;

            if (globalBytes != null)
            {
                jbg2.setGlobalData(globalBytes);
            }
        }
    }

    jbg2.decodeJBIG2(imageBytes);

    var pages = jbg2.getNumberOfPages();

    for(int p = 0; p < pages; p++)
    {
        java.awt.image.BufferedImage bufImg = jbg2.getPageAsBufferedImage(p);

        var bitmap = bufImg.getBitmap();
        // Include the page index so later pages don't overwrite earlier ones.
        bitmap.Save(@"c:\path\to\file_" + p + ".tif", ImageFormat.Tiff);
        // note: I am unsure about the need to free the memory of the internal
        //       bitmap used in the BufferedImage class.  The docs for IKVM and
        //       that class should probably be consulted to find out if that
        //       should be done.
    }
}
// handle other formats like CCITTFAXDECODE

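Although the library is not the fastest (this has nothing to do with running it under IKVM; the developers admit that version 1 of the library is inefficient), it does the job well. I don't enjoy writing or editing Java code, so if I wanted to improve the speed myself, I think I would just port it directly to C#. However, another fork of this Java code, at this github project, claims a 2.5-4.5x speed improvement. You can compile that jar and run ikvmc on it.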

Hope this helps anyone still looking for a solution to this problem!

Answer 3

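Thanks for sharing this idea.

This is the most elegant solution I have found that works with the free version of iTextSharp.

As you suggested, I included these libraries: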

jbig2dec.dll (generated from the prompt: > ikvmc jbig2dec.jar)
ICSharpCode.SharpZipLib
IKVM.Runtime
IKVM.OpenJDK.Core
IKVM.OpenJDK.Media
IKVM.OpenJDK.SwingAWT
