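Extracting images from a PDF using iText

Time: 2011-10-21 19:44:33

Tags: c# itext

I have been using iText functions to read plain text from PDF files. Is it also possible to read images from a PDF file using iText in C#?

2 answers:

Answer 0: (score: 3)

You could try something like this...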

    using System;
    using System.IO;
    using System.Drawing.Imaging;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
    {
        // NOTE:  This will only get the first image it finds per page.
        // The stream bytes are read raw (undecoded), so Image.FromStream can only
        // load them when the stream is itself an image file, e.g. a JPEG.
        PdfReader pdf = new PdfReader(sourcePdf);
        try
        {
            if (!Directory.Exists(outputPath))
                Directory.CreateDirectory(outputPath);

            for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
            {
                PdfDictionary pg = pdf.GetPageN(pageNumber);
                PdfDictionary res =
                  (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
                if (res == null)
                    continue;
                PdfDictionary xobj =
                  (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
                if (xobj == null)
                    continue;

                foreach (PdfName name in xobj.Keys)
                {
                    PdfObject obj = xobj.Get(name);
                    if (!obj.IsIndirect())
                        continue;

                    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                    PdfName type =
                      (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                    if (!PdfName.IMAGE.Equals(type))
                        continue;

                    // Look the stream up by its object number and read its raw bytes.
                    int xrefIndex = ((PRIndirectReference)obj).Number;
                    PdfStream pdfStream = (PdfStream)pdf.GetPdfObject(xrefIndex);
                    byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
                    if (bytes == null)
                        continue;

                    using (MemoryStream memStream = new MemoryStream(bytes))
                    using (System.Drawing.Image img = System.Drawing.Image.FromStream(memStream))
                    {
                        // must save the file while stream is open.
                        string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                        EncoderParameters parms = new EncoderParameters(1);
                        parms.Param[0] = new EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                        // GetImageEncoder is found below this method
                        ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
                        img.Save(path, jpegEncoder, parms);
                    }
                    break; // first image on this page saved; move on to the next page
                }
            }
        }
        finally
        {
            pdf.Close();
        }
    }

    // Returns the installed encoder whose format description matches imageType
    // (for example "JPEG"), or null if there is none.
    public static ImageCodecInfo GetImageEncoder(string imageType)
    {
        imageType = imageType.ToUpperInvariant();
        foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
        {
            if (info.FormatDescription == imageType)
            {
                return info;
            }
        }
        return null;
    }

I hope this helps...
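
A note on why a saved image may fail to load: the code reads the image stream with GetStreamBytesRaw, which returns the bytes as stored in the PDF without decoding them. System.Drawing can only open those bytes when they already form a complete image file, which is the usual case for JPEG (DCTDecode) images but not for Flate-compressed bitmaps. A cheap guard is to check for the JPEG start-of-image marker before calling Image.FromStream. The sketch below shows that check in Java (the language of the other answer here); the class and method names are made up for illustration, and the same three comparisons port directly to C#.

```java
// Hypothetical helper, not part of iText: checks whether raw stream bytes
// begin with the JPEG start-of-image marker (FF D8 FF).
class RawImageCheck {

    static boolean looksLikeJpeg(byte[] bytes) {
        return bytes != null
                && bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xFF   // mask because Java bytes are signed
                && (bytes[1] & 0xFF) == 0xD8
                && (bytes[2] & 0xFF) == 0xFF;
    }
}
```

Streams that fail this check would need to be decoded first rather than passed to the image loader as-is.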

Answer 1: (score: -2)

Hi, this isn't C#, but here is my Java code; I hope you can adapt it to extract images in C#.

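// iText 5.x package names
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

public ByteArrayOutputStream extractImages(byte[] pdf) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ZipOutputStream zip = new ZipOutputStream(baos);
    // The listener writes every image it is handed into the zip stream.
    MyImageRenderer listener = new MyImageRenderer(zip);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        parser.processContent(i, listener);
    }
    zip = listener.getZip();
    zip.close(); // finish the archive before handing back the buffer
    reader.close();
    return baos;
}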

MyImageRenderer is a class that implements the RenderListener interface, and this is the method I wrote to handle the images.

public void renderImage(ImageRenderInfo renderInfo) {
    try {
        PdfImageObject image = renderInfo.getImage();
        if (image == null)
            return;
        // img is a format string defined elsewhere in MyImageRenderer (not shown);
        // it is filled with the image's object number and file type.
        ZipEntry entry = new ZipEntry(String.format(img, renderInfo
                .getRef().getNumber(), image.getFileType()));
        System.out.println(image.getFileType());
        zip.putNextEntry(entry);
        zip.write(image.getImageAsBytes());
        zip.closeEntry();
    } catch (IOException ioex) {
        ioex.printStackTrace();
    }
}

I know this code is written in Java, but it is meant to give you a general idea.
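
The zip handling in renderImage can be tried on its own, without iText. The following is a minimal sketch using only the JDK: it assumes a naming pattern of "img%s.%s" (the answer does not show what the img format string actually contains), and the class and method names are hypothetical. Unlike the original, it rethrows I/O errors unchecked instead of printing them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for the zip-writing part of MyImageRenderer.
class ImageZipSketch {
    // Assumed pattern: object number, then file type (e.g. "img12.jpg").
    static final String ENTRY_NAME_FORMAT = "img%s.%s";

    private final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    private final ZipOutputStream zip = new ZipOutputStream(baos);

    // Adds one image as its own zip entry, as renderImage does per image.
    void addImage(int objectNumber, String fileType, byte[] data) {
        try {
            zip.putNextEntry(new ZipEntry(
                    String.format(ENTRY_NAME_FORMAT, objectNumber, fileType)));
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes the archive and returns its bytes.
    byte[] finish() {
        try {
            zip.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    // Reads an in-memory zip back and lists its entry names in order.
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

One thing to watch for with this naming scheme: ZipOutputStream rejects a second entry with the same name, so an image that is drawn on several pages (same object number each time) would need to be skipped or renamed after its first occurrence.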