Question

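When decoding an image within a PDF as FlateDecode via iTextSharp, the image is distorted and I can't seem to figure out why.

The bpp identified is Format1bppIndexed. If I modify the PixelFormat to Format4bppIndexed, the image is recognizable to some degree (shrunk, coloring is off but readable) and is repeated 4 times horizontally. If I adjust the pixel format to Format8bppIndexed, it is also recognizable to some degree and is repeated 8 times horizontally.

The image below was produced with the Format1bppIndexed pixel format approach. Unfortunately, I am unable to show the others due to security constraints.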

distorted image

The code is shown below; it is effectively the only solution I have come across on SO and the web.

int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);

string filter = ((PdfArray)tg.Get(PdfName.FILTER))[0].ToString();
string width = tg.Get(PdfName.WIDTH).ToString();
string height = tg.Get(PdfName.HEIGHT).ToString();
string bpp = tg.Get(PdfName.BITSPERCOMPONENT).ToString();

if (filter == "/FlateDecode")
{
   bytes = PdfReader.FlateDecode(bytes, true);

   System.Drawing.Imaging.PixelFormat pixelFormat;
   switch (int.Parse(bpp))
   {
      case 1:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
         break;
      case 8:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
         break;
      case 24:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
         break;
      default:
         throw new Exception("Unknown pixel format " + bpp);
   }

   var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
   System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new System.Drawing.Rectangle(0, 0, Int32.Parse(width),
             Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
   Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
   bmp.UnlockBits(bmd);
   bmp.Save(@"C:\temp\my_flate_picture-" + DateTime.Now.Ticks.ToString() + ".png", ImageFormat.Png);
}

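What do I need to do to get my image extraction to work when dealing with FlateDecode?

NOTE: I do not want to use another library to extract the images. I am looking for a solution leveraging ONLY iTextSharp and the .NET FW. If a solution exists via Java (iText) and is easily portable to the .NET FW bits, that would suffice as well.

UPDATE: The ImageMask property is set to true, which implies that there is no color space and the image is therefore implicitly black and white. With the bpp coming in at 1, the PixelFormat should be Format1bppIndexed which, as mentioned earlier, produces the embedded image seen above.

UPDATE: To get the image size I extracted it using Acrobat X Pro, and the image size for this particular example was listed as 2403x3005. When extracting via iTextSharp, the size was listed as 2544x3300. I modified the image size within the debugger to mirror 2403x3005; however, upon calling Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length); I get an exception:

Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

My assumption is that this is due to the modification of the size, which then no longer corresponds to the byte data being used.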

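UPDATE: Per Jimmy's recommendation, I checked whether the byte[] returned by PdfReader.GetStreamBytes has a length equal to width * height / 8, since GetStreamBytes should itself be calling FlateDecode. Manually calling FlateDecode and calling PdfReader.GetStreamBytes both yielded a byte[] length of 1049401, while width * height / 8 is 2544 * 3300 / 8, or 1049400, so there is a difference of 1. I am not sure whether this off-by-one is the root cause; if it is, I do not know how to resolve it.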
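The arithmetic in this update can be reproduced with a small sketch. It assumes, as the PDF format requires, that every image row starts on a byte boundary, so the bit count is rounded up per row rather than once for the whole image. DecodedSizeCheck and its methods are names made up for illustration; they are not part of iTextSharp.

```csharp
using System;

static class DecodedSizeCheck
{
    // Bytes in one packed row of the decoded stream.
    // The bit count is rounded up to a whole byte for each row.
    public static int PackedRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Total number of bytes the decoded image data should occupy.
    public static int ExpectedLength(int width, int height, int bitsPerPixel)
    {
        return PackedRowBytes(width, bitsPerPixel) * height;
    }
}
```

For the values above, PackedRowBytes(2544, 1) gives 318 and ExpectedLength(2544, 3300, 1) gives 1049400, one byte less than the 1049401 bytes actually returned.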

UPDATE: In trying the method mentioned by kuujinbo, I encounter an IndexOutOfRangeException when I attempt to call renderInfo.GetImage(); within the RenderImage listener. The fact that width * height / 8, as stated earlier, is off by 1 compared to the byte[] length when calling PdfReader.GetStreamBytes leads me to think these are all one and the same problem; however, a solution still eludes me.

   at System.util.zlib.Adler32.adler32(Int64 adler, Byte[] buf, Int32 index, Int32 len)
   at System.util.zlib.ZStream.read_buf(Byte[] buf, Int32 start, Int32 size)
   at System.util.zlib.Deflate.fill_window()
   at System.util.zlib.Deflate.deflate_slow(Int32 flush)
   at System.util.zlib.Deflate.deflate(ZStream strm, Int32 flush)
   at System.util.zlib.ZStream.deflate(Int32 flush)
   at System.util.zlib.ZDeflaterOutputStream.Write(Byte[] b, Int32 off, Int32 len)
   at iTextSharp.text.pdf.codec.PngWriter.WriteData(Byte[] data, Int32 stride)
   at iTextSharp.text.pdf.parser.PdfImageObject.DecodeImageBytes()
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples)
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
   at iTextSharp.text.pdf.parser.ImageRenderInfo.PrepareImageObject()
   at iTextSharp.text.pdf.parser.ImageRenderInfo.GetImage()
   at cyos.infrastructure.Core.MyImageRenderListener.RenderImage(ImageRenderInfo renderInfo)

UPDATE: Varying the methods from my original solution listed here, as well as the solution posed by kuujinbo, against a different page in the PDF does produce images; however, the issue always surfaces when the filter type is FlateDecode, and no image is produced for that given instance.

Answer 1

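Try copying the data row by row; that may solve the problem. A Bitmap's rows are padded to bmpData.Stride bytes, which can be larger than the packed row length in the PDF stream.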

int w = imgObj.GetAsNumber(PdfName.WIDTH).IntValue;
int h = imgObj.GetAsNumber(PdfName.HEIGHT).IntValue;
int bpp = imgObj.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;
var pixelFormat = PixelFormat.Format1bppIndexed;

byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObj);
byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, imgObj.GetAsDict(PdfName.DECODEPARMS));
// byte[] streamBytes = PdfReader.GetStreamBytes((PRStream)imgObj); // same result as above 3 lines of code.

using (Bitmap bmp = new Bitmap(w, h, pixelFormat))
{
    var bmpData = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.WriteOnly, pixelFormat);
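    // Packed row length in the decoded stream; bmpData.Stride may be larger
    // because bitmap rows are padded.
    int length = (int)Math.Ceiling(w * bpp / 8.0);
    for (int i = 0; i < h; i++)
    {
        int offset = i * length;
        int scanOffset = i * bmpData.Stride;
        // ToInt64 avoids the OverflowException that ToInt32 can throw in a 64-bit process.
        Marshal.Copy(streamBytes, offset, new IntPtr(bmpData.Scan0.ToInt64() + scanOffset), length);
    }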
    bmp.UnlockBits(bmpData);

    bmp.Save(fileName);
}
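A likely explanation for why the row-by-row copy helps: the decoded stream stores rows back to back (318 bytes each for a 2544-pixel-wide 1 bpp image), while a Bitmap pads each row to a 4-byte boundary (320 bytes here), so one bulk copy shifts every successive row a little further. The sketch below shows the same re-packing on plain byte arrays, without System.Drawing. It assumes the stride is the row size rounded up to 4 bytes; RowPadding and its methods are hypothetical helper names, not iTextSharp or .NET API.

```csharp
using System;

static class RowPadding
{
    // Bytes per packed row, as stored in the decoded PDF stream.
    public static int PackedRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Bytes per row after rounding up to a 4-byte boundary
    // (assumed here to match BitmapData.Stride for such a bitmap).
    public static int AlignedStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Copies each packed row to the start of its padded row.
    // Padding bytes stay zero; extra trailing input bytes are ignored.
    public static byte[] PadRows(byte[] packed, int width, int height, int bitsPerPixel)
    {
        int rowBytes = PackedRowBytes(width, bitsPerPixel);
        int stride = AlignedStride(width, bitsPerPixel);
        if (packed.Length < rowBytes * height)
            throw new ArgumentException("Not enough data for the stated dimensions.", "packed");

        byte[] padded = new byte[stride * height];
        for (int row = 0; row < height; row++)
        {
            Buffer.BlockCopy(packed, row * rowBytes, padded, row * stride, rowBytes);
        }
        return padded;
    }
}
```

If the padded length equals bmpData.Stride * h, the result of PadRows could be written to Scan0 with a single Marshal.Copy; otherwise the per-row loop above, which reads the stride from bmpData, is the safer choice.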

Answer 2

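If you are able to use the latest version (5.1.3), the API for extracting FlateDecode and other image types has been simplified by the iTextSharp.text.pdf.parser namespace. Basically, you use a PdfReaderContentParser to help you parse the PDF document, then implement the IRenderListener interface specifically (in this case) to deal with images. Here is a working example HTTP handler: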

<%@ WebHandler Language="C#" Class="bmpExtract" %>
using System;
using System.Collections.Generic;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class bmpExtract : IHttpHandler {
  public void ProcessRequest (HttpContext context) {
    HttpServerUtility Server = context.Server;
    HttpResponse Response = context.Response;
    PdfReader reader = new PdfReader(Server.MapPath("./bmp.pdf"));
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    MyImageRenderListener listener = new MyImageRenderListener();
    for (int i = 1; i <= reader.NumberOfPages; i++) {
      parser.ProcessContent(i, listener);
    } 
    for (int i = 0; i < listener.Images.Count; ++i) {
      string path = Server.MapPath("./" + listener.ImageNames[i]);
      using (FileStream fs = new FileStream(
        path, FileMode.Create, FileAccess.Write
      ))
      {
        fs.Write(listener.Images[i], 0, listener.Images[i].Length);
      }
    }         
  }
  public bool IsReusable { get { return false; } }

  public class MyImageRenderListener : IRenderListener {
    public void RenderText(TextRenderInfo renderInfo) { }
    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();
    public void RenderImage(ImageRenderInfo renderInfo) {
      PdfImageObject image = null;
      try {
        image = renderInfo.GetImage();
        if (image == null) return;

        // Decode first, so a failure cannot leave the two lists out of step.
        byte[] imageBytes = image.GetImageAsBytes();
        ImageNames.Add(string.Format(
          "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
        ));
        Images.Add(imageBytes);
      }
      catch (IOException) {
        // pass-through; image type not supported by iText[Sharp]; e.g. jbig2
      }
    }
  }
}

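The iText[Sharp] development team is still working on the implementation, so I can't say for sure whether it will work in your case. But it does work on this simple example PDF (used above) and on a couple of other PDFs with bitmap images that I tried.

EDIT: I have been experimenting with the new API and made a mistake in the original code example above. I should have initialized the PdfImageObject to null outside the try..catch block. The correction has been made above.

Also, when I use the above code on an unsupported image type (e.g. jbig2), I get a different exception: "The color depth XX is not supported", where "XX" is a number. In all the examples I have tried, iTextSharp does support FlateDecode. (That doesn't help in your case, I know.)

Was the PDF generated by third-party (non-Adobe) software? From what I have read in the book, some third-party vendors produce PDFs that are not completely up to spec, and iText[Sharp] can't deal with some of these PDFs, while Adobe products can. IIRC, I have seen cases on the iText mailing list, specific to some PDFs generated by Crystal Reports, that caused problems; here's one thread.

Is there any way you can generate a test PDF with the software you are using and some non-sensitive FlateDecode image(s)? Then maybe someone here could help a little better.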

Why is my image distorted when decoding as FlateDecode using iTextSharp?
