Question

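I am using iTextSharp (C#) to extract images and their names from a catalog PDF. I can extract the images, but I am struggling to extract each image's corresponding name, as shown in the attached screenshot, and to save the file under that name. Please see the code below and let me know your suggestions. Sample PDF: https://docdro.id/PwBsNR9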

Code:

private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
{
    List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

    iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
    iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
    iTextSharp.text.pdf.PdfObject PDFObj = null;
    iTextSharp.text.pdf.PdfStream PDFStremObj = null;

    try
    {
        RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
        PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

        for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
        {
            PDFObj = PDFReaderObj.GetPdfObject(i);

            if ((PDFObj != null) && PDFObj.IsStream())
            {
                PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
                if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                {
                    try
                    {
                        iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                            new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();
                        ImgList.Add(ImgPDF);
                    }
                    catch (Exception)
                    {
                        // Skip images that cannot be converted to a System.Drawing.Image.
                    }
                }
            }
        }
    }
    finally
    {
        // Always close the reader; exceptions propagate with their original type and stack trace.
        if (PDFReaderObj != null) PDFReaderObj.Close();
    }
    return ImgList;
}

Answer 1

I hope this helps. I am working on the same kind of thing myself, so here it is in case it is useful.

Now you can save your stream.

// existing pdf path
PdfReader reader = new PdfReader(path);
PRStream pst;
PdfImageObject pio;
PdfObject po;
// number of objects in pdf document
int n = reader.XrefSize;
//FileStream fs = null;
// set image file location
//String path = "E:/";
for (int i = 0; i < n; i++)
{
    // get the object at the index i in the objects collection
    po = reader.GetPdfObject(i);
    // object not found so continue
    if (po == null || !po.IsStream())
        continue;
    //cast object to stream
    pst = (PRStream)po;
    //get the object type
    PdfObject type = pst.Get(PdfName.SUBTYPE);
    //check if the object is the image type object
    if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
    {
        //get the image
        pio = new PdfImageObject(pst);
        // read the image bytes into an array
        byte[] imgdata = pio.GetImageAsBytes();
        // save the bytes to disk; "E:/" is the placeholder folder from the commented-out line above
        File.WriteAllBytes("E:/image" + i + "." + pio.GetFileType(), imgdata);
    }
}

Answer 2

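Unfortunately, the sample PDF is not tagged. Therefore, one has to try to associate the title texts with the images either by analyzing their positions relative to each other or by exploiting a pattern in the content stream.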

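In the case at hand, analyzing the relative positions is feasible because the title is always drawn (at least partially) on the matching image or directly below it. So one could extract the text with positions from the page in a first pass and, in a second pass, extract the images with positions while looking up the title in the previously extracted text inside the image area or directly below it. Alternatively, one could first extract the images with position and size and then extract the text in those areas.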
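The position-based lookup could be sketched roughly as follows. This is only a minimal illustration of the matching step, not code from the answer: `Box`, `TextChunk`, and `CaptionMatcher` are hypothetical plain-data helpers, and the sketch assumes the image rectangles and text start points have already been collected (for example from the positions reported to a render listener) in the same PDF coordinate system, where y grows upwards.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain-data types; real values would be gathered while parsing the page.
public sealed class Box
{
    public double Left, Bottom, Right, Top;   // PDF user space: y grows upwards
    public Box(double left, double bottom, double right, double top)
    { Left = left; Bottom = bottom; Right = right; Top = top; }
}

public sealed class TextChunk
{
    public string Text;
    public double X, Y;                       // start point of the text baseline
    public TextChunk(string text, double x, double y) { Text = text; X = x; Y = y; }
}

public static class CaptionMatcher
{
    // Among the chunks that start horizontally within the image and vertically
    // between (image bottom - margin) and the image top, returns the text of the
    // one closest to the image's bottom edge, or null if there is no candidate.
    public static string FindCaption(Box image, IEnumerable<TextChunk> chunks, double margin)
    {
        TextChunk hit = chunks
            .Where(c => c.X >= image.Left && c.X <= image.Right
                     && c.Y >= image.Bottom - margin && c.Y <= image.Top)
            .OrderBy(c => Math.Abs(c.Y - image.Bottom))
            .FirstOrDefault();
        return hit == null ? null : hit.Text;
    }
}
```

The `margin` parameter is an assumption that stands in for "directly below"; a suitable value depends on the font size and layout of the actual catalog pages.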

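But there is also a pattern in the content stream: the title is always drawn in a single text drawing instruction immediately after the corresponding image is drawn. Thus, one can also go through the content in a single pass and extract each image together with the next drawn text as its associated title.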

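Both approaches can be implemented using the iText parser API. For example, in the case of the latter approach: first, implement a render listener that behaves as described, i.e. one that saves each image and the text that follows it: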

internal class ImageWithTitleRenderListener : IRenderListener
{
    int imageNumber = 0;
    String format;
    bool expectingTitle = false;

    public ImageWithTitleRenderListener(String format)
    {
        this.format = format;
    }

    public void BeginTextBlock()
    { }

    public void EndTextBlock()
    { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        if (expectingTitle)
        {
            expectingTitle = false;
            File.WriteAllText(string.Format(format, imageNumber, "txt"), renderInfo.GetText());
        }
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        imageNumber++;
        expectingTitle = true;

        PdfImageObject imageObject = renderInfo.GetImage();

        if (imageObject == null)
        {
            Console.WriteLine("Image {0} could not be read.", imageNumber);
        }
        else
        {
            File.WriteAllBytes(string.Format(format, imageNumber, imageObject.GetFileType()), imageObject.GetImageAsBytes());
        }
    }
}

Then parse the document pages using that render listener:

using (PdfReader reader = new PdfReader(@"EVERMOTION ARCHMODELS VOL.78.pdf"))
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageWithTitleRenderListener listener = new ImageWithTitleRenderListener(@"EVERMOTION ARCHMODELS VOL.78-{0:D3}.{1}");
    for (var i = 1; i <= reader.NumberOfPages; i++)
    {
        parser.ProcessContent(i, listener);
    }
}
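The listener above stores each title in a separate text file next to the numbered image. If the image itself should be saved under its title, as the question asks, the title first has to be turned into a valid file name. The following is a minimal sketch of that step only; `TitleFileNames` and `ToSafeFileName` are hypothetical helper names, not part of iTextSharp.

```csharp
using System;
using System.IO;
using System.Linq;

public static class TitleFileNames
{
    // Replaces characters that are not allowed in file names with '_'.
    // Returns the fallback (e.g. a numbered name) when the title is empty or whitespace.
    public static string ToSafeFileName(string title, string fallback)
    {
        if (string.IsNullOrWhiteSpace(title)) return fallback;
        char[] invalid = Path.GetInvalidFileNameChars();
        return new string(title.Trim().Select(c => invalid.Contains(c) ? '_' : c).ToArray());
    }
}
```

One possible use, under the same assumptions, would be to keep the image bytes and file type from `RenderImage` in fields and write the file in `RenderText` once the title is known, combining `ToSafeFileName(renderInfo.GetText(), ...)` with the stored file type.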

Extracting images and their names from a PDF using iTextSharp