Question

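I want to extract images from a PDF in my web form. The code below works, but it also picks up irrelevant images, such as the frame graphics of slides and blank images that are entirely black or white. I think it is also extracting the background images.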

// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser; using System.Web.UI.WebControls;
PdfReader reader = new PdfReader(@"E:\Uni_Stuff\waleed 8th semester\DWDM\dwdm011.pdf");
PRStream pst;
PdfImageObject pio;
PdfObject po;
int n = reader.XrefSize; // number of objects in the PDF document
try
{
    for (int i = 0; i < n; i++)
    {
        po = reader.GetPdfObject(i); // get the object at index i in the cross-reference table
        if (po == null || !po.IsStream()) // skip missing objects and anything that is not a stream
            continue;
        pst = (PRStream)po; // cast the object to a stream
        PdfObject type = pst.Get(PdfName.SUBTYPE); // get the stream's subtype
        // check whether the stream is an image object
        if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
        {
            pio = new PdfImageObject(pst); // wrap the stream as an image
            byte[] imgdata = pio.GetImageAsBytes();
            Image img = new Image();
            // note: this assumes JPEG data; the embedded image may use another format
            img.ImageUrl = "data:image/jpeg;base64," + Convert.ToBase64String(imgdata);
            PlaceHolder1.Controls.Add(img);
        }
    }
}
catch (Exception ex)
{
    Response.Write(ex.Message);
}
finally
{
    reader.Close(); // release the file handle
}

Now I want to exclude those irrelevant images and keep only the images that contain actual content.

Answer 1

I have used this library in the past, and I believe it can do everything you need. Give it a try.

http://www.winnovative-software.com/PdfImgExtractor.aspx
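
If adding a third-party library is not an option, one possible approach is to filter the extracted bytes yourself before adding them to the page. The following is only a rough sketch, not something iTextSharp provides: `ImageFilter`, `IsLikelyBlank`, and `IsWorthShowing` are hypothetical names, and the size threshold, sampling step, and tolerance are guesses you would need to tune for your slides. It assumes `System.Drawing` (GDI+) can decode the bytes returned by `GetImageAsBytes()`; anything it cannot decode is skipped.

```csharp
using System;
using System.Drawing;
using System.IO;

// Hypothetical helper: heuristics for skipping tiny or uniformly colored images.
public static class ImageFilter
{
    // Returns true when every sampled pixel has nearly the same brightness,
    // which is typical of all-black or all-white filler images.
    // Only every 'step'-th pixel is sampled, so very thin details can be missed.
    public static bool IsLikelyBlank(Bitmap bmp, int step = 8, int tolerance = 10)
    {
        int min = 255, max = 0;
        for (int y = 0; y < bmp.Height; y += step)
        {
            for (int x = 0; x < bmp.Width; x += step)
            {
                Color c = bmp.GetPixel(x, y);
                int gray = (c.R + c.G + c.B) / 3;
                if (gray < min) min = gray;
                if (gray > max) max = gray;
                if (max - min > tolerance) return false; // enough variation: not blank
            }
        }
        return true;
    }

    // Returns false for images that are too small, blank, or not decodable.
    // minWidth and minHeight are assumed thresholds, not values from any library.
    public static bool IsWorthShowing(byte[] imgdata, int minWidth = 50, int minHeight = 50)
    {
        try
        {
            using (var ms = new MemoryStream(imgdata))
            using (var bmp = new Bitmap(ms))
            {
                if (bmp.Width < minWidth || bmp.Height < minHeight) return false;
                return !IsLikelyBlank(bmp);
            }
        }
        catch (ArgumentException)
        {
            return false; // GDI+ could not decode these bytes
        }
    }
}
```

In the loop from the question, this would be called right after `GetImageAsBytes()`, for example `if (!ImageFilter.IsWorthShowing(imgdata)) continue;`. Slide frames and backgrounds that do contain varied pixels will still pass this check, so it only deals with the blank and very small images.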
