Question

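I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF together with their positions and sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing text in a different font (ttf). Suddenly these texts are split into single characters or small fragments.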

For example, "Detail" is no longer one object in the locationalResult list but is split into six items (D, e, t, a, i, l).

I tried using the HorizontalTextExtractionStrategy by making the getLocationalResult method public:

public List<TextChunk> GetLocationalResult()
{
    return (List<TextChunk>)locationalResultField.GetValue(this);
}

and using the PdfReaderContentParser to extract the texts:

reader = new PdfReader("some_pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// i is the 1-based page number; the strategy has to be instantiated with new
var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());

foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
{
    // Do something with the chunk     
}

But this returns the same result. Is there any other way to extract connected texts from a PDF?

Answer 1

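"I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF together with their positions and sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing text in a different font (ttf). Suddenly these texts are split into single characters or small fragments."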

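This problem is due to wrong expectations about the contents of the private list member variable LocationTextExtractionStrategy.locationalResult.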

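This list of TextChunk instances contains the pieces of text as they were forwarded to the strategy by the parsing framework (or possibly as they were preprocessed by some filter class), and the framework forwards each single string it encounters in a content stream separately.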

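Thus, if a seemingly connected word is actually drawn in the content stream using multiple strings, you get multiple TextChunk instances for it.

There actually is some "intelligence" in the method getResultantText, which connects these chunks properly, adds a space where necessary, and so on.

In the case of your document, "DETAIL" usually is drawn like this:

[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ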

As you can see, there are slight moves of the text insertion point between 'D' and 'E', 'T' and 'A', 'I' and 'L', and 'L' and ' '. (Such mini moves usually represent kerning.) Thus, you get individual TextChunk instances for 'D', 'ET', 'AI', 'L', and ' '.
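To make the fragmentation concrete, here is a minimal sketch in plain C# (no iText types) that splits the TJ array above into its string operands, one per resulting chunk. The class and method names are hypothetical. The decoding rule (character = glyph code + 0x1D) is an assumption that merely happens to fit this sample; a real extractor has to use the font's encoding or ToUnicode map. The shift helper follows the PDF rule that a TJ number is given in thousandths of a text space unit and is subtracted from the current position, and it assumes 100% horizontal scaling.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TjArraySketch
{
    // Assumption for this sample only: character = two-byte glyph code + 0x1D.
    private const int AssumedGlyphOffset = 0x1D;

    // Returns one decoded string per hex string operand found in the array text.
    public static List<string> SplitStrings(string tjArray)
    {
        var result = new List<string>();
        int pos = 0;
        while ((pos = tjArray.IndexOf('<', pos)) >= 0)
        {
            int end = tjArray.IndexOf('>', pos);
            if (end < 0) break; // unterminated hex string, stop scanning

            string hex = tjArray.Substring(pos + 1, end - pos - 1);
            var text = new StringBuilder();
            // Each glyph code is assumed to be four hex digits (two bytes).
            for (int i = 0; i + 4 <= hex.Length; i += 4)
            {
                int glyph = int.Parse(hex.Substring(i, 4),
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                text.Append((char)(glyph + AssumedGlyphOffset));
            }
            result.Add(text.ToString());
            pos = end + 1;
        }
        return result;
    }

    // Horizontal move (in text space units) caused by a numeric TJ entry,
    // assuming 100% horizontal scaling.
    public static double ShiftFor(double adjustment, double fontSize)
    {
        return -adjustment / 1000.0 * fontSize;
    }
}
```

Called with the array shown above, SplitStrings yields five separate strings, which matches the five chunks the strategy receives. At an assumed font size of 12, an entry of -0.2 moves the insertion point by only 0.0024 units, which is why the pieces still look like one word on the page.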

Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.

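That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.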

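The HorizontalTextExtractionStrategy is derived from LocationTextExtractionStrategy and differs from it mainly in the way it arranges the TextChunk instances into a single string. Thus, you will see the same fragmentation there.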

"Is there any other way to extract connected texts from a PDF?"

If you want "connected texts" as in "atomic string objects in the content stream", you already have them.

If you want "connected texts" as in "visually connected texts, no matter where in the content stream the constituent letters are drawn", you have to glue those TextChunk instances together the way LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in getResultantText, in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.
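The gluing idea can be sketched roughly as follows. This is not iText's actual algorithm: Fragment and FragmentMerger are hypothetical stand-ins written in plain C#, only horizontal text is handled, and the two thresholds are illustrative guesses. The sketch sorts fragments by line and start position and concatenates neighbours on the same baseline whose horizontal gap is small.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an iText type.
public sealed class Fragment
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }
    public float BaselineY { get; }

    public Fragment(string text, float startX, float endX, float baselineY)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        BaselineY = baselineY;
    }
}

public static class FragmentMerger
{
    // Merges fragments that sit on the same baseline and (nearly) touch.
    // maxGap and lineTolerance are assumed values; tune them per document.
    public static List<Fragment> Merge(IEnumerable<Fragment> fragments,
        float maxGap = 1.0f, float lineTolerance = 0.5f)
    {
        // Top of the page first, then left to right within a line.
        var ordered = fragments
            .OrderByDescending(f => f.BaselineY)
            .ThenBy(f => f.StartX)
            .ToList();

        var merged = new List<Fragment>();
        foreach (Fragment next in ordered)
        {
            Fragment last = merged.Count > 0 ? merged[merged.Count - 1] : null;
            bool sameLine = last != null
                && Math.Abs(last.BaselineY - next.BaselineY) <= lineTolerance;
            bool adjacent = sameLine && next.StartX - last.EndX <= maxGap;

            if (adjacent)
            {
                // Extend the previous fragment instead of starting a new one.
                merged[merged.Count - 1] = new Fragment(
                    last.Text + next.Text,
                    last.StartX,
                    Math.Max(last.EndX, next.EndX),
                    last.BaselineY);
            }
            else
            {
                merged.Add(next);
            }
        }
        return merged;
    }
}
```

A real implementation would also have to deal with rotated text, baselines that differ slightly within one line (the simple sort above can misorder those), and inserting a space when the gap is about one space width, which is what getResultantText takes care of.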

Answer 2

After debugging deep into the iTextSharp library I figured out that my texts are drawn with the TJ operator, as mkl also mentioned.

[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ

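iText processes these texts not as a single PdfString but as an array of PdfObjects, which ends up calling renderListener.RenderText(renderInfo) for each PdfString item in it (see the ShowTextArray class and the DisplayPdfString method). In the RenderText method, however, the information about the relation of the PDF strings within the array is lost, and every item is added to locationalResult as an independent object.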

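As my goal is to extract the "argument" of a single text drawing instruction, I extended the PdfContentStreamProcessor class with a new method ProcessTexts, which returns a list of these atomic strings. My workaround is not very pretty, as I had to copy and paste some private fields and methods from PdfContentStreamProcessor, but it works for me.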

class PdfContentStreamProcessorEx : PdfContentStreamProcessor
{
    private IDictionary<int, CMapAwareDocumentFont> cachedFonts = new Dictionary<int, CMapAwareDocumentFont>();
    private ResourceDictionary resources = new ResourceDictionary();
    private CMapAwareDocumentFont font = null;

    public PdfContentStreamProcessorEx(IRenderListener renderListener) : base(renderListener)
    {
    }

    public List<string> ProcessTexts(byte[] contentBytes, PdfDictionary resources)
    {
        this.resources.Push(resources);
        var texts = new List<string>();
        PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(contentBytes)));
        PdfContentParser ps = new PdfContentParser(tokeniser);
        List<PdfObject> operands = new List<PdfObject>();
        while (ps.Parse(operands).Count > 0)
        {
            PdfLiteral oper = (PdfLiteral)operands[operands.Count - 1];
            if ("Tj".Equals(oper.ToString()))
            {
                texts.Add(getText((PdfString)operands[0]));
            }
            else if ("TJ".Equals(oper.ToString()))
            {
                string text = string.Empty;
                foreach (PdfObject entryObj in (PdfArray)operands[0])
                {
                    if (entryObj is PdfString)
                    {
                        text += getText((PdfString)entryObj);
                    }
                }
                texts.Add(text);
            }
            else if ("Tf".Equals(oper.ToString()))
            {
                PdfName fontResourceName = (PdfName)operands[0];
                float size = ((PdfNumber)operands[1]).FloatValue;

                PdfDictionary fontsDictionary = resources.GetAsDict(PdfName.FONT);
                CMapAwareDocumentFont _font;
                PdfObject fontObject = fontsDictionary.Get(fontResourceName);
                if (fontObject is PdfDictionary)
                    _font = GetFont((PdfDictionary)fontObject);
                else
                    _font = GetFont((PRIndirectReference)fontObject);

                font = _font;
            }
        }

        this.resources.Pop();

        return texts;
    }

    string getText(PdfString @in)
    {
        byte[] bytes = @in.GetBytes();
        return font.Decode(bytes, 0, bytes.Length);
    }

    private CMapAwareDocumentFont GetFont(PRIndirectReference ind)
    {
        CMapAwareDocumentFont font;
        cachedFonts.TryGetValue(ind.Number, out font);
        if (font == null)
        {
            font = new CMapAwareDocumentFont(ind);
            cachedFonts[ind.Number] = font;
        }
        return font;
    }

    private CMapAwareDocumentFont GetFont(PdfDictionary fontResource)
    {
        return new CMapAwareDocumentFont(fontResource);
    }

    private class ResourceDictionary : PdfDictionary
    {
        private IList<PdfDictionary> resourcesStack = new List<PdfDictionary>();

        virtual public void Push(PdfDictionary resources)
        {
            resourcesStack.Add(resources);
        }

        virtual public void Pop()
        {
            resourcesStack.RemoveAt(resourcesStack.Count - 1);
        }

        public override PdfObject GetDirectObject(PdfName key)
        {
            for (int i = resourcesStack.Count - 1; i >= 0; i--)
            {
                PdfDictionary subResource = resourcesStack[i];
                if (subResource != null)
                {
                    PdfObject obj = subResource.GetDirectObject(key);
                    if (obj != null) return obj;
                }
            }
            return base.GetDirectObject(key); // shouldn't be necessary, but just in case we've done something crazy
        }
    }
}

iText LocationTextExtractionStrategy / HorizontalTextExtractionStrategy splits text into single characters
