Question

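I am using the method below to extract PDF text line by line. The problem is that it does not pick up the spaces between words and figures. What could be the solution for this?

I just want to create a list of strings in which each string holds one line of text from the PDF, exactly as it appears in the PDF, including spaces.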

public void readtextlinebyline(string filename)   {


        List<string> strlist = new List<string>();
        PdfReader reader = new PdfReader(filename);
        string text = string.Empty;
        for (int page = 1; page <= 1; page++)
        {

            text += PdfTextExtractor.GetTextFromPage(reader, page ,new LocationTextExtractionStrategy())+" ";

        }
        reader.Close();
        string[] words = text.Split('\n');
        foreach (string word in words)
        {
            strlist.Add(word);
        }

        foreach (string st in strlist)
        {
            Response.Write(st +"<br/>");
        }

   }

I have already tried changing the strategy to SimpleTextExtractionStrategy, but that does not work for me either.

Answer 1

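The background on why iText(Sharp) and other PDF text extractors sometimes fail to recognize spaces between words has been explained in this answer to "itext java pdf to text creation": these 'spaces' are not necessarily created with a space character, but with an operation that creates a small gap. Such operations are also used for other purposes (which do not break words), though, so a text extractor has to use heuristics to decide whether a given gap is a word break or not...

In particular, this means you will never get 100% reliable word break detection.

What you can do, though, is improve the heuristics used.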

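The standard iText and iTextSharp text extraction strategies, for example, assume a word break in a line if

a) there is a space character, or

b) there is a gap at least as wide as half a space character.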

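Item a is a sure hit, but item b may often fail for densely set text. The OP of the question answered in the post referenced above got quite good results using a quarter of the width of a space character instead.

You can tweak these criteria by copying and changing the text extraction strategy of your choice.

In SimpleTextExtractionStrategy you find this criterion embedded in the renderText method: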

if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}

In LocationTextExtractionStrategy, this criterion has meanwhile been moved into a method of its own:

/**
 * Determines if a space character should be inserted between a previous chunk and the current chunk.
 * This method is exposed as a callback so subclasses can fine tune the algorithm for determining whether a space should be inserted or not.
 * By default, this method will insert a space if the there is a gap of more than half the font space character width between the end of the
 * previous chunk and the beginning of the current chunk.  It will also indicate that a space is needed if the starting point of the new chunk 
 * appears *before* the end of the previous chunk (i.e. overlapping text).
 * @param chunk the new chunk being evaluated
 * @param previousChunk the chunk that appeared immediately before the current chunk
 * @return true if the two chunks represent different words (i.e. should have a space between them).  False otherwise.
 */
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if(dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}

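The intent of putting this into its own method was that you would only need to subclass the strategy and override this method to adjust the heuristic criteria. This works fine for the equivalent iText Java class, but during the port to iTextSharp, unfortunately, no virtual was added to the declaration (as of version 5.4.4). Thus, for iTextSharp, copying the whole strategy is currently still necessary.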

@Bruno You might want to tell the iText -> iTextSharp porting team about this.

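While you can fine-tune text extraction at these code locations, you should be aware that you will not find a 100% reliable criterion here. Some reasons are:

Gaps between words in densely set text can be smaller than kerning or other gaps inserted inside words for some optical effect. Thus, there is no one-size-fits-all factor here.
In PDFs that do not use the space character at all (which is possible, since you can always use gaps instead), the "width of a space character" might be some random value or not determinable at all!
There are funny PDFs that abuse the space character width (which can be stretched individually at any time for the operations that follow) to do some tabular formatting while using gaps for word breaks. In such a PDF, the current width of a space character cannot seriously be used to determine word breaks.
Sometimes you find single words in a line printed spaced out for emphasis. Most heuristics will likely parse these as a collection of one-letter words.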

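You can do better than the iText heuristics, and than variants of them that merely use other constants, by taking into account the actual visual free space between all characters (using PDF rendering or font information analysis mechanisms), but for a perceivable improvement you have to invest a lot of time.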
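To make the effect of the factor concrete, here is a minimal, self-contained sketch of the gap test with the "half a space" constant turned into a parameter. It deliberately does not use iTextSharp: Chunk, GapHeuristic, JoinLine and the factor parameter are hypothetical names made up for this illustration, and the model assumes all chunks sit on one horizontal line and are already sorted left to right. In a copied strategy, the same comparison would go where the quoted criterion is.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; NOT an iTextSharp type.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }      // x coordinate where the chunk starts
    public float EndX { get; }        // x coordinate where the chunk ends
    public float SpaceWidth { get; }  // width of a space in the chunk's font

    public Chunk(string text, float startX, float endX, float spaceWidth)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
        SpaceWidth = spaceWidth;
    }
}

public static class GapHeuristic
{
    // Same shape as the IsChunkAtWordBoundary test quoted above, but the
    // fixed divisor 2.0f is replaced by an adjustable factor (assumption).
    public static bool IsWordBoundary(Chunk chunk, Chunk previous, float factor)
    {
        float dist = chunk.StartX - previous.EndX;
        return dist < -chunk.SpaceWidth || dist > chunk.SpaceWidth * factor;
    }

    // Joins the chunks of one line, inserting a space wherever the
    // heuristic reports a word boundary.
    public static string JoinLine(IReadOnlyList<Chunk> chunks, float factor)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunks.Count; i++)
        {
            if (i > 0 && IsWordBoundary(chunks[i], chunks[i - 1], factor))
                sb.Append(' ');
            sb.Append(chunks[i].Text);
        }
        return sb.ToString();
    }
}
```

For two chunks "Total" (ending at x = 25) and "42" (starting at x = 26.25) with a space width of 3, JoinLine with a factor of 0.5f yields "Total42", because the gap of 1.25 is below the threshold of 1.5, while a factor of 0.25f yields "Total 42". The same lower factor makes it more likely that kerning gaps inside a word are mistaken for word breaks, which is the trade-off described in the reasons above.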

Answer 2

I have my own implementation, and it works very well.

    /// <summary>
    /// Read a PDF file and returns the string content.
    /// </summary>
    /// <param name="par">ByteArray, MemoryStream or URI</param>
    /// <returns>FileContent.</returns>
    public static string ReadPdfFile(object par)
    {
        if (par == null) throw new ArgumentNullException("par");

        PdfReader pdfReader = null;
        var text = new StringBuilder();

        if (par is MemoryStream)
            pdfReader = new PdfReader((MemoryStream)par);
        else if (par is byte[])
            pdfReader = new PdfReader((byte[])par);
        else if (par is Uri)
            pdfReader = new PdfReader((Uri)par);

        if (pdfReader == null)
            throw new InvalidOperationException("Unable to read the file.");

        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            var strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
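            // Note: a .NET string is already Unicode, so this round trip through Encoding.Default
            // repairs nothing; it is either a no-op or replaces characters that the default
            // code page cannot represent.
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));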
            text.Append(currentText);
        }

        pdfReader.Close();

        return text.ToString();
    }

Answer 3

// Needs: using System; using System.Text;
//        using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder textfinal = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Extract each page only once, then split it into lines.
        string page = PdfTextExtractor.GetTextFromPage(reader, i);
        string[] lines = page.Split('\n');
        foreach (string line in lines)
        {
            // Each element is one text line of the page, spaces included.
            textfinal.Append(line);
            textfinal.Append(Environment.NewLine);
        }
    }
    // textfinal now holds the text of all pages, one PDF text line per line.
}

How can we extract text from a PDF with spaces using iTextSharp?
