Question

Hi, my PDF content is laid out as follows:

Property Address: 123 Door         Form Type: Miscellaneous
                  ABC City
                  Pin - XXX

When I extract the content with iTextSharp, I get the following:

Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX

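The data gets mixed up because the address continues onto the following lines. Please suggest a possible way to get the content in the required order, i.e.: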

Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous

Answer 1

The following iTextSharp code helps to extract the text of the PDF:

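using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// path: input PDF, outfile: output text file (both assumed to be defined elsewhere)
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    // Round-trips the text through the system default code page; characters outside it may be lost
    tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // AppendAllText takes a single string; AppendAllLines expects a sequence of lines
    File.AppendAllText(outfile, tt, Encoding.UTF8);
}
reader.Close();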
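
The code above reads the whole page at once, so text from the two columns can still end up interleaved. One possible way to get the order asked for in the question is to extract each column separately through a region filter. The following is only a sketch, not tested against the asker's file: the class name ColumnExtractor, its method names, and the decision to split the page at half its width are hypothetical. The real column boundary has to be measured from the actual PDF.

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class ColumnExtractor
{
    // Extracts only the text that the region filter lets through.
    public static string ExtractRegion(PdfReader reader, int page, Rectangle region)
    {
        RenderFilter filter = new RegionTextRenderFilter(region);
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }

    // Reads the left column first, then the right column.
    // ASSUMPTION: the columns are separated at half the page width.
    public static string ExtractColumns(PdfReader reader, int page)
    {
        Rectangle size = reader.GetPageSize(page);
        float middle = size.Left + size.Width / 2f;

        // iTextSharp rectangles are (lower-left x, lower-left y, upper-right x, upper-right y)
        Rectangle left = new Rectangle(size.Left, size.Bottom, middle, size.Top);
        Rectangle right = new Rectangle(middle, size.Bottom, size.Right, size.Top);

        return ExtractRegion(reader, page, left) + "\n" + ExtractRegion(reader, page, right);
    }
}
```

Calling ColumnExtractor.ExtractColumns(reader, page) inside the loop instead of PdfTextExtractor.GetTextFromPage should return the address lines before the form type. One caveat: if the PDF draws a whole line as a single text chunk that crosses the boundary, that chunk may show up in both regions.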

Answer 2

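I am using the helper class below to convert a PDF to a text file. It works for me. If anyone needs the complete desktop application, please refer to this GitHub repository: https://github.com/Kithuldeniya/PDFReader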

using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;

namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));

            //CustomFontFilter fontFilter = new CustomFontFilter(rect);
            FilteredEventListener listener = new FilteredEventListener();

            // Create a text extraction renderer
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());

            // Note: only the first page is processed here. To re-use a PdfCanvasProcessor, call PdfCanvasProcessor.Reset() first
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());

            // Get the resultant text (no filter is attached above, so all text on the page is returned)
            String actualText = extractionStrategy.GetResultantText();

            pdfDoc.Close();

            return actualText;

        }
    }
}
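
The helper above attaches the strategy without any filter, so it returns the page text in one pass. To get the left column before the right column, as the question asks, a region filter could be attached in the same place. The following is a minimal sketch under stated assumptions, not the repository's code: the class name PdfColumnHelper, its method names, and the split at half the page width are hypothetical, and the real column boundary depends on the document.

```csharp
using System.Text;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, not part of iText.
public static class PdfColumnHelper
{
    // Extracts only the text that the region filter lets through.
    public static string ExtractRegion(PdfPage page, Rectangle region)
    {
        FilteredEventListener listener = new FilteredEventListener();
        LocationTextExtractionStrategy strategy = listener.AttachEventListener(
            new LocationTextExtractionStrategy(), new TextRegionEventFilter(region));
        // A fresh processor per call, so no Reset() is needed
        new PdfCanvasProcessor(listener).ProcessPageContent(page);
        return strategy.GetResultantText();
    }

    // Walks every page and reads the left half before the right half.
    // ASSUMPTION: the columns are separated at half the page width.
    public static string ExtractAllPages(string filePath)
    {
        StringBuilder text = new StringBuilder();
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
        try
        {
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                PdfPage page = pdfDoc.GetPage(i);
                Rectangle size = page.GetPageSize();
                float half = size.GetWidth() / 2f;

                // iText 7 rectangles are (x, y, width, height)
                Rectangle left = new Rectangle(size.GetLeft(), size.GetBottom(), half, size.GetHeight());
                Rectangle right = new Rectangle(size.GetLeft() + half, size.GetBottom(), half, size.GetHeight());

                text.AppendLine(ExtractRegion(page, left));
                text.AppendLine(ExtractRegion(page, right));
            }
        }
        finally
        {
            pdfDoc.Close();
        }
        return text.ToString();
    }
}
```

PdfColumnHelper.ExtractAllPages(filePath) can then be called the same way as ManipulatePdf. Text that straddles the assumed boundary may be reported in both halves, so the split position is worth checking against the actual layout.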

C# PDF to text where a value spans multiple lines
