Hello, the content of my PDF is laid out as follows:
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
When I extract the content with iTextSharp, I get the following:
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
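The fields are mixed together because the address continues on the following lines. Please suggest a way to get the content in the order I need, which is shown here. Thanks.
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous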
Answer 0 (score: 0)
The following code, which uses iTextSharp, helps with formatting the text extracted from the PDF:
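using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
// path is the input PDF; outfile is the text file that receives the output
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    // Convert the text from the system default encoding to UTF-8
    tt = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // Append the page text to the output file
    File.AppendAllText(outfile, tt, Encoding.UTF8);
}
reader.Close();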
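For the layout in the question, where the address runs down the left side and Form Type sits on the right, one possible approach is to extract each side of the page separately with a region filter and then join the results. The sketch below assumes iTextSharp 5.x. The `ColumnExtractor` class, its method names, and the use of the page midpoint as the column boundary are illustrative assumptions, not something taken from the original post, so the split position would need adjusting to the real document.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper (not part of iTextSharp): reads the left and right
// halves of a page separately so the two columns are not interleaved.
public static class ColumnExtractor
{
    // Extracts only the text whose position falls inside the given region.
    public static string ExtractRegion(PdfReader reader, int page, Rectangle region)
    {
        RenderFilter[] filters = { new RegionTextRenderFilter(region) };
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }

    // Flattens the line breaks inside each column, then joins left before right.
    public static string JoinColumns(string left, string right)
    {
        char[] breaks = { '\r', '\n' };
        string l = string.Join(" ", left.Split(breaks, StringSplitOptions.RemoveEmptyEntries));
        string r = string.Join(" ", right.Split(breaks, StringSplitOptions.RemoveEmptyEntries));
        return (l + " " + r).Trim();
    }

    public static string ExtractPage(PdfReader reader, int page)
    {
        Rectangle size = reader.GetPageSize(page);
        // Assumption: the column boundary is the horizontal midpoint of the page.
        float splitX = size.Left + size.Width / 2;
        Rectangle leftHalf = new Rectangle(size.Left, size.Bottom, splitX, size.Top);
        Rectangle rightHalf = new Rectangle(splitX, size.Bottom, size.Right, size.Top);
        return JoinColumns(ExtractRegion(reader, page, leftHalf),
                           ExtractRegion(reader, page, rightHalf));
    }
}
```

Calling `ColumnExtractor.ExtractPage(reader, page)` inside the page loop in place of the plain `GetTextFromPage` call should give the address lines followed by the form type, provided the two fields really do sit on opposite sides of the chosen boundary.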
Answer 1 (score: 0)
I use the helper class below to convert a PDF to a text file. It works for me. If anyone needs the complete desktop application, see this GitHub repository: https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
            //CustomFontFilter fontFilter = new CustomFontFilter(rect);
            FilteredEventListener listener = new FilteredEventListener();
            // Create a text extraction renderer (no filter is attached, so all text is kept)
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());
            // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
            // Only the first page of the document is processed here
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());
            // Get the resultant text collected by the extraction strategy
            String actualText = extractionStrategy.GetResultantText();
            pdfDoc.Close();
            return actualText;
        }
    }
}
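The commented-out filter in the helper above hints at restricting extraction to a rectangle. A minimal sketch of that idea, using iText 7's built-in `TextRegionEventFilter` and covering every page rather than only the first, might look like this. `RegionPdfHelper`, `ExtractRegionText`, and `LeftHalf` are hypothetical names introduced for illustration, and treating the left half of the page as the region of interest is an assumption about the layout.

```csharp
using System.Text;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class RegionPdfHelper
{
    // Assumption: the wanted column occupies the left half of the page.
    // iText 7 rectangles are built from (x, y, width, height).
    public static Rectangle LeftHalf(Rectangle pageSize)
    {
        return new Rectangle(pageSize.GetLeft(), pageSize.GetBottom(),
                             pageSize.GetWidth() / 2, pageSize.GetHeight());
    }

    public static string ExtractRegionText(string filePath)
    {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
        try
        {
            StringBuilder text = new StringBuilder();
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                PdfPage page = pdfDoc.GetPage(i);
                // A new strategy per page, so text does not accumulate across pages.
                ITextExtractionStrategy strategy = new FilteredTextEventListener(
                    new LocationTextExtractionStrategy(),
                    new TextRegionEventFilter(LeftHalf(page.GetPageSize())));
                text.AppendLine(PdfTextExtractor.GetTextFromPage(page, strategy));
            }
            return text.ToString();
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfDoc.Close();
        }
    }
}
```

Running the same loop a second time with a rectangle for the right half would give the other column, and the two results can then be concatenated in whichever order is needed.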