
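I have done this kind of project many times before.

What you need to do:

1.) Check out the project Extract Text from PDF in C#. It uses iTextSharp.

It is best to download the sample project and see how it works. The project shows how to extract data from a PDF. Look at the PDFParser class: it has a function named ExtractTextFromPDFBytes(byte[] input), and from that function you can see how text is extracted from an uncompressed PDF content stream. Don't forget to reference the iTextSharp DLL.

PDFParser class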

    using System;
    using System.IO;
    using iTextSharp.text.pdf;

    namespace PdfToText
    {
        /// <summary>
        /// Parses a PDF file and extracts the text from it.
        /// </summary>
        public class PDFParser
        {
            /// BT = Beginning of a text object operator
            /// ET = End of a text object operator
            /// Td move to the start of next line
            ///  5 Ts = superscript
            /// -5 Ts = subscript

            #region Fields

            #region _numberOfCharsToKeep
            /// <summary>
            /// The number of characters to keep, when extracting text.
            /// </summary>
            private static int _numberOfCharsToKeep = 15;
            #endregion

            #endregion

            #region ExtractText
            /// <summary>
            /// Extracts a text from a PDF file.
            /// </summary>
            /// <param name="inFileName">the full path to the pdf file.</param>
            /// <param name="outFileName">the output file name.</param>
            /// <returns>whether the extraction succeeded</returns>
            public bool ExtractText(string inFileName, string outFileName)
            {
                StreamWriter outFile = null;
                try
                {
                    // Create a reader for the given PDF file
                    PdfReader reader = new PdfReader(inFileName);
                    //outFile = File.CreateText(outFileName);
                    outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                    Console.Write("Processing: ");

                    int     totalLen    = 68;
                    float   charUnit    = ((float)totalLen) / (float)reader.NumberOfPages;
                    int     totalWritten= 0;
                    float   curUnit     = 0;

                    // [Roughly 50 source lines were lost when this page was extracted.
                    //  Surviving fragments: a loop starting "for (int page = 1; page",
                    //  comparisons against 1.0f, and inner loops starting
                    //  "for (int i = 0; i". The rest of ExtractText is missing here;
                    //  see the original sample project for the full method.]

            /// <summary>
            /// This method processes an uncompressed Adobe (text) object
            /// and extracts text.
            /// </summary>
            /// <param name="input">uncompressed</param>
            /// <returns></returns>
            private string ExtractTextFromPDFBytes(byte[] input)
            {
                if (input == null || input.Length == 0) return "";

                try
                {
                    string resultString = "";

                    // Flag showing whether we are currently inside a text object
                    bool inTextObject = false;

                    // Flag showing if the next character is literal
                    // e.g. '\\' to get a '\' character or '\(' to get '('
                    bool nextLiteral = false;

                    // () Bracket nesting level. Text appears inside ()
                    int bracketDepth = 0;

                    // Keep previous chars to extract numbers etc.:
                    char[] previousCharacters = new char[_numberOfCharsToKeep];

                    // [Roughly 100 source lines were lost when this page was extracted.
                    //  Surviving fragments: a loop starting "for (int j = 0; j" and a
                    //  range test comparing a character c against ' ' and 128.
                    //  The rest of ExtractTextFromPDFBytes is missing here.]

            /// <summary>
            /// Check if a certain 2 character token just came along (e.g. BT)
            /// </summary>
            /// <param name="tokens">the searched token</param>
            /// <param name="recent">the recent character array</param>
            /// <returns></returns>
            private bool CheckToken(string[] tokens, char[] recent)
            {
                foreach(string token in tokens)
                {
                    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                        (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
                return false;
            }
            #endregion
        }
    }
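
To make the idea behind ExtractTextFromPDFBytes easier to follow, here is a much smaller sketch of the same technique. It is not the project's code: the class and method names (ContentStreamSketch, ExtractLiteralText) are made up for illustration, and it only collects the literal strings that appear between parentheses in an uncompressed content stream. It ignores BT/ET, positioning operators, hex strings, octal escapes and font encodings, so treat it as a way to understand the approach rather than something to ship.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the sample project.
public static class ContentStreamSketch
{
    // Collects the text of "( ... )" literal strings in a content stream.
    public static string ExtractLiteralText(string content)
    {
        var result = new StringBuilder();
        int depth = 0;          // nesting level of unescaped parentheses
        bool escaped = false;   // true right after a backslash inside a string

        foreach (char c in content)
        {
            if (depth == 0)
            {
                // Outside a string: operators and operands are skipped.
                if (c == '(') depth = 1;
                continue;
            }

            if (escaped)
            {
                // Only \( \) and \\ are kept; other escapes are dropped
                // in this simplified version.
                if (c == '(' || c == ')' || c == '\\') result.Append(c);
                escaped = false;
            }
            else if (c == '\\')
            {
                escaped = true;
            }
            else if (c == '(')
            {
                depth++;
                result.Append(c);   // balanced inner parenthesis is text
            }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) result.Append(c);
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

For example, ContentStreamSketch.ExtractLiteralText("BT /F1 12 Tf (Hello) Tj ET") returns "Hello". With iTextSharp, the input would come from decoding the bytes returned by reader.GetPageContent(page), which is the call the sample project relies on as far as I remember.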

2.) Parse the extracted text and create an XML file.

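One concern I had in the past was PDFs that contain broken links or URLs inside their pages. If that worries you too, regular expressions can solve it fairly easily, but I suggest you deal with that later.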
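
As a rough illustration of the regular-expression idea, the sketch below only locates URLs in extracted text. What counts as a "broken" link depends on your PDFs, so the pattern, the class name UrlFinder and the trimming rule are all assumptions you would adapt.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UrlFinder
{
    // Deliberately simple: http(s) scheme, then everything up to
    // whitespace, a quote, an angle bracket or a closing parenthesis.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s<>""')]+", RegexOptions.IgnoreCase);

    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            // Trailing sentence punctuation is usually not part of the URL.
            urls.Add(m.Value.TrimEnd('.', ',', ';'));
        }
        return urls;
    }
}
```

Once the URLs are located you can decide how to treat them, for example writing them to their own XML elements or validating them separately.
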
Now, here is sample code showing how to create the XML. Make sure you understand how it works so that you know how to approach your own code later.

    try {
        //XmlDataDocument sourceXML = new XmlDataDocument();
        string xmlFile = Server.MapPath("DVDlist.xml");
        // create the XML file if it does not exist
        System.Xml.XmlTextWriter writer = new System.Xml.XmlTextWriter(xmlFile, null);
        // start a new document
        writer.WriteStartDocument();
        // write a comment
        writer.WriteComment("Comments: XmlWriter Test Program");
        writer.Formatting = System.Xml.Formatting.Indented;
        writer.WriteStartElement("DVDlist");
        writer.WriteStartElement("DVD");
        writer.WriteAttributeString("ID", "1");
        // write some simple elements
        writer.WriteElementString("Title", "Tere Naam");
        writer.WriteStartElement("Starring");
        writer.WriteElementString("Actor", "Salman Khan");
        writer.WriteEndElement(); // Starring
        writer.WriteEndElement(); // DVD
        writer.WriteEndElement(); // DVDlist
        writer.Close();
    }
    catch (Exception e1) {
        Page.Response.Write(e1);
    }
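
To connect the two steps, here is a small sketch that takes text you have already extracted, one string per page, and wraps it in XML with XmlWriter. The element and attribute names (document, page, number) and the class name PdfTextXml are invented for the example; pick whatever structure your data needs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only.
public static class PdfTextXml
{
    // Wraps each page's text in a <page> element under a <document> root.
    public static string ToXml(IList<string> pageTexts)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true
        };
        var sb = new StringBuilder();

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("document");
            for (int i = 0; i < pageTexts.Count; i++)
            {
                writer.WriteStartElement("page");
                writer.WriteAttributeString("number", (i + 1).ToString());
                // WriteString escapes <, > and & for us.
                writer.WriteString(pageTexts[i]);
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

The using block makes sure the writer is flushed and closed even if something throws, which the snippet above does not guarantee. To write to a file instead of a string, pass a file path to XmlWriter.Create.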

Hope this helps :)

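You can use a PDF library such as iTextSharp to query your PDF file. Once you have the data you need, creating the XML file is easy. There is plenty of information online about creating XML files with C# and other .NET languages. If you have a specific question, just ask ;-)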

I ended up using ByteScout's PDF Extractor SDK. It works really well.

Take a look at pdf2Data: http://itextpdf.com/blog/pdf2data-extract-information-invoices-and-templates

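It converts PDF files into XML files based on a template. The template is defined with selectors that let the end user specify things like "select the table on the second page" or "select the text written in a particular font".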

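Keep in mind that I am affiliated with iText, so although my knowledge of PDF is extensive, I could be considered biased towards iText products (since I helped develop them).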

PDF to XML conversion using .NET

4 answers: