PDF to text conversion: multi-line rows that continue on the next page

Asked: 2020-09-15 07:16:21

Tags: c# itext

My PDF content looks like this:

Page 1:

Date          Item                     IN          OUT       
17-Oct        Electrical Fan           -           38        
              with RF895 cable
              model XO-8745
              56148
       
17-Oct        Electrical Iron           77          -      
              with ring
              model X12358
              78418
              newline 
:
:
:
17-Oct        Electrical Fan            77          -    

    Note: This receipt is computer generated and no signature is required 

Page 2:

Date          Item                     IN          OUT               
              with RF895 cable
              model XO-8745
              56148

17-Oct        Electrical Iron           -          100      
              with ring
              model 54789

              XP-859
              newline 
:
:
:
17-Oct        Electrical Iron           17          -      
              with ring
              
    Note: This receipt is computer generated and no signature is required 

Page 3:

Date          Item                     IN          OUT       
              model X12358
              56148
   
17-Oct        Electrical Fan           -           38        
              with RF895 cable
              model XO-8745
              56148
:
:
:
17-Oct        Electrical Fan           108          -        
              with RF895 cable
              model XO-8745
              56148


    Note: This receipt is computer generated and no signature is required   

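I am using iTextSharp to merge the lines of each item into a single row and write that row to Excel. Because iTextSharp extracts the PDF one page at a time, I cannot get the rows I want when the remaining lines of an item continue on the next page. The code is as follows: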

// Namespaces this fragment relies on (iTextSharp 5.x):
// using System; using System.Data; using System.IO; using System.Linq;
// using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;

if (File.Exists(theFile.FullName))
{
    Console.Write(++count + " " + theFile.FullName);
    PdfReader pdfReader = new PdfReader(theFile.FullName);
    try
    {
        DataTable finalTbl = GetTable();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // A fresh strategy per page: text is extracted one page at a time.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // Convert the page to text
            using (StringReader reader = new StringReader(currentText))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Split the line into whitespace-separated tokens.
                    string[] splittedTxt = line.Split(new[] { " " },
                        StringSplitOptions.RemoveEmptyEntries);
                    if (splittedTxt.Any())
                    {
                        // create a table
                    }
                    // finalTbl.Rows.Add(...); // add the desired row to the DataTable
                }
            }
        }
    }
    finally
    {
        // Always release the file handle, even if extraction fails.
        pdfReader.Close();
    }
}

The result I get:

17-Oct        Electrical Fan with RF895 cable model XO-8745 56148
17-Oct        Electrical Iron with ring model X12358 78418 newline 
17-Oct        Electrical Fan 
17-Oct        Electrical Iron with ring model 54789  XP-859 newline 
17-Oct        Electrical Iron with ring
17-Oct        Electrical Fan with RF895 cable model XO-8745 56148
17-Oct        Electrical Iron with ring model X12358
17-Oct        Electrical Fan with RF895 cable model XO-8745 56148
17-Oct        Electrical Fan with RF895 cable model XO-8745 56148

Is there a way to read and merge all pages first, before creating the DataTable?
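
One possible direction, sketched here under assumptions and not tested against the real PDF: collect the text of every page first, drop the lines that repeat on each page, and only then group the remaining lines into records. The sketch assumes that a record always starts with a date such as "17-Oct", and that the repeated lines start with "Date" (the column header) or "Note:" (the footer). The class name ReceiptLineMerger and the method MergeRecords are hypothetical names made up for illustration; they are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReceiptLineMerger
{
    // Assumption: every record begins with a date like "17-Oct".
    private static readonly Regex RecordStart = new Regex(@"^\d{1,2}-[A-Za-z]{3}\b");

    // Assumption: the column header and the footer note repeat on every page
    // and carry no record data.
    private static bool IsPageNoise(string line)
    {
        return line.StartsWith("Date", StringComparison.Ordinal)
            || line.StartsWith("Note:", StringComparison.Ordinal);
    }

    // pageTexts: the extracted text of each page, in page order.
    // Returns one merged string per record; a record that spills onto the
    // next page is continued because page boundaries are ignored here.
    public static List<string> MergeRecords(IEnumerable<string> pageTexts)
    {
        var records = new List<string>();
        foreach (string pageText in pageTexts)
        {
            foreach (string raw in pageText.Split('\n'))
            {
                // Collapse runs of whitespace (including a trailing '\r').
                string line = Regex.Replace(raw, @"\s+", " ").Trim();
                if (line.Length == 0 || IsPageNoise(line))
                    continue;

                if (RecordStart.IsMatch(line) || records.Count == 0)
                    records.Add(line);                       // start of a new record
                else
                    records[records.Count - 1] += " " + line; // continuation line
            }
        }
        return records;
    }
}
```

To use it with the loop above, each page's currentText would be added to a List<string> inside the for loop, and MergeRecords would be called once after the loop, before any rows are added to finalTbl. The IN and OUT values stay inside the merged string, so splitting them into columns is still left to the existing DataTable code.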

0 answers:

No answers yet