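C# iTextSharp - Code overwrites instead of appending pages

Time: 2015-01-12 16:24:50

Tags: c# pdf itextsharp

I've read a lot of posts that helped me get to where I am; I'm new to programming. My intent is to take the files in a directory, "sourceDir", and look for a regex match. When a match is found, I want to create a new file with the match as its name. If the code finds another file with the same match (so the file already exists), it should add a new page to that document instead.

Right now the code works, but instead of adding a new page it overwrites the first page of the document. NOTE: every document in the directory has only one page!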

string sourceDir = @"C:\Users\bob\Desktop\results\";
string destDir = @"C:\Users\bob\Desktop\results\final\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
    {
       using (var pdfReader = new PdfReader(file.ToString()))
            {
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    var text = new StringBuilder();

                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    var currentText = 
                    PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                    text.Append(currentText);

                    Regex reg = new Regex(@"ABCDEFG");
                    MatchCollection matches = reg.Matches(currentText);

                    foreach (Match m in matches)
                    {
                        string newFile = destDir + m.ToString() + ".pdf";

                        if (!File.Exists(newFile))
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(file)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.Create)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                        else
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(newFile)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.OpenOrCreate)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }

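2 Answers:

Answer 0: (score: 2)

Bruno did a great job explaining the problem and how to fix it, but since you've said that you're new to programming and you've further posted a very similar and related question, I'm going to go a little deeper to hopefully help you more.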

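First, let's write down the knowns:

  1. There is a directory full of PDFs
  2. Each PDF has only a single page

Then the objective:

  1. Extract the text of each PDF
  2. Compare the extracted text to a pattern
  3. If there is a match, use the match as the file name and do one of the following:
    1. If the file exists, append the source PDF to it
    2. If the file does not exist, create a new file from the PDF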
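There are a couple of things you need to know before continuing. You tried to get an "append mode" by using FileMode.OpenOrCreate. It was a good guess, but incorrect. The PDF format has both a beginning and an end, so a "start here" and an "end here". When you try to append another PDF (or anything else, for that matter) to an existing file, you are just writing past the "end here" section. At best that junk data gets ignored, but more likely you'll end up with a corrupt PDF. The same is true of almost any file format. Two concatenated XML files are invalid because an XML document can only have one root element.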
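The XML comparison is easy to check without any PDF library. Here is a minimal sketch that uses only the .NET base class library; the ConcatDemo class and its method name are made up for illustration:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class ConcatDemo
{
    // Returns true if the text parses as one well-formed XML document.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Raised, among other things, when there is more than one root element
            return false;
        }
    }
}
```

Calling ConcatDemo.IsWellFormedXml("<a/>") returns true, while the two documents glued together, ConcatDemo.IsWellFormedXml("<a/><a/>"), returns false. Appending bytes to a finished PDF breaks it for the same kind of reason.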

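Second, but related: iText/iTextSharp cannot edit an existing file. This is very important. What it can do is create brand new files that happen to contain exact, or possibly modified, versions of other files. I don't know if I can stress enough how important this is.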

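Third, you are using a line that gets copied around over and over but is very wrong and can actually corrupt your data. For why it's bad, read this answer: https://stackoverflow.com/a/10191879/231316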

      currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
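To make the damage visible, here is a small sketch of the same round trip. It uses Encoding.ASCII as a stand-in for Encoding.Default, because Encoding.Default depends on the runtime (the machine's ANSI code page on .NET Framework, UTF-8 on .NET Core and later). That substitution is my assumption so the result is the same everywhere, and the EncodingDemo name is hypothetical:

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingDemo
{
    // Same shape as the copied line, with a fixed legacy encoding
    // substituted for Encoding.Default (an assumption, see above).
    public static string RoundTrip(string s)
    {
        Encoding legacy = Encoding.ASCII;

        // GetBytes replaces every character the legacy encoding cannot
        // represent with '?', so that information is lost at this step.
        byte[] legacyBytes = legacy.GetBytes(s);

        return Encoding.UTF8.GetString(
            Encoding.Convert(legacy, Encoding.UTF8, legacyBytes));
    }
}
```

EncodingDemo.RoundTrip("café") comes back as "caf?". The extracted text was already a correct .NET string, so the round trip can only lose characters, never repair them.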
      

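Fourth, you are using RegEx, which is an overly complicated way to perform this search. Maybe the code you posted was just a sample, but if it wasn't, I'd recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf("", StringComparison.InvariantCultureIgnoreCase). To give you the benefit of the doubt, the code below assumes that you have a more complex RegEx.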
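For a fixed search term, the two suggestions could be wrapped like this. It is a sketch only; the TextSearch class and its method names are mine, not part of any library:

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class TextSearch
{
    // Case-sensitive search for a fixed term
    public static bool ContainsExact(string text, string term)
    {
        return text.Contains(term);
    }

    // Case-insensitive search; IndexOf returns -1 when the term is absent
    public static bool ContainsIgnoreCase(string text, string term)
    {
        return text.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;
    }
}
```

Note that IndexOf returns a position, not a bool, so the result has to be compared against 0 before it can be used in an if.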

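With all that said, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this, but we need common ground to work from. The third section is the search and merge feature that you are trying to build. Hopefully the comments in the code explain everything.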

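      //Namespaces this snippet needs: System, System.IO, System.Text.RegularExpressions,
      //iTextSharp.text, iTextSharp.text.pdf and iTextSharp.text.pdf.parser

      /**
       * Step 1 - Variable Setup
       */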
      
      //This is the folder that we'll be basing all other directory paths on
      var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
      
      //This folder will hold our PDFs with text that we're searching for
      var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");
      
      //This folder will hold the combined PDFs, one file per matched string
      var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");
      
      //Create our directories if they don't already exist
      System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
      System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);
      
      var searchText1 = "ABC";
      var searchText2 = "DEF";
      
      /**
       * Step 2 - Create sample PDFs
       */
      
      //Create 100 sample PDFs
      for (var i = 0; i < 100; i++) {
          using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
              using (var doc = new Document()) {
                  using (var writer = PdfWriter.GetInstance(doc, fs)) {
                      doc.Open();
      
                      //Add a title so we know what page we're on when we combine
                      doc.Add(new Paragraph(String.Format("This is page {0}", i)));
      
                      //Add various strings every once in a while.
                      //(Yes, I know this isn't evenly distributed but I haven't
                      // had enough coffee yet.)
                      if (i % 10 == 3) {
                          doc.Add(new Paragraph(searchText1));
                      } else if (i % 10 == 6) {
                          doc.Add(new Paragraph(searchText2));
                      } else if (i % 10 == 9) {
                          doc.Add(new Paragraph(searchText1 + searchText2));
                      } else {
                          doc.Add(new Paragraph("Blah blah blah"));
                      }
      
                      doc.Close();
                  }
              }
          }
      }
      
      /**
       * Step 3 - Search and merge
       */
      
      
      //We'll search for two different strings just to add some spice
      var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");
      
      //Loop through each file in the directory
      foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
          using (var pdfReader = new PdfReader(filePath)) {
              for (var page = 1; page <= pdfReader.NumberOfPages; page++) {
      
                  //Get the text from the page
                  var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());
      
                  //DO NOT DO THIS EVER!! See this for why https://stackoverflow.com/a/10191879/231316
                  //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
      
                  //Match our pattern against the extracted text
                  var matches = reg.Matches(currentText);
      
                  //Bail early if we can
                  if (matches.Count == 0) {
                      continue;
                  }
      
                  //Loop through each match
                  foreach (var m in matches) {
      
                      //This is the file path that we want to target
                      var destFile = Path.Combine(folderPathContainingPdfsCombined, m.ToString() + ".pdf");
      
                      //If the file doesn't already exist then just copy the file and move on
                      if (!File.Exists(destFile)) {
                          System.IO.File.Copy(filePath, destFile);
                          continue;
                      }
      
                      //The file exists so we're going to "append" the page
                      //However, writing to the end of file in Append mode doesn't work,
                      //that would be like "add a file to a zip" by concatenating
                      //two files. In this case, we're actually creating a brand new file
                      //that "happens" to contain the original file and the matched file.
                      //Instead of writing to disk for this new file we're going to keep it
                      //in memory, delete the original file and write our new file
                      //back onto the old file
                      using (var ms = new MemoryStream()) {
      
                          //Use a wrapper helper provided by iText
                          var cc = new PdfConcatenate(ms);
      
                          //Open for writing
                          cc.Open();
      
                          //Import the existing file
                          using (var subReader = new PdfReader(destFile)) {
                              cc.AddPages(subReader);
                          }
      
                          //Import the matched file
                          //The OP stated a guarantee of only 1 page so we don't
                          //have to mess around with specifying which page to import.
                          //Also, PdfConcatenate closes the supplied PdfReader, so
                          //open a fresh reader here instead of reusing the pdfReader
                          //variable that the outer loop is still working with.
                          using (var subReader = new PdfReader(filePath)) {
                              cc.AddPages(subReader);
                          }
      
                          //Close for writing
                          cc.Close();
      
                          //Erase our existing file
                          File.Delete(destFile);
      
                          //Write our new file
                          File.WriteAllBytes(destFile, ms.ToArray());
                      }
                  }
              }
          }
      }
      

Answer 1: (score: 0)

I'm going to write this in pseudo code.

You are doing something like this:

// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}

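This means that you create a single-page PDF every time the condition is met. Incidentally, you are overwriting existing files many times.

What you should do is something like this: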

// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();

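Now you are using PdfCopy to add possibly multiple pages to the one new document you are creating. Beware: an exception can be thrown if the condition is never met, because the document would then be closed without any pages.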
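One way to get from the first shape to the second when there are several possible output files is to decide which source files belong to which output before any PdfCopy is opened. The following is only a sketch of that planning step; it uses no iTextSharp calls, takes already extracted text as input, and the MergePlanner name is hypothetical:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class MergePlanner
{
    // Maps each matched string (the future output file name) to the
    // source files whose text contains it. A source file is listed
    // once per match it produces.
    public static Dictionary<string, List<string>> Plan(
        IEnumerable<KeyValuePair<string, string>> textBySourceFile,
        Regex pattern)
    {
        var plan = new Dictionary<string, List<string>>();

        foreach (var entry in textBySourceFile)
        {
            foreach (Match m in pattern.Matches(entry.Value))
            {
                List<string> sources;
                if (!plan.TryGetValue(m.Value, out sources))
                {
                    sources = new List<string>();
                    plan[m.Value] = sources;
                }
                sources.Add(entry.Key);
            }
        }

        return plan;
    }
}
```

Each key in the result then gets exactly one Document and one PdfCopy: open once, add a page per listed source file, close once. A string that never matched has no key, so no empty document is ever opened and closed.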