Split a PDF into separate files based on text found with a regular expression

Asked: 2018-11-23 20:20:00

Tags: regex vb.net pdf split

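I have a PDF splitter that uses ByteScout.PDFExtractor. My code searches each page for a unique identifying header of the form "TP###### SIGNED AFFIDAVIT".

Each # can be any digit from 0 to 9. I search for these headers with a regular expression, as follows: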

Dim regexPattern = "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*"
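To make the pattern's behaviour concrete, here is a minimal sketch that runs it through .NET's own `Regex` class. It assumes the headers literally contain asterisks, as the escaped `\*` in the pattern suggests, and that ByteScout's regex search treats the pattern the same way .NET does. The sample strings and the `IsHeader` helper are hypothetical and not taken from the real PDF.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HeaderPatternDemo

    ' Literal asterisks are escaped; [0-9]{6} requires exactly six digits.
    Public Const HeaderPattern As String = "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*"

    ' Hypothetical helper: True if the text contains a header.
    Public Function IsHeader(text As String) As Boolean
        Return Regex.IsMatch(text, HeaderPattern)
    End Function

    Sub Main()
        ' Hypothetical sample lines, not taken from the real PDF.
        Dim samples = {
            "*TP024331* *SIGNED AFFIDAVIT*",
            "*TP02433* *SIGNED AFFIDAVIT*",
            "TP024331 SIGNED AFFIDAVIT"
        }

        For Each sample In samples
            Console.WriteLine("{0} -> {1}", sample, IsHeader(sample))
        Next
    End Sub

End Module
```

Only the first sample matches: the second has five digits instead of six, and the third lacks the asterisks.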

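The search itself works. The problem is that it splits the document page by page, so after splitting I end up with the following files in the output directory: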

TP02433 SIGNED AFFIDAVIT 1
TP02433 SIGNED AFFIDAVIT 2
TP02433 SIGNED AFFIDAVIT 3
TP02354 SIGNED AFFIDAVIT 4
TP02354 SIGNED AFFIDAVIT 5
TP02354 SIGNED AFFIDAVIT 6 ...

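My question is: how can I change my code so that, when it finds a header such as TP02433, it keeps the pages that follow together until it finds the next TP number?

Is there a way to find "TP[0-9]{6} SIGNED AFFIDAVIT", then extract all the pages that follow and keep them together until the next unique "TP[0-9]{6} SIGNED AFFIDAVIT" is found?

The end result should look like this: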

TP02433 SIGNED AFFIDAVIT (1 - 3)
TP02354 SIGNED AFFIDAVIT (4 - 6) ?
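The grouping described above can be sketched independently of any PDF library. The sketch below assumes the per-page search results have already been collected into a list holding the header text for each page, or `Nothing` where a page has no header. `PageGroup`, `GroupPages` and the sample labels are hypothetical names introduced only for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module PageGrouping

    ' One output document: a label plus an inclusive, 1-based page range.
    Public Class PageGroup
        Public Property Label As String
        Public Property FirstPage As Integer
        Public Property LastPage As Integer

        Public Function FileName() As String
            Return String.Format("{0} ({1} - {2}).pdf", Label, FirstPage, LastPage)
        End Function
    End Class

    ' pageLabels(i) is the header found on page i + 1, or Nothing if none was found.
    Public Function GroupPages(pageLabels As IList(Of String)) As List(Of PageGroup)
        Dim groups As New List(Of PageGroup)()
        Dim current As PageGroup = Nothing

        For i = 0 To pageLabels.Count - 1
            Dim label = pageLabels(i)

            ' A new group starts on the first page, or when a different header appears.
            Dim startsNewGroup = current Is Nothing OrElse
                (label IsNot Nothing AndAlso label <> current.Label)

            If startsNewGroup Then
                current = New PageGroup With {
                    .Label = If(label, "UNKNOWN"),
                    .FirstPage = i + 1,
                    .LastPage = i + 1
                }
                groups.Add(current)
            Else
                ' Same header repeated, or no header: the page belongs to the current group.
                current.LastPage = i + 1
            End If
        Next

        Return groups
    End Function

    Sub Main()
        Dim labels = New String() {
            "TP02433 SIGNED AFFIDAVIT", Nothing, Nothing,
            "TP02354 SIGNED AFFIDAVIT", Nothing, Nothing
        }

        For Each g In GroupPages(labels)
            Console.WriteLine(g.FileName())
        Next
    End Sub

End Module
```

For the six sample pages this prints one file name for pages 1 to 3 and one for pages 4 to 6. Pages that precede the first header fall into an "UNKNOWN" group, mirroring the default label in the existing code.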

Here is my working code so far:

Imports System.IO
Imports Bytescout.PDFExtractor
Imports Microsoft.Office.Interop
Imports System.IO.Path
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim unmerged = Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Tesspdf")

        Dim pdfFile As String = "G:\Word\Department Folders\Pre-Suit\Xavier\MPOP.pdf"

        Dim extractor As New TextExtractor()

        extractor.WordMatchingMode = WordMatchingMode.ExactMatch

        extractor.LoadDocumentFromFile(pdfFile)

        Dim pageCount = extractor.GetPageCount()

        Dim currentPageTypeName = "UNKNOWN"
        Dim PageTypeName = "test"
        extractor.RegexSearch = True
        Dim regexPattern = "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*"

        For i = 0 To pageCount - 1

            ' If this page carries a header, keep its cleaned-up text for the file name
            If extractor.Find(i, regexPattern, False) Then
                PageTypeName = Regex.Replace(extractor.TextFound.Text, "[^A-Za-z0-9\-/#\s]", "")
                currentPageTypeName = PageTypeName
            End If

            Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True}

                Dim pageNumber = i + 1   ' (!) page number in ExtractPage() is 1-based

                If Not Directory.Exists(unmerged) Then
                    Directory.CreateDirectory(unmerged)
                End If

                Dim outputfile = Combine(unmerged, currentPageTypeName & " " & pageNumber & ".pdf")

                splitter.ExtractPage(pdfFile, outputfile, pageNumber)

            End Using
        Next

        extractor.Dispose()

    End Sub

End Module

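I would use ExtractPageRange, but the number of pages in each document varies. So I would like to know whether this code can find the first "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*", extract every page after that header until it reaches the next "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*", and keep doing the same until the PDF file is completely split.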
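One possible way to wire this up is to remember where the current document started and write out a page range only when a different header appears. The sketch below is untested and rests on several assumptions: it takes `ExtractPageRange` to accept an input file, an output file, and a 1-based inclusive first and last page (the method is named in the question, but the signature here is not verified against the SDK), it assumes a single `DocumentSplitter` can be reused for several calls, and `SaveRange` is a hypothetical helper. The other ByteScout calls are copied from the code above.

```vbnet
Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Imports Bytescout.PDFExtractor

Module SplitByHeader

    Sub Main()
        Dim pdfFile As String = "G:\Word\Department Folders\Pre-Suit\Xavier\MPOP.pdf"
        Dim outputDir = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Tesspdf")
        Directory.CreateDirectory(outputDir)   ' does nothing if the folder already exists

        Using extractor As New TextExtractor()
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch
            extractor.RegexSearch = True
            extractor.LoadDocumentFromFile(pdfFile)

            Dim regexPattern = "\*TP[0-9]{6}\* \*SIGNED AFFIDAVIT\*"
            Dim pageCount = extractor.GetPageCount()
            Dim currentName = "UNKNOWN"
            Dim startPage = 1                  ' 1-based first page of the current document

            Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True}
                For i = 0 To pageCount - 1
                    If extractor.Find(i, regexPattern, False) Then
                        ' TextFound is the property name used in the question's code;
                        ' check it against your SDK version (it may be exposed as FoundText).
                        Dim foundName = Regex.Replace(extractor.TextFound.Text, "[^A-Za-z0-9\-/#\s]", "")

                        If foundName <> currentName Then
                            ' A different header: write out the pages collected so far.
                            If i > 0 Then
                                SaveRange(splitter, pdfFile, outputDir, currentName, startPage, i)
                            End If
                            currentName = foundName
                            startPage = i + 1
                        End If
                    End If
                Next

                ' Write out the last document once the end of the PDF is reached.
                If pageCount > 0 Then
                    SaveRange(splitter, pdfFile, outputDir, currentName, startPage, pageCount)
                End If
            End Using
        End Using
    End Sub

    ' Hypothetical helper. ASSUMPTION: ExtractPageRange(input, output, firstPage, lastPage)
    ' with 1-based, inclusive page numbers.
    Private Sub SaveRange(splitter As DocumentSplitter, pdfFile As String, outputDir As String,
                          label As String, firstPage As Integer, lastPage As Integer)
        Dim fileName = String.Format("{0} ({1} - {2}).pdf", label, firstPage, lastPage)
        splitter.ExtractPageRange(pdfFile, Path.Combine(outputDir, fileName), firstPage, lastPage)
    End Sub

End Module
```

Because a new range starts only when the header text changes, this should behave the same whether the header is printed on every page of a document or only on its first page.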

0 answers:

No answers yet