Find a word in a PDF, then return the 11 characters after that word?

Date: 2018-10-16 20:58:29

Tags: regex vb.net pdf itext wildcard

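I have code that searches every page of a PDF document for pages containing the word Data_ID.

The word appears on every other page of the PDF, and the value after it changes like so:

data_id 400M549822

data_id 400M549233

etc.

Right now my console prints a line every time it finds the string data_id, but I also want it to return the characters that follow it.

This is what I have so far: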

Imports Bytescout.PDFExtractor
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Class PageType
        Property Identifier As String
    End Class

    Sub Main()
        Dim direcory = "C:\Users\XBorja.RESURGENCE\Desktop\one main\"
        Dim pageTypes As New List(Of PageType)
        Dim ids = "data_id"
        Dim resultstring As String
        resultstring = Regex.Match(ids, "(?<=^.{1}).*(?=.{5}$)").Value

        Dim currentPageTypeName = "unknown"

        For Each inputfile As String In Directory.GetFiles(direcory)
            For i = 0 To ids.Length - 1
                pageTypes.Add(New PageType With {.Identifier = ids(i)})
            Next

            Dim extractor As New TextExtractor()
            extractor.LoadDocumentFromFile(inputfile)
            Dim pageCount = extractor.GetPageCount()

            For i = 0 To pageCount - 1
                '        ' Find the type of the current page
                '        ' If it is not present on the page, then the last one found will be used.
                For Each pt In pageTypes
                    Console.WriteLine(resultstring)
                Next
            Next
        Next
    End Sub
End Module

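resultstring is what I was trying to use with the regex, but it only works on positions inside the string data_id itself, not on what comes after it.

So how do I make it return the 10 characters that follow the word data_id (not counting the space)?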

1 answer:

Answer 0: (score: 1)

This returns 11 characters: the leading space plus the 10 characters after it.

'Dim ids = "data_id 400M549822"
Dim ids = "data_id 400M549233"
Dim resultstring = Regex.Match(ids, "(?<=data_id)(\s\w{10})$").Value
Console.WriteLine(resultstring)

Output:

 400M549233

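Some notes:

(?<=...) = positive lookbehind: data_id must come immediately before the match but is not part of it
\s = one whitespace character
\w{10} = 10 word characters, which include A-Z, a-z, 0-9, and _
$ = the match must end at the end of the string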
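
The question asks for the 10 characters without the space, and for every occurrence rather than only one. A minimal sketch of one way to do that is to move the whitespace into the lookbehind and iterate over Regex.Matches. The pageText value below is made-up sample text standing in for whatever the PDF extractor returns for a page, and the module name is only illustrative. RegexOptions.IgnoreCase is an assumption, added because the question writes both Data_ID and data_id.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DataIdSketch
    Sub Main()
        ' Hypothetical page text; in the real program this would be the
        ' text extracted from one PDF page.
        Dim pageText As String = "Some header" & Environment.NewLine &
                                 "data_id 400M549822" & Environment.NewLine &
                                 "Data_ID 400M549233"

        ' The lookbehind covers "data_id" and the whitespace after it, so the
        ' match itself is only the 10 word characters. There is no $ anchor,
        ' because the id is not at the end of the page text.
        Dim pattern As String = "(?<=data_id\s+)\w{10}"

        For Each m As Match In Regex.Matches(pageText, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

With the sample text above, this prints 400M549822 and 400M549233 on separate lines. If the leading space is wanted as well, the answer's \s\w{10} form can be used in place of \w{10}.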