RegEx for matching special patterns in VB.net

Asked: 2019-05-08 16:37:29

Tags: regex vb.net regex-lookarounds regex-group regex-greedy

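I have code that uses entity references (&Ch1;) to pull the text of different SGM files into a master file. The code works well for that, but the requirement has now grown: it also has to resolve entity references for section files, such as &Ch1-1;. This could grow further to &Ch1-1-1;.

I need to extend the code to accept these new entities so that the contents of those files can be added to the master file.

I think the problem is the regular expression being used, so I changed it to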

Dim rx = New Regex("&Ch(?<EntityNumber>\d+?[-\d+]?)?")

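This does not produce an error, but it also does not bring the file contents into the master document. I am used to regular expressions, but I have never used named capture groups, and the explanations I found on the web were confusing.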

Sub runProgram()
    Dim DirFolder As String = txtDirectory.Text
    Dim Directory As New IO.DirectoryInfo(DirFolder)
    Dim allFiles As IO.FileInfo() = Directory.GetFiles("*.sgm")
    Dim singleFile As IO.FileInfo
    Dim Response As String


    Dim Prefix As String
    Dim newMasterFilePath As String
    Dim masterFileName As String
    Dim newMasterFileName As String
    Dim startMark As String = "<!--#start#-->"
    Dim stopMark As String = "<!--#stop#-->"
    searchDir = txtDirectory.Text
    Prefix = txtBxUnique.Text
    For Each singleFile In allFiles
        If File.Exists(singleFile.FullName) Then
            Dim fileName = singleFile.FullName
            Debug.Print("file name : " & fileName)
            ' A backup first    
            Dim backup As String = fileName & ".bak"
            File.Copy(fileName, backup, True)

            ' Load lines from the source file in memory
            Dim lines() As String = File.ReadAllLines(backup)

            ' Now re-create the source file and start writing lines inside a block
            Dim insideBlock As Boolean = False
            Using sw As StreamWriter = File.CreateText(backup)
                For Each line As String In lines
                    If line = startMark Then
                        ' start writing at the line below
                        insideBlock = True
                    ElseIf line = stopMark Then
                        ' Stop writing
                        insideBlock = False
                    ElseIf insideBlock = True Then
                        ' Write the current line in the block
                        sw.WriteLine(line)
                    End If
                Next
            End Using
        End If
    Next

    masterFileName = Prefix & $"_Master_Document.sgm"
    newMasterFileName = Prefix & $"_New_Master_Document.sgm"
    newMasterFilePath = IO.Path.Combine(searchDir, newMasterFileName)

    Dim existingMasterFilePath = IO.Path.Combine(searchDir, masterFileName)


    'Read all text of the Master Document
    'and create a StringBuilder from it.
    'All replacements will be done on the
    'StringBuilder as it is more efficient
    'than using Strings directly
    Dim strMasterDoc = File.ReadAllText(existingMasterFilePath)
    Dim newMasterFileBuilder As New StringBuilder(strMasterDoc)

    'Create a regex with a named capture group.
    'The name is 'EntityNumber' and captures just the
    'entity digits for use in building the file name
    Dim rx = New Regex("&Ch(?<EntityNumber>\d+(-?\d*)*)?")
    Dim rxMatches = rx.Matches(strMasterDoc)

    For Each match As Match In rxMatches
        Dim entity = match.ToString
        'Build the file name using the captured digits from the entity in the master file
        Dim entityFileName = Prefix & $"_Ch{match.Groups("EntityNumber")}.sgm.bak"
        Dim entityFilePath = Path.Combine(searchDir, entityFileName)
        'Check if the entity file exists and use its contents
        'to replace the entity in the copy of the master file
        'contained in the StringBuilder
        If File.Exists(entityFilePath) Then
            Dim entityFileContents As String = File.ReadAllText(entityFilePath)
            newMasterFileBuilder.Replace(entity, entityFileContents)
        End If
    Next


    'write the processed contents of the master file to a different file
    File.WriteAllText(newMasterFilePath, newMasterFileBuilder.ToString)

    Dim largeFilePath As String = newMasterFilePath
    Dim lines1 = File.ReadLines(largeFilePath).ToList 'don't use ReadAllLines
    Dim reg = New Regex("\<\!NOTATION.*$|\<\!ENTITY.*$", RegexOptions.IgnoreCase)
    Dim entities = From line In lines1
                   Where reg.IsMatch(line)


    Dim dictionary As New Dictionary(Of Integer, String)
    Dim idx = -1
    For Each s In entities
        idx = lines1.IndexOf(s, idx + 1)
        dictionary.Add(idx, s.Trim)
    Next

    Dim deletedItems = 0
    For Each itm In dictionary
        lines1.RemoveAt(itm.Key - deletedItems)
        deletedItems += 1
    Next

    Dim uniqueDict = dictionary.GroupBy(Function(itm) itm.Value).
    Select(Function(group) group.First()).
    ToDictionary(Function(itm) itm.Key, Function(itm) itm.Value)

    For Each s In uniqueDict.Values
        lines1.Insert(1, s)
    Next


    Dim builtMaster As String = Prefix & "_FinalDeliverable.sgm"
    Dim newBuiltMasterFilePath = IO.Path.Combine(searchDir, builtMaster)
    Dim builtMasterDoc As String = newBuiltMasterFilePath
    Using sw As New System.IO.StreamWriter(builtMasterDoc)
        For Each line As String In lines1
            sw.WriteLine(line)
        Next
        sw.Flush()
        sw.Close()
    End Using

    'Delete the master document and new master document

    If System.IO.File.Exists(existingMasterFilePath) = True Then
        System.IO.File.Delete(existingMasterFilePath)
    End If

    If System.IO.File.Exists(newMasterFilePath) = True Then
        System.IO.File.Delete(newMasterFilePath)
    End If

    For Each filename As String In IO.Directory.GetFiles(searchDir, "*.bak")
        IO.File.Delete(filename)
    Next


    Response = MsgBox("File 'FinalDeliverable.sgm' has been created.", vbOKOnly, "SGM Status")
    If Response = vbOK Then    ' User clicked OK.
        Close()
    Else    ' Any other result (not produced by a vbOKOnly dialog).
        ' Perform some action.
    End If
End Sub

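The result I expect is that the contents of the file named Ch1-1.sgm are added to the master file.

This does work for the &Ch1; file entity. It correctly pulls in the contents of Ch1.sgm.

Thanks for your help, Maxim

Sample code: Master_Document.sgm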

<!DOCTYPE DOC PUBLIC "-//USA-DOD//DTD 38784STD-BV7//EN"[
]>
&Ch1;
<body numcols="2">
&Ch2-1;
&Ch2-2;
&Ch2-3;
&Ch2-4;
&Ch2-5;
&Ch2-6;
&Ch2-7;
&Ch2-8;
&Ch2-9;
&Ch3;
</body></doc>

Sample SGM file

 <?Pub /_gtinsert>                     
    <body numcols="2">                    
    <!--#start#-->                        
    <chapter id="Chapter_4__Procedures">  
    <title>Procedures</title>             
    <section>                             
    <title>Introduction</title>           
    <!--#stop#-->                         
    <para0 verdate="7 Never 2012" verstatu
    <title>Description</title>            
    <para>This chapterfor the following:  

1 answer:

Answer 0 (score: 1)

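It turns out the problem is that &Ch(?<EntityNumber>\d+?[-\d+]?)? matches &Ch, then one or more digits but as few as possible (with \d+?), and then one optional character that is a -, a digit, or a + symbol (with [-\d+]?). That is, after &Ch the lazy \d+? takes only one digit (there is always a digit in your case), and if a - follows, it is consumed and the match stops there. For &Ch1-1; the EntityNumber group therefore holds 1- rather than 1-1.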
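The behaviour described above can be checked with a small console sketch. It uses the pattern exactly as posted in the question; the module name and the sample list are only illustrative, and the comments show what each entity yields.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyPatternDemo
    Sub Main()
        ' The pattern from the question, unchanged.
        Dim rx As New Regex("&Ch(?<EntityNumber>\d+?[-\d+]?)?")

        ' Entity forms mentioned in the question.
        Dim samples As String() = {"&Ch1;", "&Ch2-1;", "&Ch1-1-1;"}

        For Each sample As String In samples
            Dim m As Match = rx.Match(sample)
            ' A named group is read through Groups("name").Value.
            Dim captured As String = m.Groups("EntityNumber").Value
            Console.WriteLine(sample & " -> match '" & m.Value & "', EntityNumber '" & captured & "'")
        Next
        ' &Ch1;     -> match '&Ch1',  EntityNumber '1'
        ' &Ch2-1;   -> match '&Ch2-', EntityNumber '2-'
        ' &Ch1-1-1; -> match '&Ch1-', EntityNumber '1-'
    End Sub
End Module
```

With a captured value of 2-, the code in the question would look for a file ending in _Ch2-.sgm.bak. That file does not exist, which is consistent with nothing being inserted and no error being raised.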

Use

-

See the regex demo and the regex graph.
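As a minimal sketch of a pattern that fits the explanation above, and not necessarily the exact expression this answer originally gave, the entity number can be written as digits followed by any number of -digits parts, with the closing ; included in the match. The dictionary below is a hypothetical stand-in for the .sgm.bak files, so the example runs without touching the file system.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EntityPatternSketch
    Sub Main()
        ' Assumed pattern: digits, then zero or more "-digits" parts, then the closing ";".
        Dim rx As New Regex("&Ch(?<EntityNumber>\d+(?:-\d+)*);")

        ' Hypothetical stand-in for the entity files, keyed by entity number.
        Dim contents As New Dictionary(Of String, String) From {
            {"1", "[Ch1 text]"},
            {"2-1", "[Ch2-1 text]"},
            {"1-1-1", "[Ch1-1-1 text]"}
        }

        Dim master As String = String.Join(Environment.NewLine,
            "&Ch1;", "&Ch2-1;", "&Ch1-1-1;", "&Ch3;")

        ' Replace each entity with its contents; entities without an entry stay as they are.
        Dim result As String = rx.Replace(master,
            Function(m As Match) As String
                Dim key As String = m.Groups("EntityNumber").Value
                Dim text As String = Nothing
                If contents.TryGetValue(key, text) Then
                    Return text
                End If
                Return m.Value
            End Function)

        Console.WriteLine(result)
        ' [Ch1 text]
        ' [Ch2-1 text]
        ' [Ch1-1-1 text]
        ' &Ch3;
    End Sub
End Module
```

In the question's code, the captured value would go into the file name instead of a dictionary lookup. Matching the ; as part of the entity also means that replacing &Ch1; cannot touch the start of &Ch1-1; or &Ch10;, which a replacement on the bare text &Ch1 would do.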