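How to replace HTML-tagged text in a Word document in VB.NET

Time: 2015-02-27 07:29:06

Tags: vb.net

I have VB.NET code that finds and replaces text in a Word document file (.docx). I am using OpenXml for this. However, I want to replace only the text marked with HTML-style tags, and always remove the tags once the new text has been inserted into the document.

My code is: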

Public Sub SearchAndReplace(ByVal document As String)

    Dim wordDoc As WordprocessingDocument = WordprocessingDocument.Open(document, True)
    Using (wordDoc)
        Dim docText As String = Nothing
        Dim sr As StreamReader = New StreamReader(wordDoc.MainDocumentPart.GetStream)

        Using (sr)
            docText = sr.ReadToEnd
        End Using

        Dim regexText As Regex = New Regex("<ReplaceText>")
        docText = regexText.Replace(docText, "Hi Everyone!")
        Dim sw As StreamWriter = New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))

        Using (sw)
            sw.Write(docText)
        End Using
    End Using
End Sub

1 answer:

Answer 0: (score: 0)

Here is an approach that may help you solve the problem.

Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim Text As String = "Blah<foo>Blah"
        'Prints Text
        Console.WriteLine(Text)
        Dim regex As New Regex("(<)[\w/]+(>)")
        'Prints Text after replacing what sits between capturing groups 1 and 2.
        'Capturing groups are the parenthesized parts of the regex pattern.
        Console.WriteLine(regex.Replace(Text, "$1foo has been replaced.$2"))
        'Update Text
        Text = regex.Replace(Text, "$1foo has been replaced.$2")
        'Remove the first opening angle bracket (InStr is 1-based, Remove is 0-based)
        Dim p As Integer = InStr(Text, "<")
        Text = Text.Remove(p - 1, 1)
        'Remove the first closing angle bracket
        Dim pp As Integer = InStr(Text, ">")
        Text = Text.Remove(pp - 1, 1)
        'Print Text
        Console.WriteLine(Text)
        Console.ReadLine()
    End Sub

End Module

Output:

Blah<foo>Blah
Blah<foo has been replaced.>Blah
Blahfoo has been replaced.Blah

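The code above will not work if a line contains more than one tag, because the cleanup step removes only the first "<" and the first ">" that InStr finds.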
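One way around the multiple-tag limitation is to let the regex consume the angle brackets together with the tag name, so a single Replace call handles every tag and no separate cleanup is needed. The following is only a sketch: the module name MultiTagDemo, the helper ReplaceTags, and the sample string are hypothetical, and the pattern assumes tags consist only of word characters and slashes.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MultiTagDemo
    ' Hypothetical helper: swaps every <tag> for the given text, brackets included.
    Function ReplaceTags(ByVal input As String, ByVal replacement As String) As String
        ' Assumed tag shape: "<", one or more word characters or slashes, ">".
        Dim tagPattern As New Regex("<[\w/]+>")
        ' A MatchEvaluator returns the replacement verbatim, so a "$" in it
        ' is not interpreted as a group reference.
        Return tagPattern.Replace(input, Function(m As Match) replacement)
    End Function

    Sub Main()
        ' Two tags on the same line are both replaced in one pass.
        Console.WriteLine(ReplaceTags("Blah<foo>Blah<bar>Blah", "X"))
        ' Prints: BlahXBlahXBlah
    End Sub
End Module
```

Because the brackets are part of the match, they disappear along with the tag name, which is what the question asks for.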

I recommend not using regular expressions to parse HTML.
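For the .docx case in the question, a possible alternative to running a regex over the raw XML is to walk the document's text elements with the Open XML SDK and replace the tag inside each one. Note that in the raw XML a literal "<" in the text is stored escaped as "&lt;", so a pattern such as "<ReplaceText>" would not match there, whereas the Text property of a text element exposes the unescaped string. This is a sketch under several assumptions: the module name TagReplacer and the method ReplaceTaggedText are hypothetical, the tag pattern is assumed, and the whole tag is assumed to sit inside a single text run (Word sometimes splits text across runs, in which case this will miss it).

```vbnet
Imports System.Linq
Imports System.Text.RegularExpressions
Imports DocumentFormat.OpenXml.Packaging
Imports W = DocumentFormat.OpenXml.Wordprocessing

Module TagReplacer
    ' Hypothetical helper: replaces every <tag> found in the document's text
    ' elements with the given text and drops the brackets.
    Public Sub ReplaceTaggedText(ByVal path As String, ByVal replacement As String)
        ' Assumed tag shape: "<", one or more word characters or slashes, ">".
        Dim tagPattern As New Regex("<[\w/]+>")

        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(path, True)
            ' Snapshot the text elements before editing them.
            Dim texts = wordDoc.MainDocumentPart.Document.Descendants(Of W.Text)().ToList()

            For Each t As W.Text In texts
                If tagPattern.IsMatch(t.Text) Then
                    ' The evaluator returns the replacement verbatim.
                    t.Text = tagPattern.Replace(t.Text, Function(m As Match) replacement)
                End If
            Next

            ' Write the modified element tree back to the main document part.
            wordDoc.MainDocumentPart.Document.Save()
        End Using
    End Sub
End Module
```

A call such as ReplaceTaggedText("C:\temp\sample.docx", "Hi Everyone!") would then be the entry point; the path here is a placeholder.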