Find a string and replace it more efficiently

Time: 2010-07-03 08:39:06

Tags: asp.net regex

Situation: I have an HTML file, and I need to remove certain parts of it.

For example, the file contains this HTML: <div style="padding:10px;">First Name:</div><div style="padding:10px; background-color: gray">random information here</div><div style="padding:10px;">First Name:</div><div style="padding:10px; background-color: gray">random information here</div>

I need to remove all text that starts with "<div style="padding:10px; background-color: gray">" and ends with "</div>", so that the result is:

<div style="padding:10px;">First Name:</div><div style="padding:10px;">First Name:</div>

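I wrote two functions that do this, but I don't think they are efficient at all. I have a 40 MB file, and the program takes 2 hours to finish. Is there a more efficient way to do this? Is there a way to use a regular expression?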

See the code below:

Public Shared Function String_RemoveText(ByVal startAt As String, ByVal endAt As String, ByVal SourceString As String) As String
    Do
        ' Locate the next start marker; stop when none is left.
        Dim StartPos As Integer = InStr(SourceString, startAt)
        If StartPos < 1 Then Exit Do

        ' Text that follows the start marker (skip the whole marker, not just one character).
        Dim LeftRemoved As String = Mid(SourceString, StartPos + Len(startAt))
        Dim EndPos As Integer = InStr(LeftRemoved, endAt)
        If EndPos < 1 Then Exit Do ' no closing marker: leave the rest untouched

        Dim RemoveCore As String = Left(LeftRemoved, EndPos - 1)
        Dim RemoveString As String = startAt & RemoveCore & endAt

        ' Removes every occurrence of this exact block; the whole string is rescanned on each pass.
        SourceString = Replace(SourceString, RemoveString, "")
    Loop

    Return SourceString
End Function

Public Shared Sub Files_ReplaceText(ByVal DirectoryPath As String, ByVal SourceFile As String, ByVal DestinationFile As String, ByVal sFind As String, ByVal sReplace As String, ByVal TrimContents As Boolean, ByVal RemoveCharacters As Boolean, ByVal rStart As String, ByVal rEnd As String)

    'CREATE NEW FILENAME
    Dim DateFileName As String = Date.Now.ToString.Replace(":", "_")
    DateFileName = DateFileName.Replace(" ", "_")
    DateFileName = DateFileName.Replace("/", "_")
    Dim FileExtension As String = ".txt"
    Dim NewFileName As String = DirectoryPath & DateFileName & FileExtension
    'CHECK IF FILENAME ALREADY EXISTS
    Dim counter As Integer = 0
    If IO.File.Exists(NewFileName) = True Then
        'CREATE NEW FILE NAME
        Do
            'Application.DoEvents()
            counter = counter + 1
            If IO.File.Exists(DirectoryPath & DateFileName & "_" & counter & FileExtension) = False Then
                NewFileName = DirectoryPath & DateFileName & "_" & counter & FileExtension
                Exit Do
            End If
        Loop
    End If
    'END NEW FILENAME

    'READ SOURCE FILE
    Dim content As String
    Using sr As New StreamReader(DirectoryPath & SourceFile)
        content = sr.ReadToEnd()
    End Using

    'REPLACE VALUES
    content = content.Replace(sFind, sReplace)

    'REMOVE STRINGS
    If RemoveCharacters = True Then content = String_RemoveText(rStart, rEnd, content)

    'TRIM (removes tab characters)
    If TrimContents = True Then content = Regex.Replace(content, "[\t]", "")

    'WRITE NEW FILE
    Using sw As New StreamWriter(NewFileName)
        sw.Write(content)
    End Using
End Sub

Example call (this also removes Chr(13) & Chr(10)): Files_ReplaceText(tPath.Text, tSource.Text, "", Chr(13) & Chr(10), "", True, True, tStart.Text, tEnd.Text)

1 answer:

Answer 0: (score: 2)

Don't use regex to parse HTML: HTML is not a regular language. For a compelling demonstration, see here.
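To make the point concrete, here is a minimal sketch of how a pattern can go wrong. The input string is made up for this example (the question's sample has no nested div), and the module name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexNestingDemo
    Sub Main()
        ' Hypothetical input: the gray block contains a nested <div>.
        Dim html As String =
            "<div style=""padding:10px; background-color: gray"">outer <div>inner</div> tail</div><p>keep</p>"

        ' Non-greedy match from the opening tag to the first closing tag.
        Dim pattern As String =
            "<div style=""padding:10px; background-color: gray"">.*?</div>"
        Dim result As String = Regex.Replace(html, pattern, "")

        ' The match stops at the inner </div>, so " tail</div>" is left behind.
        Console.WriteLine(result)
    End Sub
End Module
```

The pattern cannot tell which closing tag belongs to which opening tag, so it works only as long as the removed blocks never contain another div.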

Use the HTML Agility Pack to parse the HTML and replace the data.
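A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The module and the helper RemoveDivsByStyle are names invented for this illustration, and the XPath test is an exact match on the style attribute, so the whitespace has to be identical to what is in the file:

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack ' third-party package, assumed to be installed

Module RemoveGrayDivs
    ' Hypothetical helper: removes every <div> whose style attribute equals styleValue.
    ' Assumes styleValue contains no single quote (it is embedded in the XPath literal).
    Function RemoveDivsByStyle(ByVal html As String, ByVal styleValue As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when nothing matches.
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//div[@style='" & styleValue & "']")

        If nodes IsNot Nothing Then
            ' Copy to a list first so the tree is not modified while it is enumerated.
            For Each node As HtmlNode In nodes.ToList()
                node.Remove()
            Next
        End If

        Return doc.DocumentNode.OuterHtml
    End Function

    Sub Main()
        ' Sample input taken from the question.
        Dim html As String =
            "<div style=""padding:10px;"">First Name:</div>" &
            "<div style=""padding:10px; background-color: gray"">random information here</div>"

        Console.WriteLine(RemoveDivsByStyle(html, "padding:10px; background-color: gray"))
    End Sub
End Module
```

For a file on disk, HtmlDocument also has Load and Save methods that take a path, which avoids reading the content into a string first. Whether this is fast enough for a 40 MB file would need to be measured, but it walks the document once instead of rescanning the whole string for every removed block.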