Remove invalid links from all text files

Time: 2016-08-06 13:26:53

Tags: regex vb.net

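How do I find a string in text files (using a regex if need be), modify it slightly, search the same files again for the modified string, and, if it does not match, delete a certain tag from those files?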

Sample input:

<sec id="sec1">
<p>"You fig. 23 did?" I <a href rid="sec12">section 12</a> asked, surprised.</p>
<p>"Cross sent it table 9 to me a few weeks ago." Stanton crossed over to my mother, taking her hand in his. "I <a href rid="sec2">section 2</a> couldn"t have argued for better terms."</p>
<p>"There are always better terms, Richard!" my mom said sharply.</p>
<p>"There are <xref ref-type="biblio" rid="ref2">[2]</xref> rewards for milestones such as anniversaries and the birth of children, and nothing in the way of penalties for Eva, aside from marit table 9al counseling. A dissolution would have a more than equit table 9able distribution of assets. I <a href rid="sec2">section 2</a> was tempted to ask if Cross had his in-house counsel review it table 9. I <a href rid="sec2">section 2</a> imagine they argued strenuously against it table 9."</p>
<p>She settled for a moment, taking that in. Then she pushed to her feet, bristling. "But you knew they were eloping? You fig. 23 knew, and you didn"t say anything?"</p>
<p>"Of course, I <a href rid="sec2">section 2</a> didn"t know." He pulled her into his arms, crooning softly like he would wit table 9h a child. "I <a href rid="sec2">section 2</a> assumed he was looking ahead. You fig. 23 know these things usually take a few months of negotiating. Although, in this case, there was nothing more I <a href rid="sec2">section 2</a> could"ve asked for."</p>
<p>I <a href rid="sec2">section 2</a> stood. I <a href rid="sec2">section 2</a> had to hurry if I <a href rid="sec2">section 2</a> was going to get to work on time. Today of all days, I <a href rid="sec2">section 2</a> didn"t want to be late.</p>
<p>"Where are you <xref ref-type="biblio" rid="ref14">[14]</xref> going?" My mother straightened away from Stanton. "We"re not done wit table 9h this discussion. You fig. 23 can"t just drop a bomb like that and leave!"
<fig id="fig4">
<caption><p>I'm confused</p></caption>
</fig>  
</p>
<p>Turning to face her, I <a href rid="sec2">section 2</a> walked backward. "I"ve seriously got to get ready. Why don"t we get together for lunch and talk more then?"</p>
<sec id="sec2">
<p>"You fig. 23 can"t be""</p>
<p>I <a href rid="sec2">section 2</a> cut her <xref ref-type="biblio" rid="ref1">[1]</xref>, <xref ref-type="biblio" rid="ref3">[3]</xref> off. "Corinne Giroux."</p>
<p>My mother"s eyes widened, then narrowed. One name. I <a href rid="sec5">section 5</a> didn"t have to say anything else.</p>
<p>Gideon"s ex was a problem that needed no further explanation.</p>
<p>It was the rare person who came to Manhattan and didn"t feel an instant familiarit table 9y. The skyline of the cit table 9y had been immortalized in too many movies and television shows to count, spreading the love affair wit table 9h New York from residents to the world.</p>
<p>I <a href rid="sec2">section 2</a> was no exception.</p>
<p>I <a href rid="sec4">section 4</a> adored the Art Deco elegance of the Chrysler Building. I <a href rid="sec2">section 2</a> could pinpoint my place on the island in relation to the posit table 9ion of the Empire State Building. I <a href rid="sec2">section 2</a> was awed by the breathtaking height of the Freedom Tower that now dominated downtown. But the Crossfire Building was in a class by it table 9self. I"d thought so before I <a href rid="sec2">section 2</a> had ever fallen in love wit table 9h the man whose vision had led to it table 9s creation.</p>
<p>As Ra"l pulled the Benz up to <xref ref-type="biblio" rid="ref15">[15]</xref> the curb, I <a href rid="sec2">section 2</a> marveled at the distinctive sapphire blue glass that encased the obelisk shape of the Crossfire. My head tilted back, my gaze sliding up the shimmering height to the point at the top, the light-drenched space that housed Cross Industries. Pedestrians surged around me, the sidewalk teeming wit table 9h businessmen and -women heading to work wit table 9h briefcases and totes in one hand and steaming cups of coffee in the other.</p>
<p>I <a href rid="sec1">section 1</a> felt Gideon before I <a href rid="sec1">section 1</a> saw him, my entire body humming wit table 9h awareness as he stepped out of the Bentley, which had pulled up behind the Benz. The air around me charged wit table 9h electricit table 9y, the crackling energy that always heralded the approach of a storm.</p>
</sec>
</sec>

The code I have written so far is:

Imports System.IO
Imports System.Text.RegularExpressions
Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        If FolderBrowserDialog1.ShowDialog = DialogResult.OK Then
            TextBox1.Text = FolderBrowserDialog1.SelectedPath
        End If
    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        Dim targetDirectory As String
        targetDirectory = TextBox1.Text
        Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt")
        For Each txtFile In txtFilesArray
            Dim FileInfo As New FileInfo(txtFile)
            Dim FileLocation As String = FileInfo.FullName
            Dim input() As String = File.ReadAllLines(FileLocation)
            Dim pattern As String = "(?<=rid="sec)(\d+)(?=">)"
            Dim r As Regex = New Regex(pattern)
            Dim m As Match = r.Match(input)
            If (m.Success) Then
                Dim x As String = " id=""sec" + pattern + """"
                Dim r2 As Regex = New Regex(x)
                Dim m2 As Match = r2.Match(input)
                If (m2.Success) Then
                    Dim tgPat As String = "<a href rid=""sec + pattern +"">(\w+) (\d+)</a>"
                    Dim tgRep As String = "$1 $2"
                    Dim tgReg As New Regex(tgPat)
                    Dim result1 As String = tgReg.Replace(input, tgRep)
                Else
                End If
            End If
        Next
    End Sub
End Class

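The code is clearly incomplete and buggy; can anyone help? Basically, it should search each file for rid="sec[0-9]+", then match that against the id in <sec id="sec[0-9]+">, and when it finds no match, remove the link. How can I do this?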

1 answer:

Answer 0 (score: 0)

A probably more reliable alternative is to parse the XML, although the output will not preserve the new lines around the <caption> tag.

' Needs Imports System.Xml.Linq if it is not already imported at project level
Dim sInput = IO.File.ReadAllText("input.txt")
sInput = sInput.Replace("<a href ", "<a href="""" ") ' a bare " href " attribute is not well-formed, parsable XML
Dim xInput = XElement.Parse(sInput)

' this is where the magic happens
Dim aTags = xInput...<a>    ' all anchor tags
Dim gRIDs = aTags.GroupBy(Function(x) x.@rid)   ' group by the rid attribute
For Each g In gRIDs
    If g.Count = 1 Then
        g(0).ReplaceWith(g(0).Value) ' replaces the XElement <a href="" rid="sec12">section 12</a> with its Value, section 12
    End If
Next

Dim sOutput = xInput.ToString
sOutput = sOutput.Replace("<a href="""" ", "<a href ") ' optional: change href="" back to href
sOutput = sOutput.Replace("  ", "") ' optional: remove indentation (also removes any other double spaces)
IO.File.WriteAllText("output.txt", sOutput)
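
Note that the grouping above unwraps anchors whose rid occurs exactly once, which happens to coincide with the dead links in the sample input. A minimal sketch of the check as the question words it (unwrap an anchor only when no <sec id="..."> carries the same id) could look like the following. It assumes the tags always have exactly the shapes shown in the sample input; DeadLinkRemover and RemoveDeadLinks are hypothetical names, not part of the original code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module DeadLinkRemover
    ' Hypothetical helper: unwraps <a href rid="secN">text</a> when no <sec id="secN"> exists.
    ' Assumes the exact attribute layout used in the sample input.
    Public Function RemoveDeadLinks(text As String) As String
        ' Collect every id declared by a <sec id="secN"> tag.
        Dim ids As New HashSet(Of String)()
        For Each m As Match In Regex.Matches(text, "<sec id=""(sec\d+)"">")
            ids.Add(m.Groups(1).Value)
        Next

        ' Group 1 is the rid, group 2 is the link text.
        Dim anchor As New Regex("<a href rid=""(sec\d+)"">(.*?)</a>")
        ' Keep the anchor when its target exists, otherwise keep only the link text.
        Return anchor.Replace(text,
            Function(m As Match) If(ids.Contains(m.Groups(1).Value), m.Value, m.Groups(2).Value))
    End Function

    Sub Main()
        Dim sample As String = "<sec id=""sec2""><p>I <a href rid=""sec12"">section 12</a> and <a href rid=""sec2"">section 2</a></p></sec>"
        ' sec12 has no matching <sec>, so only that anchor is unwrapped.
        Console.WriteLine(RemoveDeadLinks(sample))
    End Sub
End Module
```

Because the ids are collected before any replacement is made, the order of sections and links in the file does not matter.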

Update

Dim sInput = IO.File.ReadAllText("input.txt")
Dim splitBy = "<a href rid="""
Dim aInput = Split(sInput, splitBy)

Dim groups = Enumerable.Range(1, aInput.Length - 1).GroupBy(Function(i) Split(aInput(i), """", 2)(0)) ' group by string between '<a href rid="' and '"'

For Each g In groups
    If g.Count = 1 Then
        aInput(g(0)) = Split(aInput(g(0)), ">", 2)(1).Replace("</a>", "")  ' Example: 'sec12">section 12</a> asked..' to 'section 12 asked..'
    Else
        For Each i In g
            aInput(i) = splitBy & aInput(i)  ' Example: 'sec12">section 12</a> asked..' to '<a href rid="sec12">section 12</a> asked..'
        Next
    End If
Next

Dim sOutput = Join(aInput, "")
IO.File.WriteAllText("output.txt", sOutput)
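
The question asks for this to run over every text file in a folder, while the snippets above read a single input.txt. A rough sketch of that outer loop follows, assuming the files are small enough to read whole and that overwriting them in place is acceptable. FolderRunner, ProcessFolder and its transform parameter are hypothetical names; transform stands for whichever of the approaches above is wrapped in a String-to-String function.

```vbnet
Imports System
Imports System.IO

Module FolderRunner
    ' Hypothetical helper: applies transform to each *.txt file in folder
    ' and rewrites only the files whose content changed.
    ' Returns the number of files that were modified.
    Public Function ProcessFolder(folder As String, transform As Func(Of String, String)) As Integer
        Dim changed As Integer = 0
        For Each filePath As String In Directory.GetFiles(folder, "*.txt")
            Dim before As String = File.ReadAllText(filePath)
            Dim after As String = transform(before)
            If after <> before Then
                File.WriteAllText(filePath, after)
                changed += 1
            End If
        Next
        Return changed
    End Function
End Module
```

In the question's form, this could be called from Button2_Click with TextBox1.Text as the folder argument.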