I have a huge text file that contains a lot of duplicated records. The duplication looks like this:
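Total Posts 16
Password = GFDHG
TITLE = Shop signs / projecting signs / industrial signage / restaurant signs / menu boards and boxes in London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753
Total Posts 16
Password = GFDHG
TITLE = Shop signs / projecting signs / industrial signage / restaurant signs / menu boards and boxes in London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753
Total Posts 2894
Password = GFDHG
TITLE = Shop signs / projecting signs / industrial signage / restaurant signs / menu boards and boxes in London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753
Total Posts 2894
Password = GFDHG
TITLE = Shop signs / projecting signs / industrial signage / restaurant signs / menu boards and boxes in London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753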
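...and so on. The text file contains 4000 posts in total. I want my program to match Total Posts 6 against every Total Posts entry that occurs in the file, find the duplicates, and then programmatically delete each duplicate together with the 7 lines that follow it. Thanks.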
Answer 0 (score: 0)
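Assuming the format is consistent (i.e. every record in the file occupies 6 lines of text), removing the duplicates from the file only requires the following: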
Sub DupClean(ByVal fpath As String) 'fpath is the FULL file path, e.g. C:\Users\username\Documents\filename.txt
    Const LinesPerRecord As Integer = 6 'number of lines per record; adjust this to match your file
    Dim TxtLines As New List(Of String)
    Dim CleanLines As New List(Of String)
    Dim Seen As New HashSet(Of String) 'blocks already kept, so each duplicate check is a fast lookup
    Dim CText As String = ""
    Dim i As Integer = 0
    'Output goes to <name>_clean.txt next to the original; to overwrite the original file, use fpath instead
    Dim OutPath As String = System.IO.Path.ChangeExtension(fpath, Nothing) & "_clean.txt"
    Try
        'Read in the text, one list entry per line (ReadAllLines handles both LF and CRLF line endings)
        TxtLines.AddRange(System.IO.File.ReadAllLines(fpath, System.Text.Encoding.UTF8))
    Catch ex As Exception
        MsgBox("Encountered an error while reading in the text file contents and parsing them. Details: " & ex.Message, vbOKOnly, "Read Error")
        Return
    End Try
    Try
        'Now we iterate through blocks of LinesPerRecord lines
        Do While i < TxtLines.Count
            'Take the next block; the final block may be shorter if the file does not end on a record boundary
            Dim Block As List(Of String) = TxtLines.GetRange(i, Math.Min(LinesPerRecord, TxtLines.Count - i))
            'Set CText to the text of this block so that whole records are compared
            CText = String.Join(vbLf, Block)
            'Seen.Add returns False if an identical CText was added before, so duplicates are skipped
            If Seen.Add(CText) Then
                CleanLines.AddRange(Block)
            End If
            i += LinesPerRecord
        Loop
    Catch ex As Exception
        MsgBox("Encountered an error while cleaning duplicates from the text that was read in. The application was at line index " & i & " when the following error was thrown: " & ex.Message, vbOKOnly, "Comparison Error")
        Return
    End Try
    Try
        'Write out the clean text; WriteAllLines opens, writes and closes the file in one call
        System.IO.File.WriteAllLines(OutPath, CleanLines)
    Catch ex As Exception
        MsgBox("Encountered an error writing the cleaned text. Details: " & ex.Message, vbOKOnly, "Write Error")
    End Try
End Sub
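If the format is not consistent, you will need to get fancier and define rules that decide which lines are added to CText on any given pass through the loop, but without more context I can't give you any idea of what those rules might be.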
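One possible rule, sketched here purely as an illustration: assume every record begins with a line that starts with a fixed marker (the sample in the question suggests "Total Posts"), start a new block each time that marker appears, and keep only the first copy of each block. The names `DedupByMarker`, `FlushBlock` and `marker` are hypothetical and not part of the answer above, and the file path is a placeholder.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module MarkerDedup
    ' Hypothetical helper: groups lines into records that each begin at a line
    ' starting with `marker`, and returns the lines of the first copy of each record.
    Function DedupByMarker(ByVal lines As IEnumerable(Of String), ByVal marker As String) As List(Of String)
        Dim seen As New HashSet(Of String)
        Dim result As New List(Of String)
        Dim block As New List(Of String)

        For Each line As String In lines
            ' A marker line closes the record collected so far and opens a new one
            If line.StartsWith(marker, StringComparison.Ordinal) AndAlso block.Count > 0 Then
                FlushBlock(block, seen, result)
            End If
            block.Add(line)
        Next
        ' The last record has no marker after it, so flush it explicitly
        FlushBlock(block, seen, result)
        Return result
    End Function

    ' Appends the block to result only if an identical block was not seen before, then clears it
    Private Sub FlushBlock(ByVal block As List(Of String), ByVal seen As HashSet(Of String), ByVal result As List(Of String))
        If block.Count > 0 AndAlso seen.Add(String.Join(Environment.NewLine, block)) Then
            result.AddRange(block)
        End If
        block.Clear()
    End Sub

    Sub Main()
        Dim fpath As String = "C:\Users\username\Documents\filename.txt" ' placeholder path
        Dim cleaned As List(Of String) = DedupByMarker(File.ReadAllLines(fpath, Encoding.UTF8), "Total Posts")
        File.WriteAllLines(Path.ChangeExtension(fpath, Nothing) & "_clean.txt", cleaned)
    End Sub
End Module
```

Because records are delimited by the marker rather than by a fixed count, this variant does not care whether a record spans 5, 6 or 7 lines. It does treat two records as duplicates only when every line in them matches exactly, including any blank lines inside the record.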