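Before importing into MS Access, I have tried a lot of scripts and methods to clean a large text file.
The text file has 500k+ lines. Some lines contain a carriage return or a line feed. These show as square symbols in Notepad. (Interestingly, in Windows XP they are squares, but in Windows 2003 they do not show in Notepad; instead they break the text onto the next line.)
These characters should not appear in any field, so I need a way to remove all of them from the file.
Sample of the text file contents: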
FIELD_NAME1|FIELD_NAME2 |FIELD_NAME3
John |He likes food |1002
Jake |He eats food |1004
Jake |He eats food and [][] likes swimming|1003
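1) One solution is to read the file and fix the lines. However, this is difficult to implement: typically you only realize a line is wrong from the errors in the lines that follow it.
2) Another approach is to split the text file into smaller pieces and then use find and replace. Once they are clean, glue them back together and import into MS Access.
Is there an easy way?
This task only needs to be run a few times, so automation is not critical.
Analysis output added by dmuk and then edited by Tony Dallimore
See my (Tony Dallimore's) answer for an explanation of this analysis output. I did not expect to find such long control strings (caused by, for example, 44 blank lines). I have wrapped these long strings in column 1 to improve readability.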
String | First file | First line | Last file | Last line
13 10 | 1 | 1 | 376 | 626
9 | 1 | 2299 | 375 | 3524
9 9 | 3 | 6106 | 67 | 6111
9 9 9 9 | 6 | 1916 | 53 | 1492
9 9 9 | 6 | 1917 | 53 | 1493
9 9 9 9 9 | 42 | 1266 | 42 | 1266
10 | 69 | 1524 | 240 | 4885
10 10 | 69 | 3577 | 222 | 4651
13 10 13 10 | 71 | 3697 | 374 | 3258
13 10 10 | 80 | 5440 | 240 | 4166
13 10 13 10 13| 81 | 2657 | 290 | 2094
10 13 10 | | | |
13 10 13 10 13| 81 | 2662 | 215 | 1802
10 | | | |
13 10 13 10 10| 86 | 2082 | 86 | 6914
10 10 10 | 88 | 1314 | 221 | 4754
9 10 | 94 | 246 | 94 | 246
13 10 13 10 13| 126 | 1699 | 126 | 1699
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| 143 | 2078 | 143 | 2078
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 | | | |
10 10 10 10 | 182 | 1846 | 188 | 2663
10 10 10 10 10| 195 | 3320 | 195 | 3320
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 | | | |
13 10 13 10 13| 198 | 4223 | 198 | 4223
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 | 198 | 4223 | 198 | 4223
10 10 10 10 10| 213 | 5449 | 213 | 5449
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 10 10 10 10| | | |
10 | | | |
13 10 13 10 13| 278 | 788 | 278 | 788
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 13| | | |
10 13 10 13 10| | | |
13 10 13 10 | | | |
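Answer 0 (score: 2)
Introduction
At first it appeared that the problem was extra carriage returns. The first solution (since deleted) searched for lone CRs and removed them. This had no beneficial effect, so it was clear that the problem was not extra carriage returns. I provided the analysis code below so that we could assess the true situation properly. The output of this analysis routine has been added to the original question. A review of that output revealed the real problems: blank lines, and line feeds with no preceding carriage return.
The revised solution based on these findings is below the analysis code.
Analysis
You need to place the following code in a module. The routine requires a worksheet named "DiagInfo".
The code loops, reading blocks of about 1 Mb from the input file. It splits each block into lines, treating any control character as a line terminator. It creates one output file per block.
Near the top of the routine you will find: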
' ###### Replace names as required
FileInNameRoot = "TestSplitLine In"
FileOutNameRoot = "TestSplitLine Out"
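The input file is FileInNameRoot & ".txt". The output files are named FileOutNameRoot & " 001.txt", FileOutNameRoot & " 002.txt", FileOutNameRoot & " 003.txt" and so on.
You can change the block size from 1 Mb if you wish. With smaller blocks the routine is much faster, but a block a tenth of the size means ten times as many output files. I found that 1 Mb gave me files that I could view easily with NotePad.
The output looks like: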
000001 FIELD_NAME1|FIELD_NAME2|FIELD_NAME3 13 10
000002 John|He likes food|1002 13 10
000003 Jake|He eats food|1004 13 10
000004 Jake|He eats food and 13
000005 likes swimming|1003 13 10
000006 John|He likes food|1002 13 10
000007 Jake|He eats food|1004 13 10
000008 Jake|He eats food and 20 27 0 4
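The first seven characters are the line number followed by a space. A line is terminated by any control character. Display characters from the input file are output unchanged. Each control character is output as a space followed by its code value. Most lines are terminated by 13 10 (CR LF), but line 4 is terminated by 13 (CR) and line 8 by 20 27 0 4 (DC4 ESC NUL EOT).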
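The display convention described above can be tried on a single string. The following is only a minimal sketch: ShowCtrlChars and DemoShowCtrlChars are hypothetical names that are not part of the analysis routine, and the sketch assumes the same definition of a control character (code below 32, or 127).

```vba
Option Explicit

' Hypothetical helper, not part of the analysis routine.
' Returns Text with every control character (code < 32, or 127)
' replaced by a space followed by its decimal code value.
Function ShowCtrlChars(ByVal Text As String) As String
  Dim Pos As Long
  Dim Chr1 As String
  Dim Result As String

  Result = ""
  For Pos = 1 To Len(Text)
    Chr1 = Mid(Text, Pos, 1)
    If Chr1 < " " Or Chr1 = Chr(127) Then
      ' Control character: output " " & code, for example " 13"
      Result = Result & " " & Asc(Chr1)
    Else
      ' Display character: output unchanged
      Result = Result & Chr1
    End If
  Next
  ShowCtrlChars = Result
End Function

Sub DemoShowCtrlChars()
  ' Prints: Jake|He eats food and 13 10
  Debug.Print ShowCtrlChars("Jake|He eats food and" & vbCr & vbLf)
End Sub
```

Unlike the analysis routine, this helper does not split the text into numbered lines; it only shows how the codes are rendered.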
The worksheet "DiagInfo" looks like:
First Last
String File Line File Line
13 10 1 1 66 5786
13 1 4 66 5666
20 27 0 4 1 8 66 5670
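Column A contains each different control string found by the routine. Columns B and C contain the file and line number of its first occurrence. Columns D and E contain the file and line number of its last occurrence.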
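The routine keeps these first and last positions in the worksheet itself. Purely as an illustration of the same bookkeeping, the sketch below holds it in memory with a Scripting.Dictionary instead. This is a swapped-in technique, not what the routine does, and the names Seen and RecordCtrlString are hypothetical.

```vba
Option Explicit

' Sketch only. Seen maps a control string to a four-element array:
' (first file, first line, last file, last line).
Dim Seen As Object

' Hypothetical helper: call once for every control string found.
Sub RecordCtrlString(ByVal CtrlChrStg As String, _
                     ByVal NumFile As Long, ByVal NumLine As Long)
  Dim Entry As Variant

  If Seen Is Nothing Then
    Set Seen = CreateObject("Scripting.Dictionary")
  End If

  If Seen.Exists(CtrlChrStg) Then
    ' Known string: keep the first occurrence, update the last
    Entry = Seen(CtrlChrStg)
    Entry(2) = NumFile
    Entry(3) = NumLine
  Else
    ' New string: first and last occurrence are the same
    Entry = Array(NumFile, NumLine, NumFile, NumLine)
  End If
  Seen(CtrlChrStg) = Entry
End Sub
```

With this approach the worksheet would only need to be written once, at the end, at the cost of losing its use as a progress indicator.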
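The routine uses worksheet "DiagInfo" as a crude progress indicator: the last row shows the current output file number and the latest line number that is a multiple of 100. For my 63 Mb test file, the routine takes 2 minutes.
This will tell us what we are dealing with and allow us to plan accordingly.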
Option Explicit
Sub AnalyseFileAndSplitIntoBlocks()

  Dim Block As String
  Dim BlockLen As Long
  Dim CtrlChr As Long
  Dim CtrlChrStg As String
  Dim FileIn As Object
  Dim FileInNameRoot As String
  Dim FileOut As Object
  Dim FileOutNameRoot As String
  Dim Found As Boolean
  Dim FSO As Object
  Dim LineOut As String
  Dim NumFileOut As Long
  Dim NumLine As Long
  Dim PathCrnt As String
  Dim PosCrnt As Long
  Dim PosStart As Long
  Dim RowDiagCrnt As Long
  Dim RowDiagNext As Long
  Dim StartTime As Single
  Dim TrailingFromLastBlock As String

  StartTime = Timer

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

  With Worksheets("DiagInfo")
    .Activate
    .Cells.EntireRow.Delete
    .Range("B1:C1").Merge
    With .Range("B1")
      .Value = "First"
      .HorizontalAlignment = xlCenter
    End With
    .Range("D1:E1").Merge
    With .Range("D1")
      .Value = "Last"
      .HorizontalAlignment = xlCenter
    End With
    .Range("A2").Value = "String"
    .Range("B2").Value = "File"
    .Range("C2").Value = "Line"
    .Range("D2").Value = "File"
    .Range("E2").Value = "Line"
    .Range("B2:E2").HorizontalAlignment = xlRight
    .Range("A1:E2").Font.Bold = True
    RowDiagNext = 3
    .Cells(RowDiagNext, 1).Select
  End With
  ActiveWindow.FreezePanes = False
  ActiveWindow.FreezePanes = True

  PathCrnt = ActiveWorkbook.Path
  Set FSO = CreateObject("Scripting.FileSystemObject")

  BlockLen = 1000000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInNameRoot & ".txt", 1, 0)
  ' 1 = Read. 0 = ASCII file

  NumFileOut = 0
  TrailingFromLastBlock = ""

  Do While FileIn.AtEndOfStream <> True

    Block = TrailingFromLastBlock & FileIn.read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character. Get another
        Block = Block & FileIn.read(1)
      Else
        Exit Do
      End If
    Loop

    With Worksheets("DiagInfo")
      NumFileOut = NumFileOut + 1
      .Cells(RowDiagNext, 2).Value = NumFileOut
      NumLine = 1
      .Cells(RowDiagNext, 3).Value = NumLine
    End With

    Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutNameRoot & " " & _
                           Right("000" & NumFileOut, 3) & ".txt", True, False)
    ' True = Can overwrite. False = ASCII

    PosStart = 1     ' Start of first line
    PosCrnt = 1
    Do While PosCrnt <= Len(Block)
      If Mid(Block, PosCrnt, 1) < " " Or _
         Mid(Block, PosCrnt, 1) = Chr(127) Then
        ' Have found a control character.
        LineOut = Mid(Block, PosStart, PosCrnt - PosStart)
        ' Build display string of control character and
        ' any subsequent control characters.
        CtrlChrStg = ""
        Do While True
          CtrlChrStg = CtrlChrStg & " " & Asc(Mid(Block, PosCrnt, 1))
          PosCrnt = PosCrnt + 1
          If PosCrnt > Len(Block) Then
            ' This block finished
            Exit Do
          End If
          If Mid(Block, PosCrnt, 1) < " " Or _
             Mid(Block, PosCrnt, 1) = Chr(127) Then
            ' Another control character
          Else
            ' First display character of next line
            Exit Do
          End If
        Loop
        ' Search for control character string in worksheet DiagInfo
        With Worksheets("DiagInfo")
          Found = False
          For RowDiagCrnt = 3 To RowDiagNext - 1
            If .Cells(RowDiagCrnt, 1).Value = CtrlChrStg Then
              Found = True
              Exit For
            End If
          Next
          If Not Found Then
            ' Previously unknown string of control characters
            RowDiagCrnt = RowDiagNext
            RowDiagNext = RowDiagNext + 1
            .Cells(RowDiagNext, 1).Select
            .Cells(RowDiagCrnt, 1).Value = "'" & CtrlChrStg
            ' First occurrence
            .Cells(RowDiagCrnt, 2).Value = NumFileOut
            .Cells(RowDiagCrnt, 3).Value = NumLine
          End If
          ' Last occurrence
          .Cells(RowDiagCrnt, 4).Value = NumFileOut
          .Cells(RowDiagCrnt, 5).Value = NumLine
        End With
        FileOut.writeline Right("00000" & NumLine, 6) & " " & _
                          LineOut & CtrlChrStg
        PosStart = PosCrnt    ' Start of current line
        NumLine = NumLine + 1
        If NumLine Mod 100 = 0 Then
          With Worksheets("DiagInfo")
            .Cells(RowDiagNext, 2).Value = NumFileOut
            .Cells(RowDiagNext, 3).Value = NumLine
          End With
        End If
      Else
        PosCrnt = PosCrnt + 1
      End If
    Loop
    FileOut.Close

    ' Save trailing characters for next line
    TrailingFromLastBlock = Mid(Block, PosStart, Len(Block) - PosStart + 1)

  Loop

  FileIn.Close

  With Worksheets("DiagInfo")
    .Cells(RowDiagNext, 2).Value = ""
    .Cells(RowDiagNext, 3).Value = ""
    .Cells(3, 1).Select
    .Cells.Columns.AutoFit
  End With

  Debug.Print Timer - StartTime

End Sub
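Revised solution
A review of the analysis output shows that the real problems are blank lines and line feeds with no preceding carriage return.
There are also tabs in the text, but the questioner considers that these are not a problem and are to be kept. The questioner wants blank lines removed and lone line feeds replaced by spaces.
The routine below reads the input file in blocks of 100,000 bytes. Updating long strings carries a lot of overhead; limited experimentation suggests that 100,000 is an acceptable compromise. If the last character of a block is a control character, the routine loops, adding one more character to the block until the last character is not a control character. This ensures that a sequence of control characters is never split across two blocks. The routine first loops replacing CR LF CR LF by CR LF until there are no blank lines. It then looks for any LF not preceded by a CR, and replaces each one found by a space. On a 63 Mb file with lots of blank lines and extra LFs, the routine takes 22 seconds to complete its task.
The only statements that need to be changed are near the top of the routine.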
Option Explicit
Sub RemoveUnwantedCtrlChars()

  Dim Block As String
  Dim BlockLen As Long
  Dim FileIn As Object
  Dim FileInName As String
  Dim FileOut As Object
  Dim FileOutName As String
  Dim FSO As Object
  Dim PathCrnt As String
  Dim PosCRLF As Long
  Dim PosLF As Long
  Dim PosLastCRLF As Long
  Dim PosLastLF As Long
  Dim StartTime As Single

  StartTime = Timer

  ' ## This assumes the input file is in the same folder
  ' ## as the workbook containing this macro.
  PathCrnt = ActiveWorkbook.Path

  ' ###### Replace names as required.
  FileInName = "TestSplitLine In.txt"
  FileOutName = "TestSplitLine Out.txt"

  Set FSO = CreateObject("Scripting.FileSystemObject")

  BlockLen = 100000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInName, 1, 0)
  ' 1 = Read. 0 = ASCII file
  Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutName, True, False)
  ' True = Can overwrite. False = ASCII

  Do While FileIn.AtEndOfStream <> True

    Block = FileIn.Read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character. Get another
        ' character
        Block = Block & FileIn.Read(1)
      Else
        Exit Do
      End If
    Loop

    ' Remove all blank lines
    Do While InStr(1, Block, vbCr & vbLf & vbCr & vbLf) <> 0
      Block = Replace(Block, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
    Loop

    ' Find all lone LFs and replace by " "
    PosLF = 1
    PosCRLF = 1
    Do While True
      PosLastLF = PosLF
      PosLastCRLF = PosCRLF
      PosLF = InStr(PosLF, Block, vbLf)
      PosCRLF = InStr(PosCRLF, Block, vbCr & vbLf)
      If PosLF = 0 Then
        ' No more LFs in this block
        Exit Do
      ElseIf PosCRLF <> 0 And PosLF > PosCRLF Then
        ' Have LF of CR LF. No action required
        PosLF = PosLF + 1
        PosCRLF = PosLF
      Else
        ' Have a lone LF
        Block = Mid(Block, 1, PosLF - 1) & " " & Mid(Block, PosLF + 1)
        ' Move CRLF pointer back to position of replaced LF
        PosCRLF = PosLF
      End If
    Loop
    PosLF = 1

    FileOut.write Block

  Loop

  FileIn.Close
  FileOut.Close

  Debug.Print Timer - StartTime

End Sub
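The two replacement steps that the routine applies to each block can be tried on a small string before running it against the full file. The following is a minimal sketch under stated assumptions: CleanBlock and DemoCleanBlock are hypothetical names, and the lone-LF test here simply looks at the preceding character rather than tracking a CR LF pointer as the routine does.

```vba
Option Explicit

' Hypothetical helper: applies the two clean-up steps to one string.
Function CleanBlock(ByVal Block As String) As String
  Dim Pos As Long

  ' Step 1: collapse blank lines (CR LF CR LF becomes CR LF)
  Do While InStr(1, Block, vbCr & vbLf & vbCr & vbLf) <> 0
    Block = Replace(Block, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
  Loop

  ' Step 2: replace each LF that is not preceded by a CR with a space
  Pos = InStr(1, Block, vbLf)
  Do While Pos <> 0
    If Pos = 1 Then
      Mid(Block, Pos, 1) = " "
    ElseIf Mid(Block, Pos - 1, 1) <> vbCr Then
      Mid(Block, Pos, 1) = " "
    End If
    Pos = InStr(Pos + 1, Block, vbLf)
  Loop

  CleanBlock = Block
End Function

Sub DemoCleanBlock()
  Dim Sample As String
  Sample = "Jake|He eats food and" & vbLf & "likes swimming|1003" & _
           vbCr & vbLf & vbCr & vbLf & _
           "John|He likes food|1002" & vbCr & vbLf
  ' Show CR LF as a visible marker so the result fits on one line.
  ' Prints: Jake|He eats food and likes swimming|1003<CRLF>John|He likes food|1002<CRLF>
  Debug.Print Replace(CleanBlock(Sample), vbCr & vbLf, "<CRLF>")
End Sub
```

The sketch works on a whole string held in memory, so it is only suitable for checking the logic; the routine above is what handles a large file block by block.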
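Answer 1 (score: 0)
Notepad++ recognizes those [] carriage returns. It is the only editor I have found that lets you search for and replace them. I use it to clean some very large .txt files before importing them.
Answer 2 (score: 0)
The TRIM function? That would be: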
=TRIM(A1:D12)
That will trim all the whitespace after the line... if I remember my basics. That's ultimately what you're looking for...