Cleaning up a large text file

Asked: 2012-03-28 17:17:18

Tags: database excel ms-access

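I have tried many scripts and methods to clean up a large text file before importing it into MS Access.

The text file has 500k+ lines. Some of the lines contain "carriage returns" or "line feeds". These are displayed as square symbols in Notepad. (Curiously, in Windows XP they are squares, but in Windows 2003 they do not appear in Notepad and instead break the text onto the next row/line.)

None of the fields should contain these characters, so I need a way to remove all of them from the file.

Example of the text file contents: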

 FIELD_NAME1|FIELD_NAME2                         |FIELD_NAME3
 John       |He likes food                       |1002
 Jake       |He eats food                        |1004
 Jake       |He eats food and [][] likes swimming|1003

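1) One solution is to read through the file and repair the lines. However, this is difficult to implement: usually you only realise a line is wrong from the errors in the lines that follow it.

2) Another approach is to split the text file into smaller pieces, use find and replace on each piece and, once they are clean, stitch them back together for MS Access.

Is there a simple way to do this?

This task only needs to be run a few times, so automation is not important.

Analysis output added by dmuk and then edited by Tony Dallimore

See my (Tony Dallimore's) answer for an explanation of this analysis output. I did not expect to find such long control strings (caused by, for example, 44 blank lines). I have wrapped these long strings within column 1 to improve readability.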

               |             First             |             Last
String         |       File    |       Line    |       File    |       Line
 13 10         |       1       |       1       |       376     |       626
 9             |       1       |       2299    |       375     |       3524
 9 9           |       3       |       6106    |       67      |       6111
 9 9 9 9       |       6       |       1916    |       53      |       1492
 9 9 9         |       6       |       1917    |       53      |       1493
 9 9 9 9 9     |       42      |       1266    |       42      |       1266
 10            |       69      |       1524    |       240     |       4885
 10 10         |       69      |       3577    |       222     |       4651
 13 10 13 10   |       71      |       3697    |       374     |       3258
 13 10 10      |       80      |       5440    |       240     |       4166
 13 10 13 10 13|       81      |       2657    |       290     |       2094
 10 13 10      |               |               |               |
 13 10 13 10 13|       81      |       2662    |       215     |       1802
 10            |               |               |               |
 13 10 13 10 10|       86      |       2082    |       86      |       6914
 10 10 10      |       88      |       1314    |       221     |       4754
 9 10          |       94      |       246     |       94      |       246
 13 10 13 10 13|       126     |       1699    |       126     |       1699
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|       143     |       2078    |       143     |       2078
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10   |               |               |               |
 10 10 10 10   |       182     |       1846    |       188     |        2663
 10 10 10 10 10|       195     |       3320    |       195     |        3320
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10   |               |               |               |
 13 10 13 10 13|       198     |       4223    |       198     |       4223
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10      |       198     |       4223    |       198     |       4223
 10 10 10 10 10|       213     |       5449    |       213     |       5449
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10            |               |               |               |
 13 10 13 10 13|       278     |       788     |       278     |       788
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10   |               |               |               |

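3 Answers:

Answer 0 (score: 2)

Introduction

At first the problem appeared to be extra carriage returns. My first solution (now deleted) searched for single CRs and removed them. This had no beneficial effect, so it became clear that extra carriage returns were not the problem. I provided the analysis code below so that we could properly assess the real situation. The output from this analysis routine has been added to the original question. A review of that output revealed the real problems:

  • Large numbers of blank lines.
  • Extra line feeds.

The revised solution, based on these findings, follows the analysis code.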

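Analysis

You need to place the following code in a module. The routine requires a worksheet named "DiagInfo".

The code loops, reading blocks of about 1 Mb from the input file. It splits each block into lines, treating any control character as a line terminator. It creates one output file per block.

Near the top of the routine you will find: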

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

The input file is: FileInNameRoot & ".txt"

The output files are named: FileOutNameRoot & " 001.txt", FileOutNameRoot & " 002.txt", FileOutNameRoot & " 003.txt" and so on.

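You can change the block size from 1 Mb if you wish. The routine sets it to 1,000,000 characters. With a smaller block the routine runs much faster, but a block one tenth of that size produces ten times as many output files. I found that 1 Mb gave me files that I could open comfortably in NotePad.

The output looks like this: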

000001 FIELD_NAME1|FIELD_NAME2|FIELD_NAME3  13 10
000002 John|He likes food|1002  13 10
000003 Jake|He eats food|1004  13 10
000004 Jake|He eats food and  13
000005 likes swimming|1003 13 10
000006 John|He likes food|1002  13 10
000007 Jake|He eats food|1004  13 10
000008 Jake|He eats food and  20 27 0 4

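The first seven characters are a line number followed by a space. A line is terminated by any control character. Display characters from the input file are output unchanged. Each control character is output as a space followed by its code value. Most lines are terminated by 13 10 (CR LF), but line 4 is terminated by 13 (CR) and line 8 by 20 27 0 4 (DC4 ESC NUL EOT).

Worksheet "DiagInfo" will look like: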
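The display convention described above can be tried on a short string before running the full routine. The following is only a minimal sketch: `ShowCtrlChars` is a hypothetical helper name that does not appear in the routine below, and it assumes the module uses the default `Option Compare Binary`, so that `< " "` compares character codes.

```vba
' Hypothetical helper, not part of the analysis routine.
' Returns Text with display characters unchanged and every control
' character (code below 32, or 127) shown as a space plus its code value.
Function ShowCtrlChars(ByVal Text As String) As String
  Dim ChrCrnt As String
  Dim Pos As Long
  Dim Result As String

  Result = ""
  For Pos = 1 To Len(Text)
    ChrCrnt = Mid(Text, Pos, 1)
    If ChrCrnt < " " Or ChrCrnt = Chr(127) Then
      ' Control character: output a space followed by its code
      Result = Result & " " & Asc(ChrCrnt)
    Else
      ' Display character: output unchanged
      Result = Result & ChrCrnt
    End If
  Next
  ShowCtrlChars = Result
End Function
```

For example, typing `? ShowCtrlChars("Jake|He eats food and" & vbCr)` in the Immediate window should show the text followed by ` 13`, in the same style as line 4 of the sample output.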

               First          Last  
String      File   Line    File   Line
 13 10        1       1     66    5786
 13           1       4     66    5666
 20 27 0 4    1       8     66    5670

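Column A contains each distinct string of control characters that the routine found. Columns B and C contain the file and line number of its first occurrence. Columns D and E contain the file and line number of its last occurrence.

The routine also uses worksheet "DiagInfo" as a crude progress indicator: the final row shows the current output file number and the most recent line number that is a multiple of 100. For my 63 Mb test file, the routine takes 2 minutes.

This will tell us what we are dealing with and allow us to plan accordingly.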
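The first/last bookkeeping described above can also be sketched without a worksheet. The version below is an assumption-laden alternative, not what the routine does: it swaps the worksheet search for a `Scripting.Dictionary`, and `RecordOccurrence` and `Seen` are hypothetical names. Each entry holds a four-element array: first file, first line, last file, last line.

```vba
' Hypothetical helper, not part of the analysis routine.
' Seen must be a Scripting.Dictionary created by the caller.
Sub RecordOccurrence(ByVal Seen As Object, ByVal CtrlChrStg As String, _
                     ByVal NumFile As Long, ByVal NumLine As Long)
  Dim Entry As Variant

  If Not Seen.Exists(CtrlChrStg) Then
    ' Previously unknown string: first and last occurrence are the same
    Seen.Add CtrlChrStg, Array(NumFile, NumLine, NumFile, NumLine)
  Else
    ' Known string: keep the first occurrence, update the last
    Entry = Seen.Item(CtrlChrStg)
    Entry(2) = NumFile
    Entry(3) = NumLine
    Seen.Item(CtrlChrStg) = Entry
  End If
End Sub
```

A caller would create the dictionary once with `Set Seen = CreateObject("Scripting.Dictionary")` and call `RecordOccurrence` for every control string found. The worksheet version has the advantage of doubling as a visible progress indicator.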

Option Explicit 
Sub AnalyseFileAndSplitIntoBlocks()

  Dim Block As String
  Dim BlockLen As Long
  Dim CtrlChr As Long
  Dim CtrlChrStg As String
  Dim FileIn As Object
  Dim FileInNameRoot As String
  Dim FileOut As Object
  Dim FileOutNameRoot As String
  Dim Found As Boolean
  Dim FSO As Object
  Dim LineOut As String
  Dim NumFileOut As Long
  Dim NumLine As Long
  Dim PathCrnt As String
  Dim PosCrnt As Long
  Dim PosStart As Long
  Dim RowDiagCrnt As Long
  Dim RowDiagNext As Long
  Dim StartTime As Single
  Dim TrailingFromLastBlock As String

  StartTime = Timer

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

  With Worksheets("DiagInfo")
    .Activate
    .Cells.EntireRow.Delete
    .Range("B1:C1").Merge
    With .Range("B1")
      .Value = "First"
      .HorizontalAlignment = xlCenter
    End With
    .Range("D1:E1").Merge
    With .Range("D1")
      .Value = "Last"
      .HorizontalAlignment = xlCenter
    End With
    .Range("A2").Value = "String"
    .Range("B2").Value = "File"
    .Range("C2").Value = "Line"
    .Range("D2").Value = "File"
    .Range("E2").Value = "Line"
    .Range("B2:E2").HorizontalAlignment = xlRight
    .Range("A1:E2").Font.Bold = True
    RowDiagNext = 3
    .Cells(RowDiagNext, 1).Select
  End With
  ActiveWindow.FreezePanes = False
  ActiveWindow.FreezePanes = True

  PathCrnt = ActiveWorkbook.Path
  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 1000000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInNameRoot & ".txt", 1, 0)
  '  1 = Read.  0 = ASCII file

  NumFileOut = 0
  TrailingFromLastBlock = ""

  Do While FileIn.AtEndOfStream <> True
    Block = TrailingFromLastBlock & FileIn.read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        Block = Block & FileIn.read(1)
      Else
        Exit Do
      End If
    Loop

    With Worksheets("DiagInfo")
      NumFileOut = NumFileOut + 1
      .Cells(RowDiagNext, 2).Value = NumFileOut
      NumLine = 1
      .Cells(RowDiagNext, 3).Value = NumLine
    End With

    Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutNameRoot & " " & _
                            Right("000" & NumFileOut, 3) & ".txt", True, False)
    ' True = Can overwrite.  False = ASCII

    PosStart = 1        ' Start of first line
    PosCrnt = 1
    Do While PosCrnt <= Len(Block)
      If Mid(Block, PosCrnt, 1) < " " Or _
         Mid(Block, PosCrnt, 1) = Chr(127) Then
        ' Have found a control character.
        LineOut = Mid(Block, PosStart, PosCrnt - PosStart)
        ' Build display string of control character and
        ' any subsequent control characters.
        CtrlChrStg = ""
        Do While True
          CtrlChrStg = CtrlChrStg & " " & Asc(Mid(Block, PosCrnt, 1))
          PosCrnt = PosCrnt + 1
          If PosCrnt > Len(Block) Then
            ' This block finished
            Exit Do
          End If
          If Mid(Block, PosCrnt, 1) < " " Or _
             Mid(Block, PosCrnt, 1) = Chr(127) Then
            ' Another control character
          Else
            ' First display character of next line
            Exit Do
          End If
        Loop
        ' Search for control character string in worksheet DiagInfo
        With Worksheets("DiagInfo")
          Found = False
          For RowDiagCrnt = 3 To RowDiagNext - 1
            If .Cells(RowDiagCrnt, 1).Value = CtrlChrStg Then
              Found = True
              Exit For
            End If
          Next
          If Not Found Then
            ' Previously unknown string of control characters
            RowDiagCrnt = RowDiagNext
            RowDiagNext = RowDiagNext + 1
            .Cells(RowDiagNext, 1).Select
            .Cells(RowDiagCrnt, 1).Value = "'" & CtrlChrStg
            ' First occurrence
            .Cells(RowDiagCrnt, 2).Value = NumFileOut
            .Cells(RowDiagCrnt, 3).Value = NumLine
          End If
          ' Last occurrence
          .Cells(RowDiagCrnt, 4).Value = NumFileOut
          .Cells(RowDiagCrnt, 5).Value = NumLine
        End With
        FileOut.writeline Right("00000" & NumLine, 6) & " " & _
                                                     LineOut & CtrlChrStg
        PosStart = PosCrnt          ' Start of current line
        NumLine = NumLine + 1
        If NumLine Mod 100 = 0 Then
          With Worksheets("DiagInfo")
           .Cells(RowDiagNext, 2).Value = NumFileOut
           .Cells(RowDiagNext, 3).Value = NumLine
          End With
        End If
      Else
        PosCrnt = PosCrnt + 1
      End If
    Loop
    FileOut.Close
    ' Save trailing characters for next line
    TrailingFromLastBlock = Mid(Block, PosStart, Len(Block) - PosStart + 1)
  Loop

  FileIn.Close

  With Worksheets("DiagInfo")
    .Cells(RowDiagNext, 2).Value = ""
    .Cells(RowDiagNext, 3).Value = ""
    .Cells(3, 1).Select
    .Cells.Columns.AutoFit
  End With

  Debug.Print Timer - StartTime

End Sub

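Revised solution

A review of the analysis output showed that the real problems are:

  • Large numbers of blank lines.
  • Extra line feeds.

There are also tabs in the text, but the questioner considers these not to be a problem and wants them kept. The questioner wants blank lines removed and line feeds replaced by spaces.

The routine below reads the input file in blocks of 100,000 bytes. Updating long strings creates a lot of overhead; limited experimentation suggests 100,000 is an acceptable compromise. If the last character of a block is a control character, the routine loops, adding one more character to the block, until the last character is not a control character. This ensures that a sequence of control characters is never split across two blocks. The routine first loops replacing CR LF CR LF by CR LF until there are no blank lines. It then looks for any LF not preceded by a CR and replaces each one found with a space. On a 63 Mb file with large numbers of blank lines and extra LFs, the routine takes 22 seconds to complete its task.

The only statements that need to be changed are near the top of the routine.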

Option Explicit
Sub RemoveUnwantedCtrlChars()

  Dim Block As String
  Dim BlockLen As Long
  Dim FileIn As Object
  Dim FileInName As String
  Dim FileOut As Object
  Dim FileOutName As String
  Dim FSO As Object
  Dim PathCrnt As String
  Dim PosCRLF As Long
  Dim PosLF As Long
  Dim PosLastCRLF As Long
  Dim PosLastLF As Long
  Dim StartTime As Single

  StartTime = Timer

  ' ## This assumes the input file is in the same folder
  ' ## as the workbook containing this macro.
  PathCrnt = ActiveWorkbook.Path

  ' ###### Replace names as required.
  FileInName = "TestSplitLine In.txt"
  FileOutName = "TestSplitLine Out.txt"

  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 100000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInName, 1, 0)
  '  1 = Read.  0 = ASCII file

  Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutName, True, False)
  ' True = Can overwrite.  False = ASCII

  Do While FileIn.AtEndOfStream <> True
    Block = FileIn.Read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        ' character
        Block = Block & FileIn.Read(1)
      Else
        Exit Do
      End If
    Loop
    ' Remove all blank lines
    Do While InStr(1, Block, vbCr & vbLf & vbCr & vbLf) <> 0
      Block = Replace(Block, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
    Loop
    ' Find all lone LFs and replace by " "
    PosLF = 1
    PosCRLF = 1
    Do While True
      PosLastLF = PosLF
      PosLastCRLF = PosCRLF
      PosLF = InStr(PosLF, Block, vbLf)
      PosCRLF = InStr(PosCRLF, Block, vbCr & vbLf)
      If PosLF = 0 Then
        ' No more LFs in this block
        Exit Do
      ElseIf PosCRLF <> 0 And PosLF > PosCRLF Then
        ' Have LF of CR LF.  No action required
        PosLF = PosLF + 1
        PosCRLF = PosLF
      Else
        ' Have a lone LF
        Block = Mid(Block, 1, PosLF - 1) & " " & Mid(Block, PosLF + 1)
        ' Move CRLF pointer back to position of replaced LF
        PosCRLF = PosLF
      End If
    Loop
    PosLF = 1
    FileOut.write Block
  Loop

  FileIn.Close
  FileOut.Close

  Debug.Print Timer - StartTime

End Sub
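The two clean-up steps used by the routine above (collapse blank lines, then replace lone LFs by spaces) can be checked on a small in-memory string. This is a sketch only: `CleanText` is a hypothetical name, and for the second step it uses the `Mid` statement to overwrite each lone LF in place rather than rebuilding the string as the routine does.

```vba
' Hypothetical helper, not part of the routine above.
' Returns Text with blank lines removed and lone LFs replaced by spaces.
Function CleanText(ByVal Text As String) As String
  Dim IsLone As Boolean
  Dim Pos As Long

  ' Step 1: collapse CR LF CR LF into CR LF until no blank lines remain
  Do While InStr(1, Text, vbCr & vbLf & vbCr & vbLf) <> 0
    Text = Replace(Text, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
  Loop

  ' Step 2: replace every LF that is not preceded by CR with a space
  Pos = InStr(1, Text, vbLf)
  Do While Pos <> 0
    If Pos = 1 Then
      IsLone = True
    Else
      IsLone = (Mid(Text, Pos - 1, 1) <> vbCr)
    End If
    If IsLone Then
      Mid(Text, Pos, 1) = " "     ' Overwrite the LF in place
    End If
    Pos = InStr(Pos + 1, Text, vbLf)
  Loop

  CleanText = Text
End Function
```

Because the replacement is one character for one character, the string length never changes during step 2, which keeps the position arithmetic simple.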

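Answer 1 (score: 0)

Notepad++ recognises those [] carriage returns. It is the only editor I have found that lets you search for and replace them. I have used it to clean up some very large .txt files prior to import.

It is freeware: http://notepad-plus-plus.org/download/v6.0.html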

Answer 2 (score: 0)

The Trim function? That would be:

=TRIM(A1:D12)

This will trim all the spaces after the line... if I remember my basics. That's what you are ultimately looking for...