使用vba解析excel文件时出现问题

时间:2012-01-30 12:32:29

标签: regex excel parsing vba excel-vba

我是VBA新手并且遇到以下问题。

我获得了一个带有门牌号的excel文件,其中每个房子都有相应的公寓号码。这里棘手的是没有结构化的方式来显示这些数据。

这是我的意思,对于前。我在单元格(4:ABC)中有房号N10,在这个单元格下面我有随机数的平面数字。房屋号码11位于单元格(4:DEF)中,同样具有随机数量的单位。我需要将这些数据带入结构化的方式,其中平面的数量将位于一列中,相应的门牌号位于下一列中。另一个问题是,在一个文档中,大约有15张具有相似数据的纸张,并且周围有20个这样的excel文件。因此,需要将大量数据合并到一个结构化文档中。

我不知道如何开始。解析这些数据非常困难,尽管我假设我必须使用正则表达式。有用的是门牌号由2位数组成,而平号几乎都是相同的格式 - (5位数字和一个字母)。所以我假设我可以通过每个单元格和那些匹配正则表达式的文章写入新文档但是那么会出现与门牌号匹配的问题?拜托,有人吗?任何想法......

我的输入:

   |  A   |  B   |  C   |  D   |  E   |  F   |       | ...   | N    |
  1|             Header                      |       | ...   |      |
  2|             Header  N2                  |       | ...   |      |
  3|             Header N3                   |       | ...   |      |
  4|         N9/10      |      |      N11/12         | ...
  5|Smith               |      |Jones |Tim   |       | ...
  6|Green               |      |Singh |Roth  |       | ...
  7|Abbott              |      |Patel |              | ...
  8|11111a|22222a|33333a|      |22222a|33333a| 44444c|
  9|11111b|22222b|33333b|      |22222b|33333b| 44443d|
   :
 21|11111u|22222u|33333u|      |22222u|33333u| 44444e|
 22|      |22222v|33333v|      |22222v|33333v| 77777e|
 23|      |      |33333w|      |      |      |       | 
 24|      |      |33333x|      |      |      |       |
   :
                B L A N K          CELLS
   .                                            .
   .                                            .
 31|       N375/376     |      |  N96/85     |
 32|Smith               |      |Jones |Tim   |       | ...
 33|Green               |      |Singh |Roth  |       | ...
 34|Abbott              |      |Patel |              | ...
 35|11111a|22222a|33333a|      |22222a|33333a| 44444c|
 36|11111b|22222b|33333b|      |22222b|33333b| 44443d|
   :
 45|11111u|22222u|33333u|      |22222u|33333u| 44444e|
 46|      |22222v|33333v|      |22222v|33333v| 77777e|
 47|      |      |33333w|      |      |33333w| 

您的问题的答案:

  1. 在第1行至第3行中,只有文档标题,城市名称,区域(根本不重要)
  2. 姓氏与单位数量不符,我不知道为什么。这个信息对我来说并不重要
  3. 第25-30行 - 空白,45-50空白,61-64 - 空白,69-78 - 空白,在某些情况下88 - 106也是空白,在某些情况下文档以87结尾
  4. 行不同,这就是问题所在。例如,你可以看到在11/12房子下面的行有相似的长度,但这只是一个巧合。每个房子下面没有一定长度的行
  5. 是的,有3个空白单元格。
  6. 我注意到在所有文档中使用的最后一列是N
  7. 我忽略了姓氏,因为我不需要他们
  8. 我想要的是什么:

     |  A   |  B   |  C   | 
    1|11111a|N9/10 |      | 
    2|11111b|N9/10 |      |
    3|11111c|N9/10 |      | 
     : 
    x|11111a|N11/12|      |
    x|11111b|N11/12|      |
    x|11111c|N11/12|      |
    

1 个答案:

答案 0 :(得分:0)

这应该让你开始。我发现很难测试我的代码,但我希望没有太多的错误。我已经包含了解释每个部分目的的评论,但我不知道你是否了解足够的VBA以了解我是如何实现这一目的的。

我需要您将要从中提取数据的所有工作簿(源工作簿)移动(或复制)到同一文件夹。如果该文件夹仅包含源工作簿和我要求您创建的工作簿,那么我的第一个宏将更好用。

在该文件夹中,我需要您创建一个名为“Consolidate.xls”的新工作簿。 (我使用的是Excel 2003,您可能有不同的扩展名。)

打开Consolidate.xls,然后打开VB编辑器。 (我假设你已经掌握了Excel宏的基本知识。如果有必要,你将不得不等我起床。你比我早6个小时,我退休了所以我没有起床。)

创建一个模块并将下面的代码复制到它。

运行宏FillSourceSheets。这将创建一个工作表“SourceSheets”并用工作簿和工作表名称填充它。在我的系统上,它看起来像:

 |   A  |       B             |     C     |  D   |   E  | 
 |Status|Source workbook      |Source worksheets -->    |       
 |      |Consolidate.txt      |Consolidate|      |      |
 |      |Test Parse data 1.xls|Sheet1     |Sheet2|Sheet3|
 |      |Test Parse data 2.xls|Sheet1     |Sheet4|Sheet2|Sheet3|

暂时不要担心“状态”列。

“源工作簿”下的名称是Excel可以打开的文件。请注意,它可以打开文本文件。 “Test Parse data 1.xls”和“Test Parse data 2.xls”是我的测试工作簿。每个文件中的工作表都列在找到的序列中。

您需要删除不是源文件的文件的任何行。我不得不删除“Consolidate.txt”的行。您需要删除任何不包含源数据的工作表的名称。在我的情况下,“Test Parse data 2.xls”的“Sheet4”不包含源数据,我不得不删除它。他们一定不能有任何差距。列B中没有任何内容的行结束列表。没有任何内容的单元格会结束一行。因此,在编辑我的工作表后,它看起来像:

 |   A  |       B             |     C     |  D   |   E  | 
 |Status|Source workbook      |Source worksheets -->    |       
 |      |Test Parse data 1.xls|Sheet1     |Sheet2|Sheet3|
 |      |Test Parse data 2.xls|Sheet1     |Sheet2|Sheet3|

此列表是驱动其他宏的原因。它告诉他们要查看哪些工作簿和工作表。

我包含的另一个宏是ValidateSheets

您有20个工作簿,其中一个包含15个工作表。如果没有错误会很奇怪,如果有些不符合你给我的格式,那将是惊人的。 ValidateSheets做了第一级验证,因为在我们确切知道我们拥有的东西之前没有任何目的。

ValidateSheets在“SourceSheets”中查找工作簿和工作表列表。它会创建一个文本文件“Process Report.txt”,列出其进度。如果它在“SourceSheets”中发现错误,它会立即停止并在屏幕上显示一条消息。 “SourceSheets”中不应出现任何错误,但如果有必要修复它们,请重新启动宏。如果它在工作表中发现错误,它会向“Process Report.txt”输出一条错误消息并继续下一个工作表。

尝试运行ValidateSheets,看看您取得了哪些进展。

Option Explicit
Sub FillSourceSheets()

  Dim ColCrnt As Long
  Dim ErrMsg As String
  Dim Filename As String
  Dim InxWSheet As Long
  Dim PathCrnt As String
  Dim RowCrnt As Long
  Dim WBookOther As Workbook
  Dim WBookThis As Workbook

  If Workbooks.Count > 1 Then
    ' It is easy to get into a muddle if there are multiple workbooks
    ' open at the start of a macro like this.  Avoid the problem.
    Call MsgBox("Please close all other workbooks", vbOKOnly)
    Exit Sub
  End If

  ' Record this workbook so we do not confuse it with any we open
  Set WBookThis = ActiveWorkbook
  ' Record the folder containing the current workbook
  PathCrnt = ActiveWorkbook.Path

  ' Create a new worksheet, name it SourceSheets and fill the heading row
  Sheets.Add
  With ActiveSheet
    .Name = "SourceSheets"
    .Range("A1").Value = "Status"
    .Range("B1").Value = "Source workbook"
    .Range("C1").Value = "Source worksheets -->"
    .Range("C1:E1").MergeCells = True
    .Range("A1:C1").Font.Bold = True
  End With
  RowCrnt = 2

  Filename = Dir$(PathCrnt & "\*.*")
  ' Loop for every file in the activeworkbook's folder
  Do While Filename <> ""
    If Filename <> ActiveWorkbook.Name Then
      ' This file is not the active workbook
      ' so try to open it as a workbook.
      Err.Clear
      ErrMsg = ""
      On Error Resume Next
      Set WBookOther = Workbooks.Open(PathCrnt & "\" & Filename)
      If Err.Number <> 0 Then
        ' On Error GoTo 0 clears Err.Num and Err.Description so save
        ErrMsg = Err.Number & " " & Err.Description
      End If
      On Error GoTo 0
      If ErrMsg <> "" Then
        ' This file cannot be opened by Excel
        Debug.Print Filename & " " & ErrMsg
      Else
        ' This file has been successfully opened.  Create a row for it.
        ' Start by placing the file name in column 2.
        WBookThis.Sheets("SourceSheets").Cells(RowCrnt, 2).Value = Filename
        ColCrnt = 3
        ' Place each sheet name in a cell starting from column 3
        For InxWSheet = 1 To WBookOther.Worksheets.Count
          WBookThis.Sheets("SourceSheets").Cells(RowCrnt, ColCrnt).Value = _
                                    WBookOther.Worksheets(InxWSheet).Name
          ColCrnt = ColCrnt + 1
        Next
        WBookOther.Close SaveChanges:=False
        RowCrnt = RowCrnt + 1
      End If
    End If
    Filename = Dir$       ' Get next file name
  Loop
  With WBookThis.Sheets("SourceSheets")
    .Columns.AutoFit
  End With

End Sub
Sub ValidateSheets()

  Dim CellValue As String
  Dim ColSrcList As Long
  Dim ColSrcSheetCrnt As Long
  Dim ColSrcSheetLast As Long
  Dim Found As Boolean
  Dim InxWSheetCrnt As Long
  Dim OutputFileNum As Integer
  Dim PathCrnt As String
  Dim Rng As Range
  Dim RowSrcList As Long
  Dim RowSrcSheetBlockStart As Long
  Dim RowSrcSheetCrnt As Long
  Dim RowSrcSheetFinal As Long
  Dim WBookOtherNameCrnt As String
  Dim WSheetOtherNameCrnt As String
  Dim WBookOther As Workbook
  Dim WBookThis As Workbook

  If Workbooks.Count > 1 Then
    ' It is easy to get into a muddle if there are multiple workbooks
    ' open at the start of a macro like this.  Avoid the problem.
    Call MsgBox("Please close all other workbooks", vbOKOnly)
    Exit Sub
  End If

  ' Record this workbook so we do not confuse it with any we open
  Set WBookThis = ActiveWorkbook
  ' Record the folder containing the current workbook
  PathCrnt = ActiveWorkbook.Path
  ' Open text file to which progress messages will be written
  OutputFileNum = FreeFile
  Open PathCrnt & "\Process Report.txt" For Output Lock Write As #OutputFileNum

  With WBookThis.Sheets("SourceSheets")
    ' Load name of first workbook and first worksheet
    RowSrcList = 2      ' Row of first workbook
    ColSrcList = 3      ' Column of first worksheet
    WBookOtherNameCrnt = .Cells(RowSrcList, 2).Value
    WSheetOtherNameCrnt = .Cells(RowSrcList, ColSrcList).Value
  End With

  ' This loop repeats for each worksheet listed in worksheet SourceSheets
  Do While True
    ' WBookOtherNameCrnt and WSheetOtherNameCrnt have been loaded either
    ' before this loop or by the code at the end
    If Not WBookOther Is Nothing Then
      ' There is an open workbook.  Check it is the one
      ' required for this loop.
      If LCase(WBookOtherNameCrnt) <> LCase(WBookOther.Name) Then
        ' This is not the same workbook.
        ' Close the open workbook and clear reference to it
        WBookOther.Close SaveChanges:=False
        Set WBookOther = Nothing
      End If
    End If
    If WBookOther Is Nothing Then
      ' The workbook to be tested is not open so we need to
      ' open it.  First check it exists
      If Dir$(PathCrnt & "\" & WBookOtherNameCrnt) <> "" Then
        ' The specified file exists but it may not be a valid workbook.
        ' Use Excel's error handling
        Err.Clear
        On Error Resume Next
        Set WBookOther = Workbooks.Open(PathCrnt & "\" & WBookOtherNameCrnt)
        On Error GoTo 0
        If Err.Number <> 0 Then
          Call MsgBox("Open of """ & WBookOtherNameCrnt & """ failed. " & _
                  "Error: " & Err.Number & " " & Err.Description, vbOKOnly)
          Set WBookOther = Nothing
          Close OutputFileNum       ' Close text file
          Exit Sub
        End If
      Else
        Call MsgBox("I could not find workbook """ & WBookOtherNameCrnt _
                                     & """", vbOKOnly)
        Close OutputFileNum       ' Close text file
        Exit Sub
      End If
    End If
    ' The required workbook is open.
    With WBookOther

      ' Check the worksheet exists
      Found = False
      For InxWSheetCrnt = 1 To .Worksheets.Count
        If .Worksheets(InxWSheetCrnt).Name = WSheetOtherNameCrnt Then
          Found = True
          Exit For
        End If
      Next
      If Not Found Then
        ' The workbook was not found
        Call MsgBox("I could not find worksheet """ & WSheetOtherNameCrnt _
             & """ with workbook """ & WBookOtherNameCrnt & """", vbOKOnly)
        .Close
        Close OutputFileNum       ' Close text file
        Exit Sub
      End If

      Print #OutputFileNum, "Process sheet """ & WSheetOtherNameCrnt & _
            """ of workbook """ & WBookOther.Name & """"

      With Sheets(WSheetOtherNameCrnt)

        ' Validate source sheet matches expected format
        ' Rows 1 to 3 are ignored.

        ' Find final row of sheet
        Set Rng = .Cells.Find("*", .Range("A1"), _
                                       xlFormulas, , xlByRows, xlPrevious)
        If Rng Is Nothing Then
          ' The sheet is empty
          Print #OutputFileNum, "  Sheet is empty"
          Exit Do
        End If
        RowSrcSheetFinal = Rng.Row

        ' There are one of more blocks.  The first block starts in Row 4

        RowSrcSheetBlockStart = 4
        Do While True       ' Loop for each block
          ' Row 1 of a block must consider of one or more three cell merged
          ' areas.  Each merged area contains a string with value of format:
          ' "N" number "/" number.
          ' Search backwards from the column 1 of the next row
          ' for a cell with a value
          Set Rng = .Cells.Find("*", .Cells(RowSrcSheetBlockStart + 1, 1), _
                                            xlFormulas, , xlByRows, xlPrevious)
          If Rng Is Nothing Then
            ' This should not be possible because have
            ' already check for empty sheet
            Print #OutputFileNum, "  Sheet is empty"
            Exit Do
          End If
          If Rng.Row <> RowSrcSheetBlockStart Then
            Print #OutputFileNum, "  I was expecting a value on row " & _
                                                           RowSrcSheetBlockStart
            Exit Do
          End If
          ColSrcSheetLast = Rng.Column
          For ColSrcSheetCrnt = 1 To ColSrcSheetLast Step 3
            ' Check the three cells are merged
            If .Range(.Cells(RowSrcSheetBlockStart, ColSrcSheetCrnt), _
                 .Cells(RowSrcSheetBlockStart, ColSrcSheetCrnt + 2)).MergeCells _
                                                               = True Then
              If Not .Cells(RowSrcSheetBlockStart, ColSrcSheetCrnt).Value Like "N*/*" Then
                ' Cell does not contain "N" number "/" number
                Print #OutputFileNum, "  Row " & RowSrcSheetBlockStart & _
                        " is the start of a block. I was expecting " & _
                        "columns " & ColNumToCode(ColSrcSheetCrnt) & " to " & _
                        ColNumToCode(ColSrcSheetCrnt + 2) & " to contain a value with " & _
                        "the format ""N"" number ""/"" number"
                Exit Do
              End If
            Else
              ' Three cells are not merged
              Print #OutputFileNum, "  Row " & RowSrcSheetBlockStart & _
                    " is the start of a block. I was expecting " & _
                    "columns " & ColNumToCode(ColSrcSheetCrnt) & " to " & _
                    ColNumToCode(ColSrcSheetCrnt + 2) & " to be merged"
              Exit Do
            End If
          Next
          ' Cells with Rows 2 to 4 of a block must contain
          ' surnames or be empty.
          ' Check they do not contain flat numbers
          For RowSrcSheetCrnt = RowSrcSheetBlockStart + 1 To _
                                RowSrcSheetBlockStart + 3
            For ColSrcSheetCrnt = 1 To ColSrcSheetLast + 2
              CellValue = .Cells(RowSrcSheetCrnt, ColSrcSheetCrnt).Value
              If CellValue = "" Or Not LCase(CellValue Like "#####[a-z]") Then
                ' Cell valid
              Else
                ' Cell contains a flat number.  Cannot be a surname.
                Print #OutputFileNum, "  Row " & RowSrcSheetCrnt & _
                      " should only contain surnames but column " & _
                      ColNumToCode(ColSrcSheetCrnt) & " contains a flat number"
                Exit Do
              End If
            Next
          Next
          ' Rows 5 of a block to the next blank row should contain
          ' nothing but flat numbers.
          RowSrcSheetCrnt = RowSrcSheetBlockStart + 5
          Do While True     ' Loop until find a blank row
            Found = False  ' Nothing found on this row
            For ColSrcSheetCrnt = 1 To ColSrcSheetLast + 2
              CellValue = .Cells(RowSrcSheetCrnt, ColSrcSheetCrnt).Value
              If CellValue <> "" Then
                Found = True  ' Value found on this row
                If LCase(CellValue Like "#####[a-z]") Then
                  ' Cell valid
                Else
                  ' Cell does not contain a flat number
                  Print #OutputFileNum, "  Row " & RowSrcSheetCrnt & _
                        " should only contain flat numbers but column " & _
                        ColNumToCode(ColSrcSheetCrnt) & " contains " & CellValue
                  Exit Do
                End If
              End If
            Next
            If Not Found Then
              ' This is a blank line
              Exit Do
            End If
            RowSrcSheetCrnt = RowSrcSheetCrnt + 1
          Loop
          ' This block is finished.
          Print #OutputFileNum, "  No error found in block starting " & _
                                "on row " & RowSrcSheetBlockStart
          ' Is there another block?
          If RowSrcSheetCrnt > RowSrcSheetFinal Then
            ' No more blocks
            Exit Do
          Else
            ' Find the next row with a value which should be the first
            ' row of the next block.
            Set Rng = .Cells.Find("*", .Cells(RowSrcSheetCrnt, 1), _
                                          xlFormulas, , xlByRows, xlNext)
            If Rng Is Nothing Then
              ' This should not be possible since
              ' have already found a value on a later row
              Print #OutputFileNum, "  I expected another block under row " _
                          & RowSrcSheetCrnt & " but I could not find it."
              Exit Do
            End If
            RowSrcSheetBlockStart = Rng.Row
          End If
        Loop
      End With
    End With

    ' Load details of next worksheet
    With WBookThis.Sheets("SourceSheets")
      ColSrcList = ColSrcList + 1
      ' Load name of next worksheet
      WSheetOtherNameCrnt = .Cells(RowSrcList, ColSrcList).Value
      If WSheetOtherNameCrnt = "" Then
        ' There are no more worksheets on this row
        RowSrcList = RowSrcList + 1
        WBookOtherNameCrnt = .Cells(RowSrcList, 2).Value
        If WBookOtherNameCrnt = "" Then
          ' End of list reached.
          Exit Do
        End If
        ColSrcList = 3
        WSheetOtherNameCrnt = .Cells(RowSrcList, ColSrcList).Value
        If WSheetOtherNameCrnt = "" Then
          Call MsgBox("Row " & RowSrcList & " of SourceSheets has a " & _
                      "workbook name but no worksheet name.", vbOKOnly)
          Close OutputFileNum       ' Close text file
          Exit Sub
        End If
      End If
    End With
  Loop

  If Not WBookOther Is Nothing Then
    WBookOther.Close SaveChanges:=False
    Set WBookOther = Nothing
  End If

  Close OutputFileNum       ' Close text file

End Sub
Function ColCodeToNum(ColStg As String) As Long

  Dim lcColStg                  As String

  lcColStg = LCase(ColStg)
  ColCodeToNum = IIf(Len(ColStg) > 1, (Asc(Left(ColStg, 1)) - 64) * 26, 0) + _
                 Asc(Right(ColStg, 1)) - 64

End Function
Function ColNumToCode(ColNum As Long) As String

  ColNumToCode = IIf(ColNum > 26, Chr(64 + ((ColNum - 1) \ 26)), "") & _
                 Chr(65 + ((ColNum - 1) Mod 26))

End Function