Excel或VBA将非结构化文本转换为列

时间:2018-10-17 21:44:27

标签: excel vba excel-vba

当文本没有适当的定界符时,如何将文本转换为列。

例如,如何更改以下行:

Excel View

变成这样:

enter image description here

在Excel中,“文本到”列似乎找不到正确的分隔符(空格,制表符...)。我在VBA中尝试了以下操作:

I1 = Mid(Cells(i, 1), 1, 16)
I2 = Mid(Cells(i, 1), 17, 33)
I3 = Mid(Cells(i, 1), 34, 49)
I4 = Mid(Cells(i, 1), 50, 53)
I5 = Mid(Cells(i, 1), 54, 66)
I6 = Mid(Cells(i, 1), 67, 82)
I7 = Mid(Cells(i, 1), 83, 99)
I8 = Mid(Cells(i, 1), 100, 116)
I9 = Mid(Cells(i, 1), 117, 133)

但是我知道它不适用于所有列。例如,对于I3,我得到了更多预期的值,例如:

enter image description here

我也尝试过替换标签(如果存在的话),例如:

MyString = Replace(MyString, vbTab, "")

但也不起作用。

还有另一种方法吗?

2 个答案:

答案 0 :(得分:1)

这是尝试使用自定义ReplaceWhitespace函数,该函数根据其长度依次替换空白部分。作为中间步骤,空白用分号代替;删除不必要的分号作为最后一步。 Split is used to read将解析后的字符串转换为数组,然后array is used将结果读取到工作表中。根据您的特定需求调整ReplaceWhitespace应该很简单。

请注意,此算法不会评估将单个空格字符的实例视为噪声(如在“未指定的管道中”)还是应视为有效的单词除错器(如在“单位成本”中)。因此,在ReplaceWhitespace"- -" ~~> "-;-"" UNASSIGNED " ~~> ";UNASSIGNED;"

中,将单个空白作为噪声被视为特殊情况。

假设屏幕截图中的数据位于A1:A4范围内,则此代码或多或少会产生所需的输出,如下面的屏幕截图所示。

编辑ReplaceWhitespace的初始设计基于反复试验。稍加思索之后,我意识到空格字符或分号为composite number的模式将由算法中查找那些字符数为质数的模式的行处理。我已经相应地更新了代码。

Sub ParseUnstructured()
    Dim i As Long
    For Each cell In Range("A1:A4")
        i = i + 1
        ' Clean whitespace:
        sRow = ReplaceWhitespace(cell.Value)
        ' Read to array
        Dim sArray() As String
        sArray() = Split(sRow, ";")
        ' Read to worksheet:
        Range("A1").Offset(5 + i).Resize(1, UBound(sArray)+1).Value = sArray
    Next cell
End Sub

Function ReplaceWhitespace(sInput As String) As String
    Dim sOutput As String
    ' Look for special cases with single-whitespace noise:
    sOutput = Replace(sInput, "- -", "-;-") ' Take care of "----- ----"
    sOutput = Replace(sOutput, "UNASSIGNED", ";UNASSIGNED;")
    ' Look for patterns where the number of "noise" characters is a prime number:
    sOutput = Replace(sOutput, "       ", ";") ' 7 whitespaces
    sOutput = Replace(sOutput, "     ", ";") ' 5
    sOutput = Replace(sOutput, "   ", ";") ' 3
    sOutput = Replace(sOutput, "  ", ";") ' 2
    ' sOutput = Replace(sOutput, " ", "_") ' 1 Optional
    sOutput = Replace(sOutput, ";;;;;", ";") ' 5 semicolons
    sOutput = Replace(sOutput, ";;;", ";") ' 3
    sOutput = Replace(sOutput, ";;", ";") ' 2
    sOutput = Replace(sOutput, "; ", ";") ' Takes care of some leftovers.
    ReplaceWhitespace = sOutput
End Function

运行ParseUnstructured()的结果:

enter image description here

答案 1 :(得分:1)

假设类别只能是几个定义的单词之一,您提供的数据确实具有规则的模式。

如果Category仅将是单个单词,则还可以假设UOM仅定义了几个单词。例如

  • 项目:第一个子字符串,后跟一个空格
  • 说明:单词数量可变,后跟类别
  • 类别:来自已定义词的列表
  • UOM:从已定义的单词列表中
  • 然后其余的都用空格隔开。

根据该模式,我们可以构造一个正则表达式,并在VBA宏中使用它来分割行。 当然,如果模式与此不同,则该方法将无效。但是您必须提供包含所有可变性的示例。

下面的宏假定类别将是ASSIGNEDUNASSIGNED,但是您可以在代码中的管道分隔列表中添加更多单词。

代码中还包含其他假设。

Option Explicit
Sub parseLine()
    Dim WS As Worksheet, R As Range, C As Range
    Dim RE As Object, MC As Object
    Dim vRes As Variant, I As Long

'Set original worksheet/range
'change to suit
'Below uses column A
Set WS = Worksheets("sheet1")
With WS
    Set R = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With

'Initialize regex engine
Set RE = CreateObject("vbscript.regexp")
With RE
    .Pattern = "^(\S+)\s+(.*)\s*\b(UNASSIGNED|ASSIGNED)\b\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)"
    .IgnoreCase = False
    .MultiLine = True
    .Global = True
End With

'Iterate through; create the Parse line and parse
Application.ScreenUpdating = False
For Each C In R
    If RE.Test(C.Text) = True Then
        Set MC = RE.Execute(C.Text)
        ReDim vRes(1 To MC(0).SubMatches.Count)
        For I = 1 To UBound(vRes)
            vRes(I) = MC(0).SubMatches(I - 1)
        Next I

        'write the results next to the column)
        With C.Offset(0, 1).Resize(columnsize:=UBound(vRes))
            .Clear
            .NumberFormat = "@"
            .Value = vRes
            .EntireColumn.AutoFit
        End With
    End If
Next C
Application.ScreenUpdating = True

End Sub