Question

I'm trying to split text on a double forward slash and/or a specific string (e.g. "and").

Example A:
text1 a/s // text2 a/b
text1 a/s and text2 a/b


Example B:
text1. // text2,// text3-
text1. and text2, and text3-

Example A should return two matches: "text1 a/s" and "text2 a/b".
Example B should return three matches: "text1.", "text2," and "text3-".

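I got a very helpful hint on how to split on single forward slashes (Split string on single forward slashes with RegExp), but finding a solution that splits on a double forward slash or a string instead proved too challenging.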

Bonus points for a single solution that handles both examples combined:

Example C:
text1 a/s // text2, and text3-

I'd prefer only VBA-compatible RegExp solutions.

Answer 1

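As you mention, you already have a working solution for a different split character in Split string on single forward slashes with RegExp. That code doesn't actually split the string; it matches everything except the "/" characters and then returns each match in a Collection (so yes, in the end it does split).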

What you need is to match each character in str, unless the next characters are // or and. We can use a lookahead for that.

Just change the pattern in that code to the following:

.Pattern = "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)"

Or, if you want to trim the whitespace around each token, use this regex instead:

.Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"

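Although this also matches the // or and, it uses a ( group ) to capture the actual token. You therefore have to use .SubMatches(0) (the contents of the first capturing group) to add the token to the collection.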

In your code, instead of coll.Add r_item.Value, use:

coll.Add r_item.SubMatches(0)

Note: if your string contains line breaks, don't forget to set .MultiLine = True on the rExp object.

VBA code:

Sub GetMatches(ByRef str As String, ByRef coll As Collection)

    Dim rExp As Object, rMatch As Object, r_item As Object

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = True      ' find every match, not just the first
        .MultiLine = True   ' let $ match at the end of each line
        .Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"
    End With

    Set rMatch = rExp.Execute(str)
    If rMatch.Count > 0 Then
        For Each r_item In rMatch
            ' Add only the captured token, not the delimiter
            coll.Add r_item.SubMatches(0)
        Next r_item
    End If
End Sub

This is how you can call it with your example:

Dim text As String
text = "t/xt1.//text2,and landslide/ andy  // text3-  and  text4"

'vars to get result of RegExp
Dim matches As New Collection, token
Set matches = New Collection

'Exec the RegExp --> Populate matches
GetMatches text, matches

'Print each token in debug window
For Each token In matches
    Debug.Print "'" & token & "'"
Next token
Debug.Print "======="

Each token is printed in the Immediate window.

This code is based on code originally posted by @stribizhev.

Output in the Immediate window:

't/xt1.'
'text2,'
'landslide/ andy'
'text3-'
'text4'
=======
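
To check the pattern against the three examples from the question, a minimal sketch along these lines could be used. It reuses the trimming pattern from above; SplitOnDelims and DemoExamples are hypothetical names introduced here for illustration and are not part of the original code.

```vba
' Hypothetical helper (not from the original answer):
' returns the tokens of str as a Collection.
Function SplitOnDelims(ByVal str As String) As Collection
    Dim rExp As Object, rItem As Object
    Dim result As New Collection

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = True
        .MultiLine = True
        ' Same trimming pattern as in GetMatches
        .Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"
    End With

    For Each rItem In rExp.Execute(str)
        result.Add rItem.SubMatches(0)   ' captured token only
    Next rItem

    Set SplitOnDelims = result
End Function

' Runs Examples A, B and C from the question through the helper.
Sub DemoExamples()
    Dim samples As Variant, sample As Variant, token As Variant

    samples = Array("text1 a/s // text2 a/b", _
                    "text1. // text2,// text3-", _
                    "text1 a/s // text2, and text3-")

    For Each sample In samples
        Debug.Print "Input: " & sample
        For Each token In SplitOnDelims(CStr(sample))
            Debug.Print "  '" & token & "'"
        Next token
    Next sample
End Sub
```

For Example C this should print the three tokens 'text1 a/s', 'text2,' and 'text3-', which is the combined behaviour the question asks for.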

A deeper explanation

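You may be wondering how this pattern works. I'll try to explain it in detail. To do so, let's look only at the important part of the pattern, using the following regex (the rest isn't really important):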

((?:(?!//|\band\b).)*)(?://|and|$)

It can easily be divided into two constructs:

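First, the subpattern ((?:(?!//|\band\b).)*) is a group that matches each token, capturing the text we want returned for each match. In VBA, groups are returned in .SubMatches(). Let's break it down:
- The inner expression (?!//|\band\b). first checks that the current position is not followed by a split string ("//" or "and"). If it isn't, the regex engine matches one character (note the dot at the end). That's it: it matches one character that is allowed as part of the token we're capturing.
- Now, it is wrapped in (?:(?!//|\band\b).)*, which repeats it for every character it can match, so we get all the characters in the token. This construct is the closest thing to a while loop:

  While it's not followed by a split string, get the next character.
- If you think about it, it's the construct we all know, .*, with an extra condition on each character.

Second, the subpattern (?://|and|$) is easier: it simply matches the split string ("//", "and" or the end of the line). It sits inside a non-capturing group, meaning it is matched but no copy of its value is stored.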

For example:

text1 a/s and text2 a/b//last
^        ^| |               [1]: 1st subpattern, captured in Matches(0).SubMatches(0)
|--------|^-^
|   1      2|               [2]: Split string, not captured but included in match
|-----------|
      3                     [3]: The whole match, returned by Matches(0)


For the second match, Matches(1).Value = " text2 a/b//"
                      Matches(1).Submatches(0) = " text2 a/b"
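
The difference between the whole match and the captured group can be inspected with a short sketch like the one below. It assumes the first (non-trimming) pattern from this answer; ShowMatchVersusSubMatch is a hypothetical name used only for illustration.

```vba
' Prints the whole match and the first capture group side by side.
Sub ShowMatchVersusSubMatch()
    Dim rExp As Object, matches As Object, i As Long

    Set rExp = CreateObject("vbscript.regexp")
    rExp.Global = True
    ' First pattern from the answer (no whitespace trimming)
    rExp.Pattern = "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)"

    Set matches = rExp.Execute("text1 a/s and text2 a/b//last")
    For i = 0 To matches.Count - 1
        ' Square brackets make leading/trailing spaces visible
        Debug.Print i & ": Value=[" & matches(i).Value & "]" & _
                    " SubMatches(0)=[" & matches(i).SubMatches(0) & "]"
    Next i
End Sub
```

Each printed line shows .Value including the delimiter and .SubMatches(0) without it, which is why the collection is filled from .SubMatches(0).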

The rest of the pattern is just details:

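(?!$) is there to avoid matching an empty string at the end of the line.
All the \s* are there to trim the tokens (to avoid capturing whitespace at the start or end of a token).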

Answer 2

Or the simplest way would be:

text = "text1 a/s // text2, and text3-"
text = Replace(text, " // ", vbNewLine)
text = Replace(text, " and ", vbNewLine)

arr = Split(text, vbNewLine)

For Each field In arr
  WScript.Echo Trim(field) 'Using Trim you can remove the spaces around
Next

You will get:

text1 a/s
text2,
text3-
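
The snippet above uses WScript.Echo, which belongs to VBScript running under Windows Script Host. Since the question asks for VBA, a minimal sketch of the same idea for the VBA editor might look like this; SplitWithReplace is a hypothetical name, and Debug.Print is swapped in for WScript.Echo.

```vba
' Same Replace/Split approach, written for VBA (Immediate window output).
Sub SplitWithReplace()
    Dim text As String
    Dim arr() As String
    Dim i As Long

    text = "text1 a/s // text2, and text3-"

    ' Turn both delimiters into one common separator
    text = Replace(text, " // ", vbNewLine)
    text = Replace(text, " and ", vbNewLine)

    arr = Split(text, vbNewLine)

    For i = LBound(arr) To UBound(arr)
        Debug.Print Trim$(arr(i))   ' Trim$ removes surrounding spaces
    Next i
End Sub
```

Note that this only recognises delimiters surrounded by spaces, so a string such as "text2,// text3-" from Example B would not be split at the "//"; the RegExp pattern in Answer 1 covers that case.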

