Question

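I'm still trying to write a function that extracts all headings (h1, h2, h3, ...) that have an id from an HTML string, so I can build a table of contents.

I wrote a simple script using a regular expression, but for some strange reason it collects only one match (the last one).

Here is my sample code: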

Function RegExResults(strTarget, strPattern)
    dim regEx
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Multiline = True
    Set RegExResults = regEx.Execute(strTarget)
    Set regEx = Nothing
End Function

htmlstr = "<h1>Documentation</h1><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p><h3 id=""one"">How do you smurf a murf?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.</p><h3 id=""two"">How do many licks does a giraffe?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>"

regpattern = "<h([1-9]).*id=\""(.*)\"">(.*)</h[1-9]>"

set arrayresult = RegExResults(htmlstr,regpattern) 
For each result in arrayresult
    response.write "count: " & arrayresult.count & "<br><hr>"
    response.write "0: " & result.Submatches(0) & "<br>"
    response.write "1: " & result.Submatches(1) & "<br>"
    response.write "2: " & result.Submatches(2) & "<br>"
Next

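I need to extract every heading and, for each one, know its level (1..9) and the id value used to jump to the right heading paragraph (#ID_value).

I hope someone can help me figure out why this doesn't work as expected.

Thanks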

Answer 1

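The .* in your pattern is greedy, but you need it to be lazy to collect every possible match. Use .*? instead.

With a few improvements, the pattern could look like this.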
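To make the difference concrete, here is a minimal sketch that counts the matches for a greedy and a lazy version of the pattern on a shortened input. CountMatches and the sample string are hypothetical names introduced only for this illustration, and the code assumes it runs inside a Classic ASP page (it uses Response.Write).

```vbscript
' Hypothetical helper (not from the question): returns how many matches a pattern finds.
Function CountMatches(strTarget, strPattern)
    Dim regEx
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.Global = True
    regEx.IgnoreCase = True
    CountMatches = regEx.Execute(strTarget).Count
    Set regEx = Nothing
End Function

Dim sample
sample = "<h3 id=""one"">First</h3><p>text</p><h3 id=""two"">Second</h3>"

' Greedy: the first .* runs to the end of the string and backtracks only
' as far as the LAST id=, so both headings are swallowed into one match.
Response.Write "greedy: " & CountMatches(sample, "<h([1-9]).*id=""(.*)"">(.*)</h[1-9]>") & "<br>"

' Lazy: each .*? stops at the first place where the rest of the pattern fits,
' so every heading becomes its own match.
Response.Write "lazy: " & CountMatches(sample, "<h([1-9]).*?id=""(.*?)"">(.*?)</h[1-9]>") & "<br>"
```

On this sample the greedy pattern should report a single match, whose submatches come from the last heading, while the lazy one should report two.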

regpattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>" 

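' \1 is a backreference to the 1st group, so the closing tag must match the opening one
' note that the 1st group now captures the tag name (e.g. "h3"), not just the digit
' the backslash (\) is redundant for escaping double quotes, so I removed it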
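As a rough sketch of how the matches from this pattern could be turned into a table of contents: the function name BuildToc, the list markup and the toc-hN class names below are assumptions made for illustration, not part of the original answer. Because the first group now holds the tag name, the level is taken from its second character.

```vbscript
' Hypothetical helper: builds a flat <ul> of links from headings that carry an id.
Function BuildToc(strHtml)
    Dim regEx, matches, m, level, toc
    Set regEx = New RegExp
    regEx.Pattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>"
    regEx.Global = True
    regEx.IgnoreCase = True
    Set matches = regEx.Execute(strHtml)

    toc = "<ul>"
    For Each m In matches
        ' SubMatches(0) is the tag name, e.g. "h3"; drop the leading "h" to get the level
        level = CInt(Mid(m.SubMatches(0), 2))
        ' SubMatches(1) is the id value, SubMatches(2) is the heading's inner HTML
        toc = toc & "<li class=""toc-h" & level & """>" & _
              "<a href=""#" & m.SubMatches(1) & """>" & m.SubMatches(2) & "</a></li>"
    Next
    toc = toc & "</ul>"

    BuildToc = toc
    Set regEx = Nothing
End Function

' Usage, assuming htmlstr holds the HTML from the question:
' Response.Write BuildToc(htmlstr)
```

Headings without an id, such as the h1 in the question's sample, are not matched by this pattern, so they would not appear in the list.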

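I strongly recommend reading Repetition with Star and Plus. The article is very helpful for understanding lazy and greedy repetition in regular expressions.

Oh, I almost forgot: You can't parse HTML with Regex, or at least you shouldn't.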
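One way this caveat shows up in practice: with .*? before id=, a heading that has no id can borrow the id of a later heading. The sketch below compares the pattern above with a tighter variant that keeps the attribute search inside the opening tag by using [^>]*. The tighter pattern, the CountHeadings helper and the sample input are my own assumptions for illustration, and the variant is still no substitute for a real HTML parser.

```vbscript
' Hypothetical helper: returns how many headings a pattern finds.
Function CountHeadings(strHtml, strPattern)
    Dim regEx
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.Global = True
    regEx.IgnoreCase = True
    CountHeadings = regEx.Execute(strHtml).Count
    Set regEx = Nothing
End Function

Dim tricky
tricky = "<h2>Intro</h2><h3 id=""a"">A</h3><h2 id=""b"">B</h2>"

' Loose: starting at the id-less <h2>, .*? crosses into the <h3> tag to find id=,
' then (.*?) runs on to the last </h2>, so the whole string becomes one bogus match.
Response.Write "loose: " & CountHeadings(tricky, "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>") & "<br>"

' Tight: [^>]* cannot leave the opening tag, so the id-less <h2> is skipped
' and the two headings that do have an id are matched separately.
Response.Write "tight: " & CountHeadings(tricky, "<(h[1-9])\b[^>]*\bid=""([^""]*)""[^>]*>(.*?)</\1>") & "<br>"
```

On this input the loose pattern should find one match and the tight one two. Nested headings, single-quoted or unquoted ids, and line breaks inside a heading (the VBScript dot does not match newlines) would still trip up both patterns.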

Using RegEx to get all headings to build a ToC (Classic ASP)
