Question

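While parsing a web page, my parser stops because the DOM structure is invalid. I would like to repair the markup by replacing a node.

I found that an extra </div> is what makes the parser stop.

I need a regular expression that checks whether a </div> is followed by another </div> with no opening <div> tag in between. It should look for <div rather than <div>, because the opening tag may carry an id or class. The last </div> should then be replaced with <div></div>.

I.e., if a </div> is followed by a </div>, the last one gets replaced with <div></div>.

Thanks in advance.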

Example: <div> <img src="/lexus-share/images/spacer.gif" width="2" height="15" border="0" alt=""> </div> <a href="http://www.somedomain.com"><img src="/pub-share/images.jpg"></a> </div>

Answer 1

This only works if there are no nested <div>s (I'm not sure whether those are legal):

$result = preg_replace(
    '%</div>        # Match a closing div tag
    (               # Match and capture in group 1...
     (?:            # ...the following regex:
      (?!</?div\b)  # Unless a div tag (with or without attributes) starts here,
      .             # match any character (newlines too, because of the s flag).
     )*             # Repeat any number of times.
    )               # End of capturing group
    (?=</div>)      # Assert that a closing div tag follows%six',
    '</div><div>\1', $subject);

This changes

<div> <img src="/lexus-share/images/spacer.gif" width="2" height="15" border="0" alt=""> </div> <a href="http://www.somedomain.com"><img src="/pub-share/images.jpg"></a> </div>

into

<div> <img src="/lexus-share/images/spacer.gif" width="2" height="15" border="0" alt=""> </div><div> <a href="http://www.somedomain.com"><img src="/pub-share/images.jpg"></a> </div>
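To make the caveat about nested <div>s concrete, here is a small sketch. The helper name insertMissingOpeningDiv is made up for illustration, and it uses a compact form of the pattern with div\b rather than div> (my assumption, so that opening tags carrying an id or class also count as intervening tags).

```php
<?php
// Hypothetical helper wrapping a compact form of the pattern above.
function insertMissingOpeningDiv(string $html): string
{
    // Fall back to the input if preg_replace reports an error (returns null).
    return preg_replace(
        '%</div>((?:(?!</?div\b).)*)(?=</div>)%si',
        '</div><div>$1',
        $html
    ) ?? $html;
}

// Broken input: the second </div> has no matching opening tag.
$broken = '<div>a</div> <a href="#">b</a> </div>';
echo insertMissingOpeningDiv($broken), "\n";
// <div>a</div><div> <a href="#">b</a> </div>

// Valid nested input: the pattern still fires and unbalances the markup.
$nested = '<div><div>a</div></div>';
echo insertMissingOpeningDiv($nested), "\n";
// <div><div>a</div><div></div>
```

So the replacement is only safe on fragments where you already know the divs are not nested.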

Answer 2

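I would suggest trying a different approach instead of a regular expression, because nested tags are not easy to handle with one.

I don't know what language you are using to parse the document, but the logic you could code is: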

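Walk through the whole document searching for the string div>, and keep two counters, openingDivs and closingDivs.

If the character before div> is <, do openingDivs++. An opening tag with an id or class does not end in div>, so anything that starts with <div should be counted here as well.

If the character before div> is /, do closingDivs++ and then check if (closingDivs > openingDivs).

If that condition is true, the program can print the position of that </div>, or replace it with a space or an empty string.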
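The counting steps above could be sketched in PHP roughly like this. The function names are hypothetical, and two details are my own assumptions rather than part of the answer: tags are located with a regex on <div and </div so that opening tags with attributes are counted, and a stray </div> is taken back out of closingDivs so that later tags are still judged correctly.

```php
<?php
// Hypothetical helper: returns [offset, length] pairs for every closing
// div tag that has no opening <div ...> left to close.
function findStrayClosingDivs(string $html): array
{
    $openingDivs = 0;
    $closingDivs = 0;
    $stray = [];

    // Find opening tags (with or without attributes) and closing tags, in order.
    preg_match_all(
        '%<(/?)div\b[^>]*>%i',
        $html,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($matches as $m) {
        if ($m[1][0] !== '/') {      // group 1 is empty for an opening tag
            $openingDivs++;
            continue;
        }
        $closingDivs++;
        if ($closingDivs > $openingDivs) {
            $stray[] = [$m[0][1], strlen($m[0][0])];
            $closingDivs--;          // assumption: do not keep counting the stray tag
        }
    }
    return $stray;
}

// Hypothetical helper: deletes the stray closing tags found above.
function removeStrayClosingDivs(string $html): string
{
    // Work from the end of the string so earlier offsets stay valid.
    foreach (array_reverse(findStrayClosingDivs($html)) as [$offset, $length]) {
        $html = substr_replace($html, '', $offset, $length);
    }
    return $html;
}

echo removeStrayClosingDivs('<div class="x">a</div> b </div>'), "\n";
// <div class="x">a</div> b
```

This is only a sketch: it treats every <div ...> it sees as real markup, so div tags inside comments or scripts, or attribute values containing >, would throw the count off.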

Hope this helps. :)
