Question

I have multiple nested quotes in HTML, like this:

<div class="quote-container">
   <div class="quote-block">
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
   </div>
</div>

I need to find and remove the quotes. I am using this expression:

<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>

This works for a single quote. However, it breaks on multiply nested quotes (such as the example above).

What I need to match is:

<div class="quote-container">.*<div class="quote-block">

followed by any string that does not contain

<div

and that ends with

.*</div>.*</div>

I tried lookbehind and lookahead assertions like this:

<div class="quote-container">.*<div class="quote-block">.*(?!<div).*</div>.*</div>

But they don't work.

Is there a way to accomplish this? I need a Perl expression that I can use in TextPipe (I use it to parse forums, and then I run text-to-speech on the result).

Thanks in advance.

Answer 1

I think your problem is that you are using the greedy quantifier .*.

Try replacing every .* with the non-greedy .*?
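To illustrate the difference between the two quantifiers, here is a small sketch. It uses Python's re module purely for demonstration (the question asks for a Perl expression, but these constructs behave the same way there), and the sample string is made up. Note that the sample contains two sibling quotes rather than nested ones: on nested input, a lazy pattern still starts matching at the outermost opening tag, so laziness alone may not be enough.

```python
import re

# Hypothetical sample: two separate quotes with ordinary text between them.
html = (
    '<div class="quote-container"><div class="quote-block">first</div></div>'
    ' keep me '
    '<div class="quote-container"><div class="quote-block">second</div></div>'
)

# The pattern from the question: every .* is greedy.
greedy = re.compile(
    r'<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>',
    re.S,  # let . match newlines too
)

# The same pattern with every .* replaced by the lazy .*?
lazy = re.compile(
    r'<div class="quote-container">.*?<div class="quote-block">.*?</div>.*?</div>',
    re.S,
)

# Greedy stretches from the first opening tag to the last closing tag,
# so the text between the two quotes is removed as well.
greedy_result = greedy.sub('', html)

# Lazy stops at the nearest closing tags, so each quote is removed separately.
lazy_result = lazy.sub('', html)

print(repr(greedy_result))
print(repr(lazy_result))
```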

Answer 2

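Personally, I would solve this by replacing quotes repeatedly until there are none left to replace. There is really no way to handle this in a single regex replacement. What you need to do is: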

Pseudocode:

html = "... from your post ...";
do {
    // remember the input so we can tell whether this pass changed anything
    oldhtml = html
    html = replace(
        '/<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>/s',
        '',
        html
    )
} while (html != oldhtml)

This will handle all the nested quotes.
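As a runnable take on the loop above, here is a minimal Python sketch. It is not the exact pattern from the pseudocode: it assumes only whitespace sits between the container and block tags, and it uses a negative lookahead on every character of the body so that only an innermost quote (one whose body has no further `<div`) can match. The names `INNERMOST_QUOTE` and `strip_quotes` are made up for this example. One limitation follows from the assumption: a quote whose body contains any other `<div`, quote or not, will never match.

```python
import re

# Matches a quote whose body contains no further opening <div,
# i.e. an innermost quote. Assumes only whitespace between the two tags.
INNERMOST_QUOTE = re.compile(
    r'<div class="quote-container">\s*'
    r'<div class="quote-block">'
    r'(?:(?!<div).)*?'          # body: any characters, but never the start of "<div"
    r'</div>\s*</div>',
    re.S,                       # let . match newlines too
)

def strip_quotes(html):
    """Remove quotes from the inside out until a pass changes nothing."""
    while True:
        new_html, count = INNERMOST_QUOTE.subn('', html)
        if count == 0:
            return new_html
        html = new_html

sample = (
    'A<div class="quote-container"><div class="quote-block">outer'
    '<div class="quote-container"><div class="quote-block">inner</div></div>'
    'tail</div></div>B'
)
result = strip_quotes(sample)
print(result)
```

The first pass removes the inner quote, which turns the outer quote into an innermost one for the second pass.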

Answer 3

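Regular expressions are a poor choice for manipulating nested structures. I would write a dedicated parser for this problem (a simple stack-based parser would be enough).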
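A rough sketch of what such a stack-based approach could look like, written in Python for illustration. It is not a full HTML parser: it assumes the quote opening tag is spelled exactly `<div class="quote-container">`, that only div tags matter for nesting, and that the markup is reasonably balanced. The names `TAG`, `QUOTE_OPEN` and `remove_quotes` are invented for this example.

```python
import re

# Tokenize only div tags; everything else is treated as plain text.
TAG = re.compile(r'<div\b[^>]*>|</div>')
QUOTE_OPEN = '<div class="quote-container">'  # assumed exact spelling

def remove_quotes(html):
    out = []    # pieces of the document that lie outside any quote
    stack = []  # one flag per open div: True if that div is part of a quote
    pos = 0
    for m in TAG.finditer(html):
        in_quote = bool(stack) and stack[-1]
        if not in_quote:
            out.append(html[pos:m.start()])   # keep text before this tag
        tag = m.group()
        if tag == '</div>':
            # Drop the closing tag if the div it closes belonged to a quote.
            was_quote = stack.pop() if stack else False
            if not was_quote:
                out.append(tag)
        else:
            # A div is quoted if it opens a quote or sits inside one.
            quoted = in_quote or tag == QUOTE_OPEN
            stack.append(quoted)
            if not quoted:
                out.append(tag)
        pos = m.end()
    if not (stack and stack[-1]):
        out.append(html[pos:])                # keep trailing text
    return ''.join(out)

sample = (
    'A<div class="quote-container"><div class="quote-block">outer'
    '<div class="quote-container"><div class="quote-block">inner</div></div>'
    'tail</div></div>B<div id="x">keep</div>'
)
cleaned = remove_quotes(sample)
print(cleaned)
```

Because the stack tracks every open div, a single pass handles any nesting depth, and divs that are not quotes are left in place.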

Regex replacement of nested HTML quotes
