Question

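I'm parsing an HTML file to extract part of its text in order to create an EPUB. The problem is that in the extracted text the last paragraph is sometimes empty, and I want to remove it. So... how can I remove this blank paragraph (and any whitespace after it), on the condition that no further text appears on any later line of the code?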

For example:

<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet, consectetur adipiscing elit.”</p>
<p>Sed ut perspiciatis unde omnis iste natus error...</p>
<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet.”</p>
<p>Omnis iste natus error sit voluptatem.</p>
<p>&nbsp;</p>

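So, what should I do to remove the last instance of <p>&nbsp;</p> (and any whitespace/line breaks after it) from the code above?

I already tried this negative lookahead in perl-regex, <p>&nbsp;</p>\s*?(?!<p>), to exclude matches that are followed by another paragraph, but it still finds the earlier instances of <p>&nbsp;</p>. I only need to remove this paragraph if it's the last one in the file.

Thanks in advance!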

EDIT-1:

To be clear, I want this:

<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet, consectetur adipiscing elit.”</p>
<p>Sed ut perspiciatis unde omnis iste natus error...</p>
<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet.”</p>
<p>Omnis iste natus error sit voluptatem.</p>
<p>&nbsp;</p>

to become this:

<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet, consectetur adipiscing elit.”</p>
<p>Sed ut perspiciatis unde omnis iste natus error...</p>
<p>&nbsp;</p>
<p>“Lorem ipsum dolor sit amet.”</p>
<p>Omnis iste natus error sit voluptatem.</p>

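That is, I want to remove the <p>&nbsp;</p> on the last line (with no other text after it), so I'd like to know which perl-regex search I should use to match only this specific instance of the string, so that I can replace it with nothing and remove it from the code.

EDIT-2:

Based on a suggestion by ikegami, I used \s*(?><p>&nbsp;</p>\s*)(?!<p>)\s* as the search string to find only the last blank paragraph (<p>&nbsp;</p>) of the HTML code to be removed. What really made the difference in his answer seems to have been the use of an atomic grouping (?>...). Without it, I was picking up other instances of the same string from other lines of the code that I didn't want. Not sure why (I'm really not an expert on regex), but that's what I got from my tests.

I'm only using perl-regex for some basic find/replace operations in an e-book editor to clean up the code, so I'm not sure how it behaves in other situations. Anyway, I appreciate the other attempts to help me, even though some of them were too technical for me, and I hope this answer helps anyone with a similar problem in the future. Thanks again!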

Answer 1

Short answer: use the $ anchor.

Long answer:

#!/usr/bin/env perl

use strict;
use warnings;

my $fn = $ARGV[0] or die 'filename required!';

# Not certain what encoding your file is in
open (my $fh, '< :encoding(UTF-8)', $fn) 
    or die "could not open file '$fn': $!";

# slurp entire file
my $content = do{ local $/; <$fh>; };
close $fh;

# If it ends with the blank paragraph followed by newline/tab/space,
# overwrite the file
if ( $content =~ s/<p>&nbsp;<\/p>\s*$// ){
    open (my $fh, '> :encoding(UTF-8)', $fn)
        or die "could not open file '$fn' to write: $!";
    print $fh $content;
    close $fh;
}
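
To see the anchor's effect without touching a file, here is a minimal in-memory sketch. The sample string is a made-up stand-in for the extracted text, not the OP's actual file; the point is only that, without the /m flag, $ matches at the end of the string (or just before a final newline), so earlier blank paragraphs are left alone.

```perl
use strict;
use warnings;

# Hypothetical sample: blank paragraphs at the start, middle and very end.
my $html = join "\n",
    '<p>&nbsp;</p>',
    '<p>First paragraph.</p>',
    '<p>&nbsp;</p>',
    '<p>Second paragraph.</p>',
    '<p>&nbsp;</p>',
    '';

# Copy the string, then strip only a blank paragraph that ends the string.
# \s already covers newlines, tabs and carriage returns.
(my $trimmed = $html) =~ s/<p>&nbsp;<\/p>\s*$//;

print $trimmed, "\n";
```

If any text follows the final blank paragraph, the substitution simply doesn't match and the string is left unchanged.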

Answer 2

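I wouldn't use a regex at all. As you'll see, it's a complicated approach.

If you want to check that there's no <p> anywhere after <p>&nbsp;</p>\s*

You need to check that none of the following characters is the start of a <p>.

You want: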

s/(?><p>&nbsp;<\/p>\s*)(?=(?:(?!<p>).)*\z)//s

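Other changes:

The ? in \s*? is pointless; you don't want the minimal match.
(?>...) prevents the pattern from starting to look for <p> inside the \s*. In this particular pattern (but not the one below), it only serves as an optimization.

If you want to check that there's no <p> immediately after <p>&nbsp;</p>\s*

You want: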

s/(?><p>&nbsp;<\/p>\s*)(?!<p>)//
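
As a rough illustration, the two substitutions can be run side by side on a small sample. The sample string (with a closing </body> tag after the last blank paragraph) is an assumption made for this sketch, and s{...}{} delimiters are used so the slash in </p> needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical sample: three blank paragraphs, the last one followed by a tag
# that is not <p>.
my $html = "<p>&nbsp;</p>\n<p>One.</p>\n<p>&nbsp;</p>\n<p>Two.</p>\n<p>&nbsp;</p>\n</body>\n";

# Variant 1: no <p> anywhere between the match and the end of the string.
(my $anywhere = $html) =~ s{(?><p>&nbsp;</p>\s*)(?=(?:(?!<p>).)*\z)}{}s;

# Variant 2: no <p> immediately after the blank paragraph and its whitespace.
(my $adjacent = $html) =~ s{(?><p>&nbsp;</p>\s*)(?!<p>)}{};

print $anywhere eq $adjacent ? "same result\n" : "different results\n";
print $anywhere;
```

On this input both variants remove only the last blank paragraph. They differ when a blank paragraph is followed by some other tag with a <p> further down: variant 2 would match there, variant 1 would not.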

Answer 3

EDIT-2.:

Based on a suggestion by ikegami, I used 
\s*(?><p>&nbsp;</p>\s*)(?!<p>)\s* as the search string to find only 
the last blank paragraph (<p>&nbsp;</p>) of the html code to be 
removed. What really made the difference in his answer seems to have 
been the use of an atomic grouping (?>...). Without it, I was picking 
other instances of the same string from other lines of the code that I
didn't want. Not sure why (really not an expert on regex), but that's 
what I got from my tests.

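You shouldn't use something you don't understand. In fact, you should learn regex rather than asking others to do it for you.

Here is just a short education on what this regex does: \s*(?><p>&nbsp;</p>\s*)(?!<p>)\s*
The juxtaposition of (?><p>&nbsp;</p>\s*) and (?!<p>) prevents the engine from giving up a whitespace character to satisfy the assertion. If there is no whitespace between the paragraph tags, the atomic group is unnecessary.

To take this further, the assertion (?!<p>) only looks for a paragraph tag directly after <p>&nbsp;</p>\s*, which is a poor design choice. If there is no paragraph tag right there, the pattern matches, even if one exists somewhere further on.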
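
A small sketch may make both points concrete. The sample strings and variable names are made up for illustration; the patterns are the ones discussed above.

```perl
use strict;
use warnings;

my $html = "<p>&nbsp;</p>\n<p>One.</p>\n<p>&nbsp;</p>\n";

# Non-atomic: \s* can give back the newline, after which (?!<p>) looks at
# "\n" instead of "<p>" and succeeds, so the FIRST blank paragraph matches.
my $plain_pos  = $html =~ m{<p>&nbsp;</p>\s*(?!<p>)}     ? $-[0] : -1;

# Atomic: the group keeps the newline, the lookahead sees "<p>" and fails,
# so the engine moves on and matches only the LAST blank paragraph.
my $atomic_pos = $html =~ m{(?><p>&nbsp;</p>\s*)(?!<p>)} ? $-[0] : -1;

print "non-atomic match starts at offset $plain_pos\n";   # 0
print "atomic match starts at offset $atomic_pos\n";      # 26

# The design weakness: only the text right after the whitespace is checked,
# so a blank paragraph followed by <div> matches even though a <p> comes later.
my $mixed = "<p>&nbsp;</p>\n<div>x</div>\n<p>Later.</p>\n";
my $hit   = $mixed =~ m{(?><p>&nbsp;</p>\s*)(?!<p>)} ? 1 : 0;
print "matches before a later <p>: $hit\n";               # 1
```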

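Other design considerations:

You state that this is HTML, which means there will be other tags after the last <p>&nbsp;</p>. That means you don't need an EOS anchor, $ or \z, in the regex, unless you use the regex @ikegami suggested, (?><p>&nbsp;</p>\s*)(?=(?:(?!<p>).)*\z), which needs (?s) to work.

However, that regex is a terrible idea!!
Every time it finds a <p>&nbsp;</p>, it has to stop and search all the way to the end of the string.
If you have 500 <p>&nbsp;</p> in the document, that is for the most part equivalent to searching the same text 500 times.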

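The Safe and Sane Way

This way -> (?s).*\K<p>&nbsp;</p>\s*

It blows straight through with (?s).* to get directly to the last <p>&nbsp;</p>, then uses \K,
which drops that part from the match, leaving only <p>&nbsp;</p>\s* to be replaced with nothing.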
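
As a minimal sketch of that pattern in a Perl substitution (the sample strings are hypothetical; \K requires Perl 5.10 or later):

```perl
use strict;
use warnings;

my $html = "<p>&nbsp;</p>\n<p>One.</p>\n<p>&nbsp;</p>\n<p>Two.</p>\n<p>&nbsp;</p>\n";

# (?s).* runs to the end of the string, then backtracks to the LAST
# <p>&nbsp;</p>; \K keeps everything matched before it out of the replacement.
(my $trimmed = $html) =~ s{(?s).*\K<p>&nbsp;</p>\s*}{};
print $trimmed;

# The pattern does not itself check what follows the last blank paragraph:
# here the only blank paragraph is removed although text comes after it.
my $with_text = "<p>&nbsp;</p>\n<p>Tail.</p>\n";
(my $changed = $with_text) =~ s{(?s).*\K<p>&nbsp;</p>\s*}{};
print $changed;
```

As written, the pattern always targets the last <p>&nbsp;</p> in the string, so if the "nothing after it" condition matters, it would need an extra check on what follows.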

Benchmark:
Target sample contains the OP's sample repeated 10 times @ 30 <p>&nbsp;</p>

@ikegami Regex:   (?s)(?><p>&nbsp;</p>\s*)(?=(?:(?!<p>).)*\z)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    3.24 s,   3236.42 ms,   3236424 µs


Safe and Sane Regex:   (?s).*\K<p>&nbsp;</p>\s*
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    0.10 s,   102.04 ms,   102044 µs
