Question

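I'm using jEdit and I have a bunch of badly encoded HTML files. I want to extract the main content from them, without the surrounding HTML.

I need everything between <div class="main-text"> and the next </div>.

There must be a regex way to do this, and jEdit lets me find and replace with regular expressions.

I'm not proficient with regex and it would take me a long time to work this out. Can anyone help?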

Answer 1

Taking your question literally, you can replace:

/.*<div class="main-text">(.*?)<\/div>.*/

with \1 (or $1, depending on what your editor uses).
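As a rough illustration outside the editor, the same replacement can be sketched in Python. The sample page below is made up, and the sketch assumes the pattern runs in a mode where "." also matches newlines (re.DOTALL here); without that, the leading and trailing ".*" cannot swallow the rest of a multi-line file.

```python
import re

# Hypothetical sample page; real files would be read from disk.
page = """<html><body>
<div class="header">Site name</div>
<div class="main-text">
<p>The content to keep.</p>
</div>
<div class="footer">Contact</div>
</body></html>"""

# re.DOTALL lets "." match newlines, so the ".*" on both sides
# consumes everything around the div, even across lines.
pattern = re.compile(r'.*<div class="main-text">(.*?)</div>.*', re.DOTALL)

# Replace the whole document with the contents of capture group 1.
main_text = pattern.sub(r"\1", page)
print(main_text)
```

Only the paragraph inside the main-text div (plus its surrounding newlines) is left after the substitution.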

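However, The Pony He Comes will bite you: what if your "main-text" element contains another <div>? If you're sure that won't happen, you're fine. Otherwise, this approach fails. It is probably easier to replace /.*<div class="main-text">/ with the empty string, then find the end by hand and delete everything after it.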
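The nested <div> problem is easy to reproduce. The snippet below is a small sketch with an invented input string; it shows that the lazy group stops at the first </div> it meets, which belongs to the inner element.

```python
import re

# Hypothetical input: the main-text div contains a nested div.
nested = '<div class="main-text">intro <div class="note">aside</div> conclusion</div>'

match = re.search(r'<div class="main-text">(.*?)</div>', nested, re.DOTALL)
captured = match.group(1)

# The capture ends at the inner </div>, so " conclusion" is lost.
print(captured)
```

The text after the nested element never reaches the capture group, which is why the regex is only safe when the main-text div has no div children.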

For that matter, this task is probably easiest to do by hand, so you don't have to double-check the result after your code runs.

Answer 2

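This regex should solve your problem: /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi

Here is an example in Perl:

my $str = '<div class="main-text"> and the next </div>';
# Print the captured text only if the pattern actually matched.
if ($str =~ /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi) {
    print $1;
}

The example is in Perl, but the regex can be applied independently of the language.

Here is an explanation of the regex: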

/       -start of the regex
   <\s*    -we can have < and whitespace after it
      div     -matches "div"
         \s+     -matches one or more whitespaces after the <div
         class="main-text"    -matches class="main-text" (so <div class="main-text" to here)
         [^>]*       -matches everything except >, this is because you may have more attributes of the div
         >          -matches >, so <div class="main-text"> until now
      (.*?)        -matches everything until </div> and saves it in $1
   <\/div>        -matches </div>, so now we have <div class="main-text">( and the next )</div> until now
/gi       -end of the regex; g matches all occurrences, i makes the regex case insensitive
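For comparison, here is a sketch of the same pattern in Python. The sample HTML is invented. The g flag has no direct counterpart, so findall is used to collect every match, and re.DOTALL is an added assumption so that the captured text may span several lines.

```python
import re

# Same pattern as above; IGNORECASE plays the role of the "i" flag.
DIV_PATTERN = re.compile(
    r'<\s*div\s+class="main-text"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical input: upper-case tag with an extra attribute, then a
# div whose content spans two lines.
html = '<DIV class="main-text" id="body">first</DIV>\n<div  class="main-text">second\nline</div>'

# findall returns the contents of group 1 for every match.
contents = DIV_PATTERN.findall(html)
print(contents)
```

Both divs are found: the extra id attribute is absorbed by [^>]*, and the upper-case tag matches because of the case-insensitive flag.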

Answer 3

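This regex captures the text between HTML tags:

<(?<tag>div).*?>(?<text>.*)</\k<tag>>

Breakdown:

<(?<tag>div).*?> : the opening div tag; the group is named "tag"
(?<text>.*) : captures the text between the tags
</\k<tag>> : the closing div tag, a backreference to the group named "tag"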

In the end, the match yields two groups, "tag" and "text"; the content you want is in "text".
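A short Python sketch of the named-group version follows. Python spells these constructs (?P<name>...) and (?P=name) rather than (?<name>...) and \k<name>, and the input string is invented. It also compares the greedy .* from the pattern above with a lazy .*?, because with more than one div in the input the greedy form runs on to the last closing tag.

```python
import re

# Hypothetical input with two divs in a row.
html = '<div class="main-text">one</div><div>two</div>'

# Greedy text group, as in the pattern above.
greedy = re.search(r'<(?P<tag>div).*?>(?P<text>.*)</(?P=tag)>', html, re.DOTALL)

# Lazy variant: stops at the first closing tag instead of the last.
lazy = re.search(r'<(?P<tag>div).*?>(?P<text>.*?)</(?P=tag)>', html, re.DOTALL)

print(greedy.group("tag"), repr(greedy.group("text")))
print(lazy.group("tag"), repr(lazy.group("text")))
```

If the goal is the text up to the next </div>, as in the question, the lazy form is the closer fit.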

