Question

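I have a file containing several hundred SQL INSERT statements. I only want to identify the statements that start with an HTML paragraph tag but have no closing paragraph tag.

I am trying something along these lines: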

<p>[^\n]*(?!</p>) <-- a <p> followed by any number of characters until \n and then no </p>

This doesn't work. Here is some sample data:

INSERT INTO `help` VALUES 
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),

Ideally, I would append a closing </p> where one doesn't exist, as in insert row #2.

Answer 1

If you use this:

(\(\d+,\d+,'<p>.*?)(</p>)?('\),)

you get backreferences to the following parts:

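(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes. <- i.e. the preamble and body text, including the opening P tag
</p> <- the optional closing P tag, i.e. you may not get a match for $2
'), <- the closing quote and parenthesis, plus the trailing comma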

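You can then replace it with:

$1</p>$3 (using .NET-style backreferences, for example).

That is, rebuild the string from the backreferences and always emit an explicit closing P tag, whether or not one was found.

Without knowing your platform, I can't give you the exact regex replacement syntax.

In .NET it would be: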

string input = @"INSERT INTO `help` VALUES 
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),";

Regex r = new Regex(@"(\(\d+,\d+,'<p>.*?)(</p>)?('\),)");
string output = r.Replace(input, "$1</p>$3");

Console.Write(output);

This produces the following output:

INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. </p>'),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),
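
For readers not on .NET, here is a minimal sketch of the same replacement in Python. The choice of Python and the names `ROW` and `close_paragraphs` are assumptions made for illustration; the pattern is the one from the answer above, and it presumes every row looks like the sample data (two numeric columns, then a quoted string that starts with <p> and contains no `'),` sequence of its own).

```python
import re

sql = """INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),"""

# Group 1: row prefix and body text (lazy, so it stops before an existing </p>).
# Group 2: the optional closing </p>.
# Group 3: the closing quote, parenthesis and trailing comma.
ROW = re.compile(r"(\(\d+,\d+,'<p>.*?)(</p>)?('\),)")


def close_paragraphs(text: str) -> str:
    """Rebuild each row with an explicit </p>, whether or not one was present."""
    return ROW.sub(r"\1</p>\3", text)


fixed = close_paragraphs(sql)
print(fixed)
```

Because group 2 is dropped and a literal </p> is always written back, running the replacement a second time leaves the text unchanged.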

Answer 2

If you are sure the text is followed by a quote ', the following works in Perl (I don't have Notepad++ to test with):

/<p> [^\n]* (?<! <\/p> )  (?=') /gx

(/x allows whitespace for clarity.) This is a negative lookbehind, anchored at a lookahead for the quote.
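
To see the difference between the two approaches, here is a small Python sketch that runs both the pattern from the question and the lookbehind pattern over the sample rows. The names `NAIVE`, `UNCLOSED` and `matching_rows` are made up for this illustration. The pattern from the question matches every row because `[^\n]*` greedily runs to the end of the line, where the negative lookahead for </p> succeeds trivially. The lookbehind version only reports rows whose closing quote is not immediately preceded by </p>. Note the assumption that the text itself contains no apostrophes; an apostrophe in the body would cause false positives.

```python
import re

sql = """INSERT INTO `help` VALUES
(1,1,'<p>Radiotherapy uses a beam of high&#45;energy rays (or particles) lymph nodes.</p>'),
(2,1,'<p>EBRT delivers radiation from a machine outside the body. '),
(3,1,'<p>Following lumpectomy radiotherapy <ul><li>Heading</li></ul></p>'),"""

# Pattern from the question: the lookahead is tested at the end of the line,
# where no </p> can follow, so it never rejects anything.
NAIVE = re.compile(r"<p>[^\n]*(?!</p>)")

# Lookbehind version: stop at a position that is followed by a quote
# and is NOT preceded by </p>.
UNCLOSED = re.compile(r"<p>[^\n]*(?<!</p>)(?=')")


def matching_rows(pattern: re.Pattern, text: str) -> list[str]:
    """Return the lines of text in which the pattern matches somewhere."""
    return [line for line in text.splitlines() if pattern.search(line)]


naive_hits = matching_rows(NAIVE, sql)
unclosed_hits = matching_rows(UNCLOSED, sql)

for row in unclosed_hits:
    print(row)
```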

RegEx: find one pattern followed by another pattern, with a gap in between