Finding unclosed quotes (' or “ style)

Time: 2014-07-23 21:14:23

Tags: regex

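I am editing some text taken directly from an OCR engine, and in some paragraphs the OCR engine dropped opening and closing quotation marks. I prefer to edit in HTML mode, so I end up with text such as: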

<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>

Note the missing &ldquo;.

Another sentence:

<p>&ldquo;He said he&rsquo; coming afer you,&rdquo; Harry said, and he&rsquo; bringing the boys too!&rdquo;</p>

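I am using this regex: ([>\.\,])(.*?)&rdquo; which seems to work for the second sentence but not for the first. This is because the regex matches from left to right, so it also captures the extra sentence The street light lit up his aged, rat face., which should not be inside the quotes. I think the problem could be solved if matching were done from right to left. I know that is an option available in C#, but I am editing a plain text file with the regex engine of a text-based editor. Is there a way to find only the last sentence before the &rdquo;, i.e. the sentence Who&rsquo;s on the move?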

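[Edit] I have been trying a lookbehind regex: (?<=(?:\. |, |>)(\w)(.*?))(&rdquo;) It seems to find all the sentences that are missing the opening quote &ldquo;, but the problem is that I cannot replace the content inside the (?<=) construct using the replacement \3&ldquo;\1\2\3, because the lookbehind is zero-length. Instead, the text is simply repeated. For example, with the regex above, the sentence Who&rsquo;s on the move?&rdquo; becomes Who&rsquo;s on the move?&rdquo;&ldquo;Who&rsquo;s on the move?&rdquo;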

Any ideas would be appreciated. Thanks.

1 answer:

Answer 0 (score: 4)

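Recursion and Defined Subroutines

The following regex checks whether a string is balanced. The code below (see the output in the online demo) checks several strings. The explanation is in the comments.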

$balanced_string_regex = "~(?sx)                  # Free-Spacing
(?(DEFINE)            # Define a few subroutines
   (?<double>&ldquo;(?:(?!&[lr]squo;).)*&rdquo;)  # full set of doubles (no quotes inside)
   (?<single>&lsquo;(?:(?!&[lr]dquo;).)*&rsquo;)  # full set of singles (no quotes inside)
   (?<notquotes>(?:(?!&[lr][sd]quo;).)*)          # chars that are not quotes
)                     # end DEFINE

^                       # Start of string
(?:                     # Start non-capture group
   (?&notquotes)        # Any non-quote chars
   &l(?<type>[sd])quo;  # Opening quote, capture single or double type
   # any full singles, doubles, not quotes or recursion
   (?:(?&single)|(?&double)|(?&notquotes)|(?R))*
   &r\k<type>quo;       # Closing quote of the correct type
   (?&notquotes)        # Any trailing non-quote chars
)++                   # Repeat non-capture group
$                     # End of string
~";

$string = "&ldquo;He said  &rdquo; &lsquo;He said  &rsquo;";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. &lsquo;Whos on the &ldquo;move?&rdquo; &rsquo;</p>";
check_string($string);
$string = "<p>&ldquo;He said he&rsquo; coming afer you,&rdquo; Harry said, and he&rsquo; bringing the boys too!&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;He &lsquo;said he&rsquo; coming afer you,&rdquo; Harry said, and he&ldquo; bringing the boys too!&rdquo;</p>";
check_string($string);


function check_string($string) {
    global $balanced_string_regex;
    echo (preg_match($balanced_string_regex, $string)) ?
        "Balanced!\n" :
        " Nah... Not Balanced.\n" ;
}

Output

Balanced!
 Nah... Not Balanced.
Balanced!
 Nah... Not Balanced.
Balanced!
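
For engines without recursion or (?(DEFINE)) support (Python's built-in re module is one), the same kind of balance check can be approximated with a stack instead of a recursive pattern. The following is only a rough sketch, not a translation of the regex above: the names QUOTE_ENTITY and is_balanced are made up for illustration, and it assumes a string containing no quote entities at all should count as balanced, which may differ from the pattern above.

```python
import re

# Matches the four quote entities; group 1 is the side (l/r), group 2 the type (s/d).
QUOTE_ENTITY = re.compile(r"&([lr])([sd])quo;")


def is_balanced(text: str) -> bool:
    """Return True if every opening quote entity is closed by one of the same type."""
    stack = []
    for match in QUOTE_ENTITY.finditer(text):
        side, kind = match.group(1), match.group(2)
        if side == "l":
            # Opening quote: remember its type (single or double).
            stack.append(kind)
        elif not stack or stack.pop() != kind:
            # Closing quote with nothing open, or closing the wrong type.
            return False
    # Anything left on the stack is an opening quote that was never closed.
    return not stack


samples = [
    "&ldquo;He said  &rdquo; &lsquo;He said  &rsquo;",
    "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>",
    "<p>&ldquo;He &lsquo;said he&rsquo; coming afer you,&rdquo; Harry said, and he&ldquo; bringing the boys too!&rdquo;</p>",
]
for sample in samples:
    print("Balanced!" if is_balanced(sample) else "Not balanced.")
```

Note that an apostrophe encoded as &rsquo; (as in Who&rsquo;s) is counted as an unmatched closing single quote, so OCR text with such apostrophes is reported as unbalanced here as well.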

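Replacing the Missing Quotes

As I pointed out in the comments, IMO replacing missing quotes is dangerous: before or after which word does the missing quote fall? If there is any nesting, can we be sure we have correctly identified the missing quote? For that reason, if you are going to do anything, my inclination would be to match the balanced portions (hoping they are correct) and remove any extra quotes.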

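The pattern above lends itself to all kinds of tweaks. For instance, in this regex demo, we match and replace unbalanced quotes. Since that is what was asked for, I will, somewhat reluctantly, offer a second potential tweak: this one inserts a missing left quote at the start of the phrase that precedes an unmatched right quote.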
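
The idea of inserting a missing left quote can also be sketched outside a single regex. The Python below is a hedged illustration, not the pattern behind the linked demo: it handles double quotes only, it borrows the phrase-boundary heuristic from the question (". ", ", " or ">"), and the names QUOTE, BOUNDARY and insert_missing_open_quotes are hypothetical. The caveats above still apply: the guessed position may be wrong.

```python
import re

QUOTE = re.compile(r"&([lr])dquo;")
# Assumed phrase boundaries, taken from the question's own pattern.
BOUNDARY = re.compile(r"\. |, |>")


def insert_missing_open_quotes(text: str) -> str:
    """Insert &ldquo; before the phrase preceding each unmatched &rdquo;."""
    out = []
    depth = 0  # number of currently open double quotes
    last = 0   # start of the text not yet copied to the output
    for match in QUOTE.finditer(text):
        segment = text[last:match.start()]
        if match.group(1) == "l":
            depth += 1
        elif depth > 0:
            depth -= 1
        else:
            # Unmatched closing quote: cut after the last boundary in the
            # segment, or at the segment start if there is no boundary.
            cut = 0
            for boundary in BOUNDARY.finditer(segment):
                cut = boundary.end()
            segment = segment[:cut] + "&ldquo;" + segment[cut:]
        out.append(segment)
        out.append(match.group(0))
        last = match.end()
    out.append(text[last:])
    return "".join(out)


line = ("<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up "
        "his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>")
print(insert_missing_open_quotes(line))
```

Because the text before the unmatched &rdquo; is captured as an ordinary string slice rather than inside a zero-length lookbehind, nothing is duplicated in the output. For the sample line, the &ldquo; lands directly before Who&rsquo;s.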