Ruby regex: ignore quotes if a colon precedes them

Asked: 2011-02-13 12:39:15

Tags: ruby regex

I'm trying to write a Ruby regex that captures quoted phrases, but not those that have a ":" before them. For example:

  Obama: "Yes, we can!"

should be ignored.

I wrote some tests:

http://rubular.com/r/OJmkLd68gc

2 answers:

Answer 0 (score: 4)

EDIT: Made a few more tweaks.

Depending on exactly what your input looks like, this works for ASCII:

 (?<! [:\s] ) \s* ( ["'] ) (?: (?! \1 ) . )+ \1
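As a rough Ruby sketch of the ASCII pattern above: the constant name QUOTED and the sample lines are illustrative assumptions, not part of the original answer. Ruby spells Perl's /s (dot matches newline) as /m, and the spaces are simply dropped here instead of relying on /x.

```ruby
# Hypothetical Ruby port of the ASCII pattern; QUOTED is an illustrative name.
# /m lets "." match newlines, which is what Perl's /s does.
QUOTED = /(?<![:\s])\s*(["'])(?:(?!\1).)+\1/m

samples = [
  '"Take off, hoser!"',
  'Dorothy Parker: "Brevity is the soul of lingerie."',
  'Larry Wall said, "Take what you get."'
]

samples.each do |line|
  status = QUOTED.match?(line) ? "MATCH " : "IGNORE"
  puts "#{status} #{line}"
end
```

The lookbehind rejects a quote whose preceding whitespace run starts right after a colon, so only the first and third samples should be reported as matches.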

For Unicode "matched" quotes, you have to be more careful in your pairing, perhaps along these lines:

(?xs) (?<!:) \s+ 
  (?: ( ["'] ) (?: (?! \1 ) . )+ \1
    | “ .*? ”    # English etc
    | ‘ .*? ’   
    | « .*? »    # French, Spanish, Italian
    | ‹ .*? ›
    | „ .*? “    # German, Icelandic, Romanian
    | ‚ .*? ‘
    | „ .*? ”    # Hungarian
    | ” .*? ”    # Swedish
    | ’ .*? ’    
    | » .*? «    # Danish, Hungarian
    | › .*? ‹
    | 「 .*? 」   # Japanese, Chinese
    | 『 .*? 』  
  )
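A hedged Ruby rendering of the paired-quote pattern might look like the sketch below. The name PAIRED and the sample strings are assumptions made for illustration; Ruby has no (?s) flag in its own regex syntax, so /m stands in for it and /x allows the layout and comments.

```ruby
# Hypothetical Ruby version of the paired-quote pattern; PAIRED is an
# illustrative name. Assumes a UTF-8 source file (Ruby's default).
PAIRED = /
  (?<!:) \s+
  (?: (["']) (?: (?!\1) . )+ \1
    | “ .*? ”    # English etc
    | ‘ .*? ’
    | « .*? »    # French, Spanish, Italian
    | ‹ .*? ›
    | „ .*? “    # German, Icelandic, Romanian
    | ‚ .*? ‘
    | „ .*? ”    # Hungarian
    | ” .*? ”    # Swedish
    | ’ .*? ’
    | » .*? «    # Danish, Hungarian
    | › .*? ‹
    | 「 .*? 」   # Japanese, Chinese
    | 『 .*? 』
  )
/mx

[
  "Larry Wall said, “I don't know.”",
  "Larry Wall said: “I don't know.”",
  "Il a dit «bonjour» hier"
].each do |line|
  found = line[PAIRED]
  puts found ? "quote #{found.inspect}" : "IGNORE #{line.inspect}"
end
```

Two properties of this form are worth knowing: because it requires \s+, a quote at the very start of the string is not matched, and because the lookbehind only rejects a colon, two or more spaces after a colon can still match. The [:\s] lookbehind in the ASCII pattern avoids the second problem.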

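You can read more about the quotation marks used in various languages here.

Here is a test program in Perl, but the principles should carry over to Ruby just fine: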

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw[ :std IO :utf8 ];
while (<DATA>) {
    print if / (?<! [:\s] ) \s* ( ["'] ) (?: (?! \1 ) . )+ \1/sx;
}
__END__
"Take off, hoser!"
Dorothy Parker:Brevity is the soul of lingerie.
Dorothy Parker:"Brevity is the soul of lingerie."
Dorothy Parker: "Brevity is the soul of lingerie."
Dorothy Parker:  "Brevity is the soul of lingerie."
Larry Wall: I don't know if it's what you want, but it's what you get. :-)
Larry Wall said, "I don't know if it's what you want, but it's what you get. :-)"
Larry Wall said: “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said:   “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said, “I don't know if it's what you want, but it's what you get. :-)”
Boss: And what's that "goto" doing there?!?
Hacker: Er, I guess my finger slipped when I was typing "getservbyport"...
‘Nevermore!’ quoth the raven.
Quoth the raven: ‘Nevermore!’
'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c: "I wish I had never come here, and I don't want to see no more magic," he said, and fell silent.
src/perl/mg.c: 'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c => "I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent."
‘I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.’
“I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.”

Output:

"Take off, hoser!"
Larry Wall: I don't know if it's what you want, but it's what you get. :-)
Larry Wall said, "I don't know if it's what you want, but it's what you get. :-)"
Larry Wall said: “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said:   “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said, “I don't know if it's what you want, but it's what you get. :-)”
Boss: And what's that "goto" doing there?!?
Hacker: Er, I guess my finger slipped when I was typing "getservbyport"...
'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c: 'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c => "I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent."

That may look "wrong", but it's because of the embedded quotes. Here is a more thorough version that illustrates the issue better:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw[ :std IO :utf8 ];
while (<DATA>) {
    chomp;    
    my $bingo = m{
        (?<! [:\s] ) \s*
        (?: (?<= ^  )
          | (?<= \s )
        )
        (?: ( ["'] ) (?: (?! \1 ) . )+ \1
          | “ .*? ”    # English etc
          | ‘ .*? ’
        )
    }sx;

    if ($bingo) {
        printf("Line %2d, quote 「%s」\n",   $., $&);
        printf(" " x 7 . "in line 『%s』\n", $_);
    } else {
        printf("Line %2d IGNORE 『%s』\n", $., $_);
    }    
}    
__END__
"Take off, hoser!"
Dorothy Parker:Brevity is the soul of lingerie.
Dorothy Parker:"Brevity is the soul of lingerie."
Dorothy Parker: "Brevity is the soul of lingerie."
Dorothy Parker:  "Brevity is the soul of lingerie."
Larry Wall: I don't know if it's what you want, but it's what you get. :-)
Larry Wall said, "I don't know if it's what you want, but it's what you get. :-)"
Larry Wall said: “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said:   “I don't know if it's what you want, but it’s what you get. :-)”
Larry Wall said, “I don't know if it's what you want, but it's what you get. :-)”
Boss: And what's that "goto" doing there?!?
Hacker: Er, I guess my finger slipped when I was typing "getservbyport"...
‘Nevermore!’ quoth the raven.
Quoth the raven: ‘Nevermore!’
'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c: "I wish I had never come here, and I don't want to see no more magic," he said, and fell silent.
src/perl/mg.c: 'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.
src/perl/mg.c => "I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent."
‘I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.’
“I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.”

Its output is:

Line  1, quote 「"Take off, hoser!"」
       in line 『"Take off, hoser!"』
Line  2 IGNORE 『Dorothy Parker:Brevity is the soul of lingerie.』
Line  3 IGNORE 『Dorothy Parker:"Brevity is the soul of lingerie."』
Line  4 IGNORE 『Dorothy Parker: "Brevity is the soul of lingerie."』
Line  5 IGNORE 『Dorothy Parker:  "Brevity is the soul of lingerie."』
Line  6 IGNORE 『Larry Wall: I don't know if it's what you want, but it's what you get. :-)』
Line  7, quote 「 "I don't know if it's what you want, but it's what you get. :-)"」
       in line 『Larry Wall said, "I don't know if it's what you want, but it's what you get. :-)"』
Line  8 IGNORE 『Larry Wall said: “I don't know if it's what you want, but it’s what you get. :-)”』
Line  9 IGNORE 『Larry Wall said:   “I don't know if it's what you want, but it’s what you get. :-)”』
Line 10, quote 「 “I don't know if it's what you want, but it's what you get. :-)”」
       in line 『Larry Wall said, “I don't know if it's what you want, but it's what you get. :-)”』
Line 11, quote 「 "goto"」
       in line 『Boss: And what's that "goto" doing there?!?』
Line 12, quote 「 "getservbyport"」
       in line 『Hacker: Er, I guess my finger slipped when I was typing "getservbyport"...』
Line 13, quote 「‘Nevermore!’」
       in line 『‘Nevermore!’ quoth the raven.』
Line 14 IGNORE 『Quoth the raven: ‘Nevermore!’』
Line 15, quote 「'I wish I had never come here, and I don'」
       in line 『'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.』
Line 16 IGNORE 『src/perl/mg.c: "I wish I had never come here, and I don't want to see no more magic," he said, and fell silent.』
Line 17 IGNORE 『src/perl/mg.c: 'I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent.』
Line 18, quote 「 "I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent."」
       in line 『src/perl/mg.c => "I wish I had never come here, and I don't want to see no more magic,' he said, and fell silent."』
Line 19, quote 「‘I wish I had never come here, and I don’」
       in line 『‘I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.’』
Line 20, quote 「“I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.”」
       in line 『“I wish I had never come here, and I don’t want to see no more magic,’ he said, and fell silent.”』
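For readers who want to try the same idea in Ruby, here is a minimal sketch of a port. BINGO, find_quote, and the shortened sample list are hypothetical names and data, and \A replaces the Perl program's (?<= ^ ) so the sketch does not depend on how Ruby treats an anchor inside a lookbehind; for single-line input the two should behave the same.

```ruby
# Hypothetical Ruby port of the second Perl program's pattern.
BINGO = /
  (?<![:\s]) \s*          # the whitespace run may not start after a colon
  (?: \A | (?<=\s) )      # the quote opens at string start or after a space
  (?: (["']) (?: (?!\1) . )+ \1
    | “ .*? ”
    | ‘ .*? ’
  )
/mx

# Returns the matched text (including leading whitespace) or nil.
def find_quote(line)
  line[BINGO]
end

samples = [
  '"Take off, hoser!"',
  'Dorothy Parker: "Brevity is the soul of lingerie."',
  %q{Boss: And what's that "goto" doing there?!?},
  "Quoth the raven: ‘Nevermore!’",
  "'I wish I had never come here, and I don't want to see no more magic,' he said."
]

samples.each_with_index do |line, i|
  quote = find_quote(line)
  if quote
    printf("Line %2d, quote 「%s」\n", i + 1, quote)
  else
    printf("Line %2d IGNORE 『%s』\n", i + 1, line)
  end
end
```

The last sample shows the embedded-quote issue: the apostrophe in "don't" is taken as the closing quote, so the match stops there.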

Also, there is a standard Unicode derived property for these, \p{Quotation_Mark}, or \p{QMark} for short, but Ruby doesn't support it. You can list all of them with the unichars script:

$ unichars '\p{qmark}'
 "    34 0022 QUOTATION MARK
 '    39 0027 APOSTROPHE
 «   171 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
 »   187 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 ‘  8216 2018 LEFT SINGLE QUOTATION MARK
 ’  8217 2019 RIGHT SINGLE QUOTATION MARK
 ‚  8218 201A SINGLE LOW-9 QUOTATION MARK
 ‛  8219 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
 “  8220 201C LEFT DOUBLE QUOTATION MARK
 ”  8221 201D RIGHT DOUBLE QUOTATION MARK
 „  8222 201E DOUBLE LOW-9 QUOTATION MARK
 ‟  8223 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
 ‹  8249 2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
 ›  8250 203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
 「 12300 300C LEFT CORNER BRACKET
 」 12301 300D RIGHT CORNER BRACKET
 『 12302 300E LEFT WHITE CORNER BRACKET
 』 12303 300F RIGHT WHITE CORNER BRACKET
 〝 12317 301D REVERSED DOUBLE PRIME QUOTATION MARK
 〞 12318 301E DOUBLE PRIME QUOTATION MARK
 〟 12319 301F LOW DOUBLE PRIME QUOTATION MARK
 ﹁ 65089 FE41 PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
 ﹂ 65090 FE42 PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
 ﹃ 65091 FE43 PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
 ﹄ 65092 FE44 PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
 ＂ 65282 FF02 FULLWIDTH QUOTATION MARK
 ＇ 65287 FF07 FULLWIDTH APOSTROPHE
 ｢ 65378 FF62 HALFWIDTH LEFT CORNER BRACKET
 ｣ 65379 FF63 HALFWIDTH RIGHT CORNER BRACKET
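Where \p{Quotation_Mark} is not available, one workaround is to spell the listed code points out as a character class. This is only a sketch: QMARK and quotation_marks are hypothetical names, and the class covers exactly the 29 code points shown in the listing above, nothing more.

```ruby
# Hand-built stand-in for \p{Quotation_Mark}, using the code points
# from the unichars listing. QMARK is an illustrative name.
QMARK = /[\u0022\u0027\u00AB\u00BB\u2018-\u201F\u2039\u203A\u300C-\u300F\u301D-\u301F\uFE41-\uFE44\uFF02\uFF07\uFF62\uFF63]/

# Returns every quotation-mark character found in text, in order.
def quotation_marks(text)
  text.scan(QMARK)
end

p quotation_marks("\u00ABoui\u00BB and \u201Cyes\u201D")
p quotation_marks("no quotes here")
```

A class like this only tells you that a character is some kind of quotation mark; it does not pair openers with closers, so the alternation approach shown earlier is still needed for matching whole phrases.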

You can use the uniprops script to list all the properties of any code point:

$ uniprops -a 2018
U+2018 ‹‘› \N{ LEFT SINGLE QUOTATION MARK }:
    \pP \p{Pi}
    All Any Assigned InGeneralPunctuation Case_Ignorable CI Common Zyyy Pi P General_Punctuation Gr_Base Grapheme_Base Graph GrBase Initial_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation QMark Quotation_Mark X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=A East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=QU Line_Break=Quotation LB=QU Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=CL Sentence_Break=Close SB=CL Word_Break=MB Word_Break=MidNumLet WB=MB _Case_Ignorable _X_Begin

Answer 1 (score: 2)

Here you go, I think: http://rubular.com/r/hFylsgU3OT

 ^[^:]*"(.*?)"$
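A small sketch of how this pattern could be used from Ruby follows; UNPREFIXED and quoted_phrase are hypothetical names, and the sample strings are made up for illustration.

```ruby
# The pattern from this answer, wrapped in a helper.
# UNPREFIXED and quoted_phrase are illustrative names.
UNPREFIXED = /^[^:]*"(.*?)"$/

# Returns the text between the quotes, or nil when a colon precedes them.
def quoted_phrase(line)
  line[UNPREFIXED, 1]
end

p quoted_phrase('"Yes we can!"')
p quoted_phrase('He said "hi"')
p quoted_phrase('Obama: "Yes we can!"')
```

Because of the trailing $, this form only matches when the closing quote ends the line, and it handles ASCII double quotes only.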

BTW, this is the perfect way to ask a regex question: examples, a link, and a clear explanation.