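How do I capture only the surnames from a regex pattern?

Time: 2013-03-07 19:27:27

Tags: regex perl

I wrote a Perl program that validates the format (punctuation and so on) of surnames, forenames, and years. If an entry does not follow the specified pattern, it is flagged so that it can be fixed.

For example, my input file contains lines of text like this: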

<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

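My program works well: if any entry does not follow the pattern, the script reports an error. The input text above produces no error. The line below, however, is an example of an error, because the comma after the surname Rose is missing in "Rose A. J.":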

NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., &amp; Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey &amp; D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>

Is it possible to capture all the surnames and the year from my regex search pattern, so that I can generate a prefix for each line, like this?

<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

My regex search script is as follows:

while(<$INPUT_REF_XML_FH>){
    $line_count += 1;
    chomp;
    if(/

    # bibliomixed XML ID tag and attribute----<START>
    <bibliomixed
    \s+
    id=".*?">
    # bibliomixed XML ID tag and attribute----<END>

    # --------2 OR MORE AUTHOR GROUP--------<START>
    (?:
    (?:
    # pattern for surname----<START>
    (?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}|\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}|\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>
    ,\s)+
    #---------------FINAL AUTHOR GROUP SEPARATOR----<START>
    &amp;\s
    #---------------FINAL AUTHOR GROUP SEPARATOR----<END>

    # --------2 OR MORE AUTHOR GROUP--------<END>
    )? 

    # --------LAST AUTHOR GROUP--------<START>

    # pattern for surname----<START>
    (?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}|\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}|\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>

    (?: # pattern for editor notation----<START>
    \s\(Ed(?:s)?\.\)\.
    )? # pattern for editor notation----<END>

    # --------LAST AUTHOR GROUP--------<END>
    \s
    \(
    # pattern for a year----<START>
    (?:[A-Za-z]+,\s)? # July, 1999
    (?:[A-Za-z]+\s)? # July 1999
    (?:[0-9]{4}\/)? # 1999\/2000
    (?:\w+\s\d+,\s)?# August 18, 2003
    (?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
    (?:[A-Za-z])? # 1999a
    (?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
    (?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
    (?:,\s[A-Za-z]+)? # 1999, Spring
    (?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
    (?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
    (?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
    # pattern for a year----<END>
    \)\.
    /six){
        print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
        $found_count += 1;
    } else{
        print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
        $not_found_count += 1;
    }
}

Thanks for your help,

炳廷

2 answers:

Answer 0: (score: 0)

Change this:

# pattern for surname----<END>
    ,?\s

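This now means an optional comma followed by whitespace. It won't work if the person's surname is "Bunga Bunga".

Answer 1: (score: 0)

All of your subpatterns are non-capturing groups, which start with (?:. Non-capturing groups reduce the work the regex engine has to do, partly because the text matched by the subpattern is not saved.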

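To capture part of the pattern, you only need to put parentheses around the part you want captured. So you can either remove the non-capturing marker ?: or add parens () wherever you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings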
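One caveat when adding parentheses to the pattern in the question: the author sub-pattern sits inside a repeated group, and a capturing group inside a quantified group keeps only its last match, so a single pass cannot return every surname. A minimal sketch of a two-step approach follows, assuming the line has already passed the validation regex. The helper name bib_prefix and both simplified patterns are assumptions made for illustration; they are not the asker's pattern, and they only handle dates that contain a four-digit year.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the <BIB>...</BIB> prefix for one validated line.
# Both patterns are deliberately simplified stand-ins, not the full
# validation regex from the question.
sub bib_prefix {
    my ($line) = @_;

    # Step 1: capture the author list (everything between the opening tag
    # and the parenthesised date) and the first 4-digit year in the date.
    my ($authors, $year) =
        $line =~ /<bibliomixed\s+id="[^"]*">(.*?)\s\([^)]*?(\d{4})[^)]*\)\./
        or return;

    # Step 2: in list context, a //g match returns every capture.
    # A surname is taken to be the text just before ", " plus an initial.
    my @surnames = $authors =~ /([^\s,;&][^,;&]*?),\s(?=[A-Z]\.)/g;

    return '<BIB>' . join(', ', @surnames, $year) . '</BIB>';
}

my $line = '<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., '
         . '&amp; Machado, A. (2008). Sexual satisfaction.</bibliomixed>';
print bib_prefix($line), $line, "\n";
```

Running the strict validation first and this looser extraction second keeps the two concerns apart: the extraction patterns can stay short because malformed lines have already been rejected.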

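I'm not sure, but from your code I think you may be trying to use lookahead assertions: for example, you test for a surname with spaces and, if there isn't one, for a surname with hyphens. These tests do not start from the same point each time. The regex matches the first subpattern, then moves on and tests the second surname pattern at the next position. Whether the regex would then test a second name against the first subpattern, I'm not sure. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind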
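The point about position can be checked with a small experiment. This is a minimal sketch using two made-up, simplified surname patterns (assumptions for illustration, not the asker's full regex): optional groups written one after another are tried in sequence, each starting where the previous one stopped, whereas the branches of an alternation are all tried at the same position.

```perl
use strict;
use warnings;

# Simplified stand-ins for the surname sub-patterns (illustrative only).
my $sequence    = qr/^(?:\w+\s)?(?:\w+-)?\w+$/;   # space part, THEN hyphen part
my $alternation = qr/^(?:\w+\s|\w+-)?\w+$/;       # space part OR hyphen part

for my $name ('Van Dyke', 'Afif-Abdo', 'Van Afif-Abdo', 'Afif-Abdo Van') {
    printf "%-14s sequence: %-3s alternation: %s\n",
        $name,
        ($name =~ $sequence    ? 'yes' : 'no'),
        ($name =~ $alternation ? 'yes' : 'no');
}
```

Only "Van Afif-Abdo" separates the two patterns: the sequence accepts it because the hyphen part is tried after the space part has been consumed, while the alternation allows just one of the two prefixes. "Afif-Abdo Van" is rejected by both, which shows that the order of the optional groups matters.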

#!/usr/bin/perl

use warnings;
use strict;


my $line = '123 456 7antelope89';

$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;

my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);


$line = '123 456 7bealzelope89';

$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;

($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);


$line = '123 456 7canteloupe89';

$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+?)\d+)?/;  # lazy \w+? keeps the trailing digits out of $2

($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

undef for ($ay,$be,$1,$2);

exit 0;

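The first pattern in the third example makes no sense as a way to capture a whole pattern, because it tells the regex engine not to capture the group while also capturing it; plain parentheses would do the same job. Where this is useful is the second pattern, which is a fine-grained capture: the captured subpattern is part of a larger non-capturing group.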

a: 123 456 b: 7antelope89
a: nocapture b: nocapture 
a: 123 456 b: canteloupe

One small nitpick:

  id=".*?" 

might be better as

  id="\w*?"

IIRC, id names need to be alphanumeric plus underscore.
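A minimal sketch of why the tighter pattern can matter. The tag below, with its extra role attribute, is a hypothetical input and does not come from the asker's file. A lazy .*? stops at the first place where the rest of the pattern can match, which is not necessarily the first closing quote.

```perl
use strict;
use warnings;

# Hypothetical tag with a second attribute, to show the difference.
my $tag = '<bibliomixed id="bkrmbib5" role="x">';

my ($loose)  = $tag =~ /id="(.*?)">/;    # lazy dot can still cross a quote
my ($strict) = $tag =~ /id="(\w*?)">/;   # word characters only: no match here

print 'loose:  ', (defined $loose  ? $loose  : '(no match)'), "\n";
print 'strict: ', (defined $strict ? $strict : '(no match)'), "\n";
```

Here the loose pattern captures the id together with the start of the next attribute, while the strict one refuses to match at all. If the ids may contain characters outside \w, such as a hyphen, a negated class like [^"]* is another common choice.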