Question

In input.txt:

Ken, Robert. (1994). Lessons from Hull House for the contemporary urban university 2008. Social Service Review, 68(3), 299-321.

Robert, John. 1994. Lessons from Hull House for the contemporary urban university 2008. Social Service Review.

Desired output.txt:

Ken, Robert. (<y>1994</y>). Lessons from Hull House for the contemporary urban university 2008. Social Service Review, 68(3), 299-321.

Robert, John. <y>1994</y>. Lessons from Hull House for the contemporary urban university 2008. Social Service Review.

I tried the following code, but only the last occurrence of a year on each line gets tagged. Can someone help me fix this?

print "Enter the exp file name without extension: ";
chomp($filename = <STDIN>);
open(RED, "$filename.txt") || die "Could not open EXP file";
open(WRIT, ">$filename.html");

while(<RED>) {
    if(/(.+)(\d{4})/) {
        s/(.+)(\d{4})/$1<y>$2<\/y>/g;
    }
print WRIT $_;
}
close(RED);
close(WRIT);

Answer 1

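Your regex is greedy, so only the last year on each line is matched. Adding ? makes the + quantifier non-greedy (it matches as little as possible):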

if (/(.+?)(\d{4})/) {
   s/(.+?)(\d{4})/$1<y>$2<\/y>/g;
}

As a side note, you can simplify the code above to

s/(\d{4})/<y>$1<\/y>/g;
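To make the difference concrete, here is a small sketch that runs all three substitutions on a shortened version of the first citation. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Shortened sample line (an assumption, not the asker's exact data).
my $line = "Ken, Robert. (1994). Lessons from Hull House 2008. Review, 68(3), 299-321.";

# Greedy: .+ runs to the end of the line and backtracks only far enough
# to leave four digits, so only the last four-digit run is tagged.
(my $greedy = $line) =~ s/(.+)(\d{4})/$1<y>$2<\/y>/g;

# Non-greedy: .+? stops at the first place where four digits can follow.
(my $lazy = $line) =~ s/(.+?)(\d{4})/$1<y>$2<\/y>/g;

# No prefix capture at all: every run of four digits is tagged.
(my $plain = $line) =~ s/(\d{4})/<y>$1<\/y>/g;

print "$greedy\n$lazy\n$plain\n";
```

The greedy version tags only 2008, while the other two tag both 1994 and 2008. Note that (.+?) still needs at least one character before the digits, so a year at the very start of a line would be missed by it but not by the plain \d{4} form.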

Answer 2

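There is no need to first match a year and then substitute it.

There is no need to capture what you are not matching.

There is, however, a need to make sure you are talking about legitimate years: four digits long, and perhaps within the last or the current century.

The simplest way to say that is, of course, the way you never want to use: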

# DO NOT USE THIS: IT IS ILLEGIBLE!!
s{(\b(?=19|20)\d{4}\b)}{<y>$1</y>}g;

Instead, you should break it up so that it is easier to read:

s{
    (                   # save in numbered buffer $1
        \b              # word-break
        (?= 19 | 20)    # next two chars must be either 19 or 20
        \d{4}           # the year proper
        \b              # word break
    )                   # end of numbered capture $1
}{<y>$1</y>}gx;

If you are running Perl v5.10 or later, you can use named captures instead of just numbered ones:

s{
    (?<YEAR>            # save in named buffer "year"
        \b              # word-break
        (?= 19 | 20)    # next two chars must be either 19 or 20
        \d{4}           # the year proper
        \b              # word break
    )                   # end of named capture "year"
}{<y>$+{YEAR}</y>}gx;

If the replacement part looks too cramped, you can also use:

s{
    (?<YEAR>            # save in named buffer "year"
        \b              # word-break
        (?= 19 | 20)    # next two chars must be either 19 or 20
        \d{4}           # the year proper
        \b              # word break
    )                   # end of named capture "year"
}{
    "<y>"       .
    $+{YEAR}    .
    "</y>"
}egx;

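Finally, you should know that \d matches any code point with the Numeric_Type=Decimal character property, not just the ASCII digits. So to avoid false positives, you may want to replace \d with [0-9]: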

s{
    (?<YEAR>            # save in named buffer "year"
        \b              # word-break
        (?= 19 | 20)    # next two chars must be either 19 or 20
        [0-9]{4}        # the year proper
        \b              # word break
    )                   # end of named capture "year"
}{
    "<y>"       .
    $+{YEAR}    .
    "</y>"
}egx;

Or, if you are running Perl v5.14 or later, you can use the /a option:

s{
    (?<YEAR>            # save in named buffer "year"
        \b              # word-break
        (?= 19 | 20)    # next two chars must be either 19 or 20
        \d{4}           # the year proper
        \b              # word break
    )                   # end of named capture "year"
}{
    "<y>"       .
    $+{YEAR}    .
    "</y>"
}egxa;
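As a rough illustration of the false positive described above, the sketch below tests a string made of "19" followed by two ARABIC-INDIC DIGIT characters. The test string is an invented example, and the /a variant assumes Perl v5.14 or later.

```perl
use strict;
use warnings;

# "19" followed by U+0669 and U+0664 (ARABIC-INDIC DIGIT NINE and FOUR),
# written with escapes so the source file itself stays ASCII.
my $mixed = "19\x{0669}\x{0664}";

# Plain \d accepts the non-ASCII digits; [0-9] and \d under /a do not.
my $unicode_d = $mixed =~ /\b(?=19|20)\d{4}\b/    ? 1 : 0;
my $ascii_set = $mixed =~ /\b(?=19|20)[0-9]{4}\b/ ? 1 : 0;
my $a_flag    = $mixed =~ /\b(?=19|20)\d{4}\b/a   ? 1 : 0;

print "\\d: $unicode_d, [0-9]: $ascii_set, \\d with /a: $a_flag\n";
```

Only the first pattern matches, which is why the [0-9] and /a variants are the safer choice when the input may contain non-ASCII text.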

If you think other centuries apply, it is easy to modify the lookahead that restricts which centuries are allowed.
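For example, a minimal sketch that also accepts years starting with 18 could look like this. The sample text is invented purely to show which numbers are tagged and which are skipped.

```perl
use strict;
use warnings;

# Invented sample text: three plausible years plus some non-year numbers.
my $text = "Founded 1889, reviewed 1994 and 2008; ISBN 1234, pages 3001, id 20081.";

(my $tagged = $text) =~ s{
    (                       # save in numbered buffer $1
        \b                  # word break
        (?= 18 | 19 | 20 )  # next two chars must be 18, 19 or 20
        [0-9]{4}            # the year proper, ASCII digits only
        \b                  # word break
    )                       # end of numbered capture $1
}{<y>$1</y>}gx;

print "$tagged\n";
```

Here 1889, 1994 and 2008 are tagged, while 1234 and 3001 fail the lookahead and 20081 fails the trailing word break because it has five digits.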

Answer 3

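What is tripping you up is that regex matching is greedy. That means .+ grabs as much as it possibly can, leaving only just enough to satisfy the second group.

So it only ever matches once per line: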

Ken, Robert. (1994). Lessons from Hull House for the contemporary urban university 2008. Social Service Review, 68(3), 299-321.

.+ will match everything before 2008 (including (1994)).

You need to use a non-greedy match. As stated in perlre:

+?        Match 1 or more times, not greedily

So try:

(.+?)(\d{4})

Edit: As noted in the comments, capturing (.+) is redundant, and so is the conditional. So the code becomes:

while (<DATA>) {
    s/(\d{4})/<y>$1<\/y>/g;
    print;
}

__DATA__
Ken, Robert. (1994). Lessons from Hull House for the contemporary urban university 2008. Social Service Review, 68(3), 299-321.
Robert, John. 1994. Lessons from Hull House for the contemporary urban university 2008. Social Service Review.

Also:

Turn on use strict; and use warnings;.
Beware of opening files based on user input. It is dangerous if you do not sanitize that input.
Three-argument open is a good idea.
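Putting those three points together, one possible shape for the asker's script is sketched below. The helper names tag_years_in_file and is_safe_basename are hypothetical, and the whitelist of allowed filename characters is an assumed policy rather than anything from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, wrapping years in <y> tags.
sub tag_years_in_file {
    my ($in_path, $out_path) = @_;

    # Three-argument open with lexical filehandles; the mode is explicit,
    # so characters such as '>' or '|' in a name cannot change it.
    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot open $out_path: $!";

    while (my $line = <$in>) {
        $line =~ s/\b((?:19|20)[0-9]{2})\b/<y>$1<\/y>/g;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Hypothetical check: accept only a conservative set of characters in the
# user-supplied base name (an assumed policy; adjust to your needs).
sub is_safe_basename {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z0-9_-]+\z/ ? 1 : 0;
}
```

After reading and chomping the base name from STDIN, the script would call tag_years_in_file("$name.txt", "$name.html") only when is_safe_basename($name) returns true, and die with a message otherwise.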

Finding a year in a string using a regular expression in Perl
