Parsing a specific type of string in Perl

Time: 2012-05-14 06:07:58

Tags: regex string perl parsing

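I have strings of the following type (the quotes indicate that each one is on a single line):

"AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England"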

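I want to obtain everything after the title (the portion that is in all caps). So I want to get:

"Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England"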

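I have many more strings than these two, but the basic format is the same: the title of the invention is always in uppercase letters and digits.

Is there a way to do this in Perl with a regular expression?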

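5 answers:

Answer 0: (score: 1)

If it doesn't need to be 100% accurate, I would just look for the first uppercase letter that is followed by a lowercase letter, and then grab the rest of the line.

Something like this (my Perl is a bit rusty, so forgive any syntax errors):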

my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;   # list context returns the capture, not a success flag
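A minimal runnable sketch of this idea, assuming the title never contains an uppercase letter immediately followed by a lowercase one. The helper name `after_title` is hypothetical and only there for illustration. The parentheses on the left-hand side matter: in list context the match returns the captured group, whereas in scalar context it returns only whether the match succeeded.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything from the first "uppercase + lowercase"
# pair onward, or undef when the line contains no such pair.
sub after_title {
    my ($full_line) = @_;
    # The list assignment makes the match return the capture itself.
    my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;
    return $part_of_line;
}

print after_title("PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden"), "\n";
```

This is only approximate: a name that starts with an initial or an apostrophe, such as "D.Clark", would come back as "Clark", because the first uppercase-plus-lowercase pair is "Cl".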

Answer 1: (score: 0)

Try this:

$text = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ";

if($text =~ m/(\b[A-Z0-9-, ]+)\b(.*)/) {
    print "$2";
}

Answer 2: (score: 0)

I tried this and got the output you are expecting:

if($ip =~ m/([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/)
{
      print "$2";
}
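To see what each group captures, here is a small sketch built on the same pattern. The helper name `split_title`, the leading `^` anchor, and the trimming of the trailing space are assumptions added for illustration. The greedy first group initially swallows the first letter of the name, then backtracks one character so that the second group can match an uppercase letter followed by lowercase letters.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into (title, remainder).
# Returns an empty list when the line does not fit the expected format.
sub split_title {
    my ($line) = @_;
    if ($line =~ /^([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/) {
        my ($title, $rest) = ($1, $2);
        $title =~ s/\s+$//;    # drop the space that separated title and names
        return ($title, $rest);
    }
    return;
}

my ($title, $rest) = split_title("AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum");
print "title: $title\n";
print "rest:  $rest\n";
```

Returning both parts keeps the title available in case it is needed later, at the cost of one extra substitution to trim it.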

Answer 3: (score: 0)

The title always ends with an uppercase letter followed by a space, so this should work:

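# The prefix is limited to title characters so the greedy match cannot run past the title
print $1 if /^[A-Z0-9, -]+[A-Z] (.+)$/;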
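One caveat worth checking with any pattern whose prefix is an unrestricted greedy `.+`: it matches as far to the right as it can, so a later uppercase letter followed by a space (such as the `ImperiaI ` in the second sample) can be mistaken for the end of the title. The sketch below compares such a prefix with one restricted to title characters. Both patterns and all variable names are illustrative assumptions, and the sample line is shortened.

```perl
use strict;
use warnings;

my $line = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark, "
         . "assignors to ImperiaI Chemical Industries Limited";

# Unrestricted greedy prefix: it gives back characters only until it finds
# the LAST "uppercase letter + space" in the line.
my ($loose) = $line =~ /^.+[A-Z]+ (.+)$/;

# Prefix limited to uppercase letters, digits, commas, spaces and dashes:
# it cannot extend past the all-caps title.
my ($anchored) = $line =~ /^[A-Z0-9, -]+[A-Z] (.+)$/;

print "loose:    $loose\n";
print "anchored: $anchored\n";
```

The restricted prefix still assumes the title contains nothing but those characters, so titles with other punctuation would need the character class extended.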

Answer 4: (score: 0)

How about:

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;

my $re = qr
    /^                # Start of string
    [\p{Lu}\pN, -]+   # one or more uppercase letter or number or comma or space or dash
    (                 # start group 1
      \p{Lu}[\pL.']   # one uppercase letter followed by any letter or dot or apostrophe
    )                 # end group
    /x;
while(<DATA>) {
    chomp;
    s/$re/$1/g;       # replace match by group 1
    say;
}


__DATA__
AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS O'Connors

Output:

Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
D.Clark
O'Connors