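How to add an underscore followed by a dash after matching a pattern in a Perl regex

Date: 2012-08-13 14:22:16

Tags: regex perl pattern-matching

I have data of this type. Please help me; I am new to regular expressions, so please explain each step in your answer. Thanks.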

7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

From the lines above I want to extract only this data:

7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

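Then, if the part after the first underscore (for example AX1A) contains two consecutive letters, they should be written as AX_. A single digit and a single letter should become -1_ and -A_. So after applying this pattern, AX1A becomes AX_-1_-A_, and all the other data should stay the same.

Likewise, in the next line, W1A starts with a single letter W, which should be converted to -W_. The next character is a single digit, so it should be converted the same way to -1_, and the last character is handled the same way, so the result is -W_-1_-A_

We are only interested in applying this to the part that follows the leading digits and underscore: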

_AX1A_

_W1A_

_U1A_

_AV21NA_

The output should be:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN

7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

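5 answers:

Answer 0 (score: 1)

I don't know all the details of what you need to strip, but I'll extrapolate and let you clarify if this doesn't meet your needs.

First, to remove 1X50_RE_ and 1X50_LI_, you can search for those strings and replace them with nothing.

Next, to break your second letter/digit code into small pieces, you can use a pair of substitutions, each using a look-ahead. However, since you only want to touch the second chunk, I first split the whole line into pieces, process just the second piece, and then join the pieces back together.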

while (<$input>) {

    # Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
    s/1X50_(RE|LI)_//;

    my @pieces = split /_/; # split the line into pieces at each underscore

    # Work only on the second chunk. The /g flag applies the substitution to every match.
    $pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Letter before a digit: AX1A -> AX_-1A
    $pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Digit before a letter: AX_-1A -> AX_-1_-A

    # Join the pieces back together again
    $_ = join '_', @pieces;

    print;
}

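$_ is the variable that many Perl operations work on if you don't specify one. <$input> reads the next line from the filehandle named $input into $_. The s///, split, and print functions operate on $_ if nothing else is given. The =~ operator is how you tell Perl to use $pieces[1] (or whatever variable you are working on) instead of $_ for a regex operation. (For split and print, you pass the variable as an argument, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)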
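
To make the role of $_ concrete, here is a minimal sketch that does the same work twice, once relying on the implicit $_ and once with the variable spelled out. The helper names strip_implicit and strip_explicit are made up for this illustration and are not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper: relies on the implicit $_ throughout.
sub strip_implicit {
    local $_ = shift;          # work on a private copy of $_
    s/1X50_(RE|LI)_//;         # s/// acts on $_
    my @pieces = split /_/;    # split /_/ acts on $_
    return join '_', @pieces;
}

# Hypothetical helper: the same steps with the variable named explicitly.
sub strip_explicit {
    my ($line) = @_;
    $line =~ s/1X50_(RE|LI)_//;      # =~ binds s/// to $line
    my @pieces = split /_/, $line;   # the variable is passed as an argument
    return join '_', @pieces;
}

my $sample = '7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN';
print strip_implicit($sample), "\n";
print strip_explicit($sample), "\n";
```

Both calls print 7210316_W1A_MOTORTRAEGER_VORN_AUSSEN; the only difference is whether the target variable is named.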

Oh, and to explain the regexes:

s/1X50_(RE|LI)_//;

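This matches anything containing 1X50_RE_ or 1X50_LI_ (the (|) is a list of alternatives) and replaces it with nothing (the empty // at the end).

Looking at one of the other lines: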

s/([A-Z])(?=[0-9])/$1_-/g;

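The plain parentheses (...) around [A-Z] cause $1 to be set to whatever was matched inside them (in this case a letter A-Z). The (?=...) parentheses create a zero-width positive look-ahead assertion. This means the regex only matches if the next thing in the string matches the expression inside (a digit, 0-9), but that part is not included in the string being replaced.

/$1_-/ causes the matched part of the string, [A-Z], to be replaced with the value captured by the parentheses (...), with the _- you need added before the look-ahead part, [0-9].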
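
As a quick check of what the two look-ahead substitutions do on their own, the sketch below applies them to the second chunk of each sample line. The helper name split_code is invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply both look-ahead substitutions to one code chunk.
sub split_code {
    my ($code) = @_;
    $code =~ s/([A-Z])(?=[0-9])/$1_-/g;   # letter followed by a digit
    $code =~ s/([0-9])(?=[A-Z])/$1_-/g;   # digit followed by a letter
    return $code;
}

for my $code (qw(AX1A W1A AV21NA)) {
    printf "%-7s -> %s\n", $code, split_code($code);
}
# AX1A   -> AX_-1_-A
# W1A    -> W_-1_-A
# AV21NA -> AV_-21_-NA
```

Because the look-ahead consumes nothing, the digit or letter that triggers the match stays in place for the next substitution. Note that these two substitutions alone do not add the leading dash the question asks for in -W_-1_-A, and they insert dashes into AV21NA, where the question expects AV_21_NA.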

Answer 1 (score: 1)

You can do it like this:

while (<DATA>) {
    s/1X50_(LI|RE)_//;
    s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
    s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
    s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
    print;
}

__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

Output:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

Answer 2 (score: 1)

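use strict;
use warnings;
use feature 'say';       # say is not enabled by default

my $input = \*STDIN;     # assumption: records are read from standard input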

my $match 
    = qr/
    ( \d+          # group of digits
      _            # followed by an underscore
    )              # end group
    ( \p{Alpha}+ ) # group of alphas             
    ( \d+ )        # group of digits
    ( \p{Alpha}* ) # group of alphas
    ( \w+ )        # group of word characters
    /x
    ;

while ( my $record = <$input> ) { # record of input
    # match and capture
    if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
        say $pre 
             # if the alpha has length 1, add a dash before it
          . ( length $pre_alpha == 1 ? '-' : '' )
            # then the alpha
          . $pre_alpha
            # then the underscore
          . '_'
            # test if the length of the number is 1 and the length of the 
            # trailing alpha string is 1 
          . ( length( $num ) == 1 && length( $post_alpha ) == 1
              # if true, apply a dash before each 
            ? "-$num\_-$post_alpha" 
              # otherwise treat as AV21NA in example.
            : "$num\_$post_alpha"
            )
          . $post
          ;

    }
}

Answer 3 (score: 1)

#!/usr/bin/perl -w
use strict;
while (<>) {
    next if /^\s*$/;
    chomp;
    ## Remove those parts of the line we do not want
    ## You do not specify what, if anything, is constant about
    ## the parts you do not want. One of the following cases should
    ## serve.

    ## i) Remove the string _1X50_ and the next characters between
    ## two underscores:
    s/_1X50_.+?_/_/;

    ## ii) keep the first 2 and last 3 sections of each line.
    ## Uncomment this line and comment the previous one to use this:
    #s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;

    ## The line now contains only those regions we are 
    ## interested in. Split on '_' to collect an array of the
    ## different parts (@a):
    my @a=split(/_/);

    ## $a[1] is the second string, eg AX1A,W1A etc.
    ## We search for zero or more letters, followed by zero or more digits,
    ## followed by zero or more letters. The 'i' modifier makes the match
    ## case insensitive. In list context the match returns its capture
    ## groups, which we collect in the @matches array.
    my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);

    ## So, for each of the matched strings, if the length of the match
    ## is less than 2, add a '-' to the beginning of the string:
    foreach my $match (@matches) {
        if (length($match)<2) {
            $match="-" . $match;
        }
    }
    ## Now replace the original $a[1] with each string in
    ## @matches, connected by '_':
    $a[1]=join("_", @matches);

    ## Finally, build the string $kk by joining each element
    ## of the line (@a) by a '_', and print:
    my $kk=join("_", @a);
    print "$kk\n";
}
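
The core of this answer, a match in list context followed by a length check, can be tried in isolation. The following is a minimal sketch; the helper name dash_short_parts is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code such as AX1A into letters, digits, letters
# and prefix every one-character part with a dash.
sub dash_short_parts {
    my ($code) = @_;
    # In list context a match returns its capture groups: ($1, $2, $3).
    my @parts = $code =~ /^([a-z]*)(\d*)([a-z]*)/i;
    return join '_', map { length($_) == 1 ? "-$_" : $_ } @parts;
}

print dash_short_parts($_), "\n" for qw(AX1A W1A AV21NA);
# prints AX_-1_-A, -W_-1_-A and AV_21_NA, one per line
```

The sketch tests for a length of exactly 1, so an empty group (for example the trailing letters of a code like AV21) does not pick up a dash; the `<2` test in the answer's code would prefix an empty group as well.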

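Answer 4 (score: -1)

If you are a regex beginner, zostay's suggestion of splitting the line will probably make things easier. From a performance standpoint, however, avoiding the split is the better option. Here is how to do it without splitting: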

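open IN_FILE, "filename" or die "Whoops!  Can't open file: $!";
while (<IN_FILE>)
{
     # Note: this produces e.g. -AX-1-A, not the exact format shown in the question
     s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/
          or print "line didn't match: $_";   # $_ holds the current line
     s/1X50_(LI|RE)_//;
     print;                                   # emit the transformed line
}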

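Breaking down the first pattern:

s/// is the search-and-replace operator.
^ matches the start of the line.
\d{7}_ matches seven digits followed by an underscore.
\K is the "keep" operator, which behaves like a look-behind: anything matched before it does not become part of the string being replaced.
() each set of parentheses marks a chunk of the match to capture. The captures are placed, in order, in the match variables $1, $2, and so on.
[A-Z]{1,2} matches one or two uppercase letters. You can work out what the other two parenthesized parts mean.
-${1}-${2}-${3} replaces what was matched with the first three match variables, each preceded by a dash. The curly braces are only there to make the variable names unambiguous.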
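
To show what \K buys you, here is a small sketch comparing it with the more traditional approach of capturing the prefix and writing it back. The helper names with_keep and with_capture are invented for this illustration; \K requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: \K keeps the matched prefix out of the replaced text.
sub with_keep {
    my ($line) = @_;
    $line =~ s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/;
    return $line;
}

# Hypothetical helper: the same result without \K, by capturing the prefix
# as $1 and putting it back by hand.
sub with_capture {
    my ($line) = @_;
    $line =~ s/^(\d{7}_)([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/${1}-${2}-${3}-${4}/;
    return $line;
}

my $sample = '7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN';
print with_keep($sample), "\n";
print with_capture($sample), "\n";
```

Both calls print 7210315_-AX-1-A_MOTORTRAEGER_VORN_AUSSEN. With \K the capture groups start numbering at the first group after the prefix, which keeps the replacement shorter.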