I have data of this kind. Please help me: I am new to regular expressions, so please explain each step in your answer. Thanks.
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
From the lines above, I want to extract only this data:
7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
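Next, the code after the first underscore (for example AX1A) should be split up. If it starts with two consecutive letters, they are written as AX_; a single digit and a single letter become -1_ and -A_. After applying this pattern the code becomes AX_-1_-A_, and all other data should stay the same.
Likewise, the code on the next line is W1A. It starts with a single letter, W, which should be converted to -W_. The next character is a single digit, so it is converted the same way to -1_, and the last character is handled the same way, giving -W_-1_-A_.
We are only interested in applying this to the part that follows the leading number and its underscore: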
_AX1A_
_W1A_
_U1A_
_AV21NA_
The output should be:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
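Answer 0 (score: 1)
I don't know every detail of what you need to strip, so I'll infer the rules and let you clarify if this doesn't meet your needs.
As a first step, to take out 1X50_RE_ and 1X50_LI_, you can search for those strings and replace them with nothing.
Next, to break the second letter/digit code into small pieces, you can use a pair of substitutions, each using a lookahead. However, since you only want to touch the second chunk, I first split the whole line apart, process just the second piece, and then join the pieces back together.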
while (<$input>) {
# Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
s/1X50_(RE|LI)_//;
my @pieces = split /_/; # split the line into pieces at each underscore
# Just working with the second chunk. /g, means do it for all matches found
$pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
$pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A
# Join the pieces back together again
$_ = join '_', @pieces;
print;
}
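If you don't specify a variable, $_ is the one that many Perl operations work on. In the while condition, <$input> reads the next line from the filehandle named $input into $_. When no other variable is given, s///, split and print all operate on $_. The =~ operator is how you tell Perl to apply a regular-expression operation to $pieces[1] (or whatever variable you are working with) instead of $_. (For split or print you pass the variable as an argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)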
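To make the role of $_ and =~ concrete, here is a minimal sketch. The sample string and the variable $code are made up for illustration ($code stands in for $pieces[1]), and the replacement is written as ${1}_- only to make the end of the variable name explicit.

```perl
use strict;
use warnings;

# Implicit form: split works on $_ when no string is given.
$_ = "7210316_W1A_MOTORTRAEGER";
my @implicit = split /_/;        # splits $_
my @explicit = split /_/, $_;    # the same call with $_ spelled out
print "@implicit\n";             # prints: 7210316 W1A MOTORTRAEGER

# Explicit form: =~ binds the substitution to $code instead of $_.
my $code = "AX1A";               # hypothetical variable, stands in for $pieces[1]
$code =~ s/([A-Z])(?=[0-9])/${1}_-/g;
print "$code\n";                 # prints: AX_-1A

print;                           # no argument: prints $_, which is unchanged
print "\n";
```

Because the substitution was bound to $code, the line held in $_ is untouched and can still be printed or joined afterwards.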
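Oh, and to explain the regular expressions:
s/1X50_(RE|LI)_//;
This matches 1X50_RE_ or 1X50_LI_ wherever it appears in the line ((RE|LI) is a list of alternatives) and replaces it with nothing (the empty // at the end).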
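As a quick check of that substitution, the sketch below runs it over three of the sample lines from the question. The array names @lines and @stripped are chosen here for illustration; the third line has no 1X50 marker, so the substitution simply does not match and the line passes through unchanged.

```perl
use strict;
use warnings;

my @lines = (
    "7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN",
    "7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN",
    "7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD",
);

my @stripped;
for my $line (@lines) {
    # Copy the line, then delete the marker from the copy if it is present.
    (my $copy = $line) =~ s/1X50_(RE|LI)_//;
    push @stripped, $copy;
    print "$copy\n";
}
# prints:
# 7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
# 7210316_W1A_MOTORTRAEGER_VORN_AUSSEN
# 7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
```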
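Looking at one of the other lines:
s/([A-Z])(?=[0-9])/$1_-/g;
The ordinary parentheses (...) around [A-Z] set $1 to whatever matched inside them (here, a single letter A-Z). The (?=...) parentheses create a zero-width positive lookahead assertion. This means the regex only matches when the next thing in the string matches the lookahead expression (a digit, 0-9), but that part is not included in the string being replaced.
The replacement /$1_-/ causes the matched part of the string, [A-Z], to be replaced with the value captured by the parentheses (...), followed by the _- you need, which therefore lands just before the digit [0-9] seen by the lookahead.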
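To see what the two lookahead substitutions produce, the sketch below applies them to the four codes from the question. The helper name split_code is hypothetical, and ${1}_- is used in place of $1_- only to make the variable boundary explicit.

```perl
use strict;
use warnings;

# Apply the two lookahead substitutions to a bare code string.
sub split_code {                            # hypothetical helper name
    my ($code) = @_;
    $code =~ s/([A-Z])(?=[0-9])/${1}_-/g;   # letter followed by a digit
    $code =~ s/([0-9])(?=[A-Z])/${1}_-/g;   # digit followed by a letter
    return $code;
}

for my $code (qw(AX1A W1A U1A AV21NA)) {
    printf "%-7s -> %s\n", $code, split_code($code);
}
# prints:
# AX1A    -> AX_-1_-A
# W1A     -> W_-1_-A
# U1A     -> U_-1_-A
# AV21NA  -> AV_-21_-NA
```

Only the AX1A result matches the output requested in the question. The leading dash on single-letter codes (-W_) and the dash-free AV_21_NA form would need extra rules on top of these two substitutions.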
Answer 1 (score: 1)
How about something like this:
while (<DATA>) {
s/1X50_(LI|RE)_//;
s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
print;
}
__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD
Output:
7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
Answer 2 (score: 1)
use strict;
use warnings;
use feature 'say'; # say() is not available without this (Perl 5.10+)
my $match
= qr/
( \d+ # group of digits
_ # followed by an underscore
) # end group
( \p{Alpha}+ ) # group of alphas
( \d+ ) # group of digits
( \p{Alpha}* ) # group of alphas
( \w+ ) # group of word characters
/x
;
while ( my $record = <$input> ) { # record of input
# match and capture
if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
say $pre
# if the alpha has length 1, add a dash before it
. ( length $pre_alpha == 1 ? '-' : '' )
# then the alpha
. $pre_alpha
# then the underscore
. '_'
# test if the length of the number is 1 and the length of the
# trailing alpha string is 1
. ( length( $num ) == 1 && length( $post_alpha ) == 1
# if true, apply a dash before each
? "-$num\_-$post_alpha"
# otherwise treat as AV21NA in example.
: "$num\_$post_alpha"
)
. $post
;
}
}
Answer 3 (score: 1)
#!/usr/bin/perl -w
use strict;
while (<>) {
next if /^\s*$/;
chomp;
## Remove those parts of the line we do not want
## You do not specify what, if anything, is constant about
## the parts you do not want. One of the following cases should
## serve.
## i) Remove the string _1X50_ and the next characters between
## two underscores:
s/_1X50_.+?_/_/;
## ii) keep the first 2 and last 3 sections of each line.
## Uncomment this line and comment the previous one to use this:
#s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;
## The line now contains only those regions we are
## interested in. Split on '_' to collect an array of the
## different parts (@a):
my @a=split(/_/);
## $a[1] is the second string, eg AX1A,W1A etc.
## We search for zero or more letters, followed by zero or more digits,
## followed by zero or more letters. The 'i' modifier makes the match
## case-insensitive and the 'g' modifier makes the search global; in list
## context the match returns the captured groups, stored in @matches.
my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);
## So, for each of the matched strings, if the length of the match
## is less than 2, add a '-' to the beginning of the string:
foreach my $match (@matches) {
if (length($match)<2) {
$match="-" . $match;
}
}
## Now replace the original $a[1] with each string in
## @matches, connected by '_':
$a[1]=join("_", @matches);
## Finally, build the string $kk by joining each element
## of the line (@a) by a '_', and print:
my $kk=join("_", @a);
print "$kk\n";
}
Answer 4 (score: -1)
open IN_FILE, '<', "filename" or die "Whoops! Can't open file: $!";
while (<IN_FILE>)
{
s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/
or warn "line didn't match: $_"; # the current line is in $_, not $line
s/1X50_(LI|RE)_//;
print; # print the modified line
}
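Breaking down the first pattern:
s///
is the search-and-replace operator.
^
matches the start of the line.
\d{7}_
matches seven digits followed by an underscore.
\K
behaves like a lookbehind: everything matched before it is kept and does not become part of the string being replaced.
()
Each pair of parentheses marks a chunk of the match to capture. The captures are placed, in order, in the match variables $1, $2 and so on.
[A-Z]{1,2}
matches one or two uppercase letters. You can work out what the other two parenthesized parts mean.
-${1}-${2}-${3}
replaces what was matched with the first three match variables, each preceded by a dash. The only reason for the curly braces is to make clear where each variable name ends.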
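The effect of \K is easiest to see by running the same substitution with and without it. This is a small sketch assuming Perl 5.10 or later (where \K is available); the variable names $with_k and $without_k are made up for the comparison.

```perl
use strict;
use warnings;

my $line = "7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN";

# With \K: the seven digits and underscore must match, but they are kept.
(my $with_k = $line)
    =~ s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/;

# Without \K: the leading digits are part of the match and get replaced too.
(my $without_k = $line)
    =~ s/^\d{7}_([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/;

print "$with_k\n";      # prints: 7210315_-AX-1-A_MOTORTRAEGER_VORN_AUSSEN
print "$without_k\n";   # prints: -AX-1-A_MOTORTRAEGER_VORN_AUSSEN
```

Note that this replacement joins the pieces with dashes only, so its result differs from the AX_-1_-A form requested in the question.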