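I can't work out how to apply a regular expression that keeps only one of two specific consecutive entries in a column. I have the file below, in which a C-O pair appears under both number 1 and number 2, as shown. I want to write a new file in which only the C-O pair of number 1 remains. This needs to be repeated throughout the file, e.g. between numbers 2 and 3 (keeping number 2), between numbers 3 and 4 (keeping number 3), and so on.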
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
Here is what I have so far; I hope my logic is at least half clear. I'm still learning, so any comments are much appreciated!
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'data.txt';
open my $fh, '<', $file or die "Can't read $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @column = split(/\t/, $line);
    if ($column[1] =~ s/COCO/\s+/g) {
        print "@column\n";
    }
}
Answer 0 (score: 1)
You can do this all in one pass. Read the whole file into a string, then run it through this regex.
# s/(?m)(^\h*(\d+)\h+C.*\s+^\h*\2\h+O.*\n)\s*^\h*(?!\2)(\d+)\h+C.*\s+^\h*\3\h+O.*\n(?!\s*\z)/$1/g

(?xm)
# C-O at the bottom of a segment
(                          # (1 start), keep this
  ^ \h*                    # start of line, optional indent
  ( \d+ )                  # (2), col 1 number
  \h+ C .* \s+             # C
  ^ \h*                    # next line
  \2 \h+ O .* \n           # \2 .. O
)                          # (1 end)
# Throw this away:
# C-O at the top of the next segment
\s*
^ \h*                      # start of line, optional indent
(?! \2 )                   # not \2
( \d+ )                    # (3), col 1 number
\h+ C .* \s+               # C
^ \h*                      # next line
\3 \h+ O .* \n             # \3 .. O
(?! \s* \z )               # not the last in file
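The first step, reading the whole file into one string, could look roughly like the sketch below when the data lives on disk. The `slurp` helper is a hypothetical name, and the temporary file only stands in for a real input such as the `data.txt` from the question so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the whole content of a file as one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    local $/;                  # undefined record separator = slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Demo on a temporary file (assumption: a real script would pass its own path).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "1 C 36.8857\n1 O -118.2564\n";
close $tmp_fh;

my $text = slurp($tmp_name);
print $text;                   # the substitution would be applied to $text here
```

Keeping `local $/` inside the sub restores normal line-by-line reading for the rest of the script once the helper returns.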
Perl code:
use strict;
use warnings;

local $/ = undef;    # slurp mode: read everything after __DATA__ as one string
my $input = <DATA>;

print "Input:\n$input\n";

$input =~
  s/(?xm)
    # C-O at the bottom of a segment
    (                          # (1 start), keep this
      ^ \h*                    # start of line, optional indent
      ( \d+ )                  # (2), col 1 number
      \h+ C .* \s+             # C
      ^ \h*                    # next line
      \2 \h+ O .* \n           # \2 .. O
    )                          # (1 end)
    # Throw this away:
    # C-O at the top of the next segment
    \s*
    ^ \h*                      # start of line, optional indent
    (?! \2 )                   # not \2
    ( \d+ )                    # (3), col 1 number
    \h+ C .* \s+               # C
    ^ \h*                      # next line
    \3 \h+ O .* \n             # \3 .. O
    (?! \s* \z )               # not the last in file
  /$1/g;

print "Output:\n$input\n";
__DATA__
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Code output:
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
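If a single large regex feels hard to maintain, the same rule can also be sketched line by line. This is a different technique from the regex above, not a drop-in equivalent, and it may differ from it on unusual inputs. It assumes whitespace-separated columns with the number in column 1 and the atom label in column 2; `drop_repeated_co` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the file, returns the lines to keep.
sub drop_repeated_co {
    my @lines = @_;

    # Columns 1 and 2 (number, atom label) of every line.
    my @rec = map { [ (split ' ', $_)[0, 1] ] } @lines;

    # True when lines $i and $i+1 are "n C ..." followed by "n O ...".
    my $is_co = sub {
        my ($i) = @_;
        return 0 if $i < 0 || $i + 1 > $#rec;
        my ($c, $o) = @rec[$i, $i + 1];
        return 0 unless defined $c->[1] && defined $o->[1];
        return ($c->[1] eq 'C' && $o->[1] eq 'O' && $c->[0] eq $o->[0]) ? 1 : 0;
    };

    my @keep;
    my $last_dropped = -1;
    my $i = 0;
    while ($i <= $#lines) {
        if (   $i >= 2
            && $i - 1 != $last_dropped       # the preceding pair was kept
            && $i + 2 <= $#lines             # pair is not the last thing in the file
            && $is_co->($i - 2)              # previous number ends with C-O
            && $is_co->($i)                  # this number starts with C-O
            && $rec[$i][0] ne $rec[$i - 2][0])
        {
            $last_dropped = $i + 1;          # drop both lines of the pair
            $i += 2;
            next;
        }
        push @keep, $lines[$i];
        $i++;
    }
    return @keep;
}

my @input = split /\n/, <<'END';
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
END

print "$_\n" for drop_repeated_co(@input);
```

On the sample data this prints the same six lines as the Output listing above. Comparing the numbers with `ne` on whole fields also avoids treating `1` as a prefix of `10`.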