I have the following input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
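Explanation: I want the output file to contain every line that starts with "#" and every line whose column 1 value is unique. When column 1 is repeated, first take the number in $2 and add the length of $5 (on the same line). If the result is smaller than $2 of the next line, print both lines; if the result is larger than $2 of the next line, compare the DP values and print only the line with the best DP.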
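To make the rule concrete, here is a minimal Python sketch of the comparison between two consecutive lines that share column 1. The helper names `span_end`, `dp_value`, and `resolve` are hypothetical, and the sketch assumes whitespace-separated fields with `DP=<number>` in column 6. A result exactly equal to the next `$2` is treated like an overlap, which is an assumption because the description above does not cover that case.

```python
def span_end(fields):
    """Column 2 plus the length of column 5 (hypothetical helper)."""
    return int(fields[1]) + len(fields[4])


def dp_value(fields):
    """Numeric part of a 'DP=<n>' entry in column 6 (assumed format)."""
    return int(fields[5].split("=", 1)[1])


def resolve(prev_line, next_line):
    """Return the lines to keep out of two consecutive lines with the same id."""
    prev, nxt = prev_line.split(), next_line.split()
    if span_end(prev) < int(nxt[1]):
        # No overlap: both lines are kept.
        return [prev_line, next_line]
    # Overlap (or, by assumption, a tie): keep only the line with the higher DP.
    # If both DP values are equal, this sketch keeps the later line.
    return [prev_line] if dp_value(prev) > dp_value(nxt) else [next_line]


a = "FVEG_04063 1265 . AA ATTAT DP=19"
b = "FVEG_04063 1266 . AA ATTA DP=45"
c = "FVEG_04063 2703 . GTTTTTTTT ATA DP=1"

print(resolve(a, b))  # 1265 + 5 = 1270 is not below 1266, so DP decides
print(resolve(b, c))  # 1266 + 4 = 1270 is below 2703, so both are kept
```

For the first pair only the DP=45 line survives, which matches the desired output above.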
What I have tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I am new to the awk world... I think it would be simpler to write a small multi-line script... or, even better, a bash solution.
Thanks in advance.
Answer 0 (score: 1)
A Perl solution. You may need to fix the edge cases, since you did not provide data to test them. @last remembers the last line, @F is the current line.
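#!/usr/bin/perl
use warnings;
use strict;

my @last;    # best candidate so far, not yet printed
while (<>) {
    # Header lines pass through and never become a candidate.
    if (/^#/) { print; next }
    my @F = split;
    if (@last and $last[0] eq $F[0]) {
        if ($last[1] + length($last[4]) < $F[1]) {
            # No overlap: the previous candidate is final.
            print "@last\n";
        } else {
            # Overlap: keep whichever line has the higher DP.
            (my $dp_l = $last[5]) =~ s/DP=//;
            (my $dp_f = $F[5]) =~ s/DP=//;
            @F = @last if $dp_l > $dp_f;
        }
    } elsif (@last) {
        # New id: flush the candidate of the previous id.
        print "@last\n";
    }
    @last = @F;
}
print "@last\n" if @last;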
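For comparison, the same "remember the last candidate" idea can be sketched in Python. This is only an illustration under assumptions: the function name `filter_lines` is hypothetical, fields are assumed to be whitespace-separated with `DP=<number>` in column 6, a position tie is treated as an overlap, and on equal DP the later line wins.

```python
def filter_lines(lines):
    """Yield header lines and, per id, the lines that survive the overlap rule."""
    best = None  # fields of the current candidate, not yet emitted
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line  # headers pass through and never become a candidate
            continue
        fields = line.split()
        if not fields:
            continue  # skip blank lines (an assumption, not in the question)
        if best is not None:
            same_id = best[0] == fields[0]
            overlap = same_id and int(best[1]) + len(best[4]) >= int(fields[1])
            if not overlap:
                yield " ".join(best)  # candidate can no longer be displaced
            elif int(best[5][3:]) > int(fields[5][3:]):
                continue  # previous candidate has the higher DP, keep it
        best = fields
    if best is not None:
        yield " ".join(best)


sample = """##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41""".splitlines()

for out in filter_lines(sample):
    print(out)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample` to process a real file.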
Answer 1 (score: 1)
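As requested, an awk solution. I commented the code heavily, so hopefully the comments serve as the explanation. To summarize, the basic idea is to remember the best record seen so far for the current id and to print it only once a non-overlapping line or a new id shows that it can no longer be replaced.
Code: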
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }

# Set up variables on the first line of input and go to next line.
! col1 { # If col1 is unset:
    col1 = $1;
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0; # Note dp is turned into int here by +0
    best = $0;
    next;
}

# For all other lines of input:
{
    # If col1 is the same as previous line:
    if ($1 == col1) {
        # Check col2
        if (len5 + col2 < $2) # Previous len5 + col2 < current $2
            print best; # Print previous record
        # Check DP
        else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
            next; # Go to next record, do not update variables.
    }
    else { # Different ids, print best line from previous id and update id.
        print best;
        col1 = $1;
    }

    # Update variables to current record.
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0;
    best = $0;
}

# Print the best record of the last id.
END { print best }
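Note: dp is computed by taking the substring of $6 that starts at index 4 and runs to the end of the field. The + 0 is added to force the value to be converted to a number, so that the comparison works as expected.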
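The extraction described in that note can be cross-checked with a small Python sketch. awk's `substr` is 1-based, so `substr($6, 4)` covers the same characters as the 0-based slice `[3:]` in Python, and `int()` plays the role of `+ 0`. The function name `parse_dp` is hypothetical, and the `DP=` prefix is assumed from the sample data.

```python
def parse_dp(field):
    """Return the integer after a leading 'DP=' (assumed column 6 format)."""
    if not field.startswith("DP="):
        raise ValueError(f"unexpected DP field: {field!r}")
    return int(field[3:])  # same characters as awk's substr($6, 4)


# Numeric comparison: 19 is not smaller than 7.
print(parse_dp("DP=19") < parse_dp("DP=7"))  # False

# String comparison, for contrast: "1" sorts before "7", so the result flips.
print("19" < "7")  # True
```

The contrast between the two comparisons is why the conversion to a number matters before DP values are compared.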