Question

I have the following input file:

##Names
##Something
FVEG_04063  1265    .   AA  ATTAT   DP=19
FVEG_04063  1266    .   AA  ATTA    DP=45
FVEG_04063  2703    .   GTTTTTTTT   ATA DP=1
FVEG_15672  2456    .   TTG AA  DP=71
FVEG_01111  300 .   CTATA   ATATA   DP=7
FVEG_01111  350 .   AGAC    ATATATG DP=41

My desired output file:

##Names
##Something
FVEG_04063  1266    .   AA  ATTA    DP=45
FVEG_04063  2703    .   GTTTTTTTT   ATA DP=1
FVEG_15672  2456    .   TTG AA  DP=71
FVEG_01111  300 .   CTATA   ATATA   DP=7
FVEG_01111  350 .   AGAC    ATATATG DP=41

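Explanation: In my output file I want to print all lines that start with "#" and all lines whose column 1 value is "unique". If a column 1 value is repeated, then first take the number in $2 and add the length of $5 (on the same line). If the result is smaller than the $2 of the next line, print both lines; but if the result is greater than the $2 of the next line, compare the DP values and print only the line with the best (highest) DP.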
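
The rule can be checked by hand on two pairs of lines from the sample input. The sketch below is only an illustration: overlap_check is a hypothetical helper name, and treating an equal value as an overlap is an assumption, since the description does not cover that case.

```shell
# Hypothetical helper (not from the original post): decide what the rule
# says about two consecutive lines that share the same column 1 value.
overlap_check() {
    # Argument 1 is the previous line, argument 2 is the next line.
    printf '%s\n%s\n' "$1" "$2" | awk '
        # First line: remember its $2 plus the length of its $5.
        NR == 1 { end = $2 + length($5); next }
        # Second line: compare the remembered value with this $2.
        { print (end < $2 ? "print both" : "compare DP") }
    '
}

overlap_check "FVEG_04063 1265 . AA ATTAT DP=19" "FVEG_04063 1266 . AA ATTA DP=45"
overlap_check "FVEG_04063 1266 . AA ATTA DP=45" "FVEG_04063 2703 . GTTTTTTTT ATA DP=1"
```

The first call prints "compare DP" because 1265 + 5 = 1270 is greater than 1266, so only the DP=45 line survives. The second call prints "print both" because 1266 + 4 = 1270 is smaller than 2703.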

What I have tried:

awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt

I am new to the awk world... I think it would be simpler to write a small multi-line script... or, better still, a bash solution.

Thanks in advance.

Answer 1

A Perl solution. You may need to fix the edge cases, since you did not provide data to test them.

@last remembers the previous line, @F is the current line.

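#!/usr/bin/perl
use warnings;
use strict;

my (@F, @last);
while (<>) {
    # Comment lines are printed as-is and never stored in @last,
    # otherwise the last comment line would be printed twice.
    if (/^#/) {
        print;
        next;
    }
    @F = split;

    if (@last) {
        if ($last[0] eq $F[0]) {
            if ($last[1] + length($last[4]) < $F[1]) {
                # No overlap: the remembered line is final.
                print "@last\n";
            } else {
                # Overlap: keep whichever line has the higher DP.
                my $dp_l = $last[5];
                my $dp_f = $F[5];
                s/DP=// for $dp_l, $dp_f;
                @F = @last if $dp_l > $dp_f;
            }
        } else {
            # New ID: the remembered line is final.
            print "@last\n";
        }
    }
    @last = @F;
}
print "@last\n" if @last;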

Answer 2

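As requested, an awk solution. I have commented the code heavily, so hopefully the comments serve as the explanation. In summary, the basic idea is:

Match comment lines, print them, and go to the next line.
Match the first data line (done by checking whether we have started remembering col1 yet).
On all subsequent lines, check the values against those remembered from the previous line. The "best" record, i.e. the one that should be printed for each unique ID, is remembered each time and updated according to the conditions set out in the question.
Finally, output the last "best" record for the last unique ID.

Code: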

# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }

# Set up variables on the first line of input and go to next line.
! col1 { # If col1 is unset:
  col1 = $1; 
  col2 = $2; 
  len5 = length($5); 
  dp = substr($6, 4) + 0; # Note dp is turned into int here by +0
  best = $0; 
  next; 
}

# For all other lines of input:
{
  # If col1 is the same as previous line:
  if ($1 == col1) {
    # Check col2
    if (len5 + col2 < $2) # Previous len5 + col2 < current $2
      print best; # Print previous record
    # Check DP
    else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
      next; # Go to next record, do not update variables.
  }
  else { # Different ids, print best line from previous id and update id.
    print best;
    col1 = $1;
  }

  # Update variables to current record.
  col2 = $2;
  len5 = length($5);
  dp = substr($6, 4) + 0;
  best = $0;
}

# Print the best record of the last id.
END { print best }

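Note: dp is computed by taking the substring of $6 that starts at index 4 and runs to the end. The + 0 is added to force the value to be converted to a number, so that the comparison works as expected.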
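
Why the + 0 matters can be seen with a small sketch. compare_dp is a hypothetical helper written only for this illustration; it compares two DP fields once as strings and once as numbers.

```shell
# Hypothetical helper (not from the original post): show how "first < second"
# evaluates for two DP fields, as a string comparison and as a numeric one.
compare_dp() {
    printf '%s %s\n' "$1" "$2" | awk '{
        # substr() always returns a string, so a and b are strings here.
        a = substr($1, 4)
        b = substr($2, 4)
        # Adding 0 converts each value to a number before comparing.
        print "string:", (a < b), "numeric:", (a + 0 < b + 0)
    }'
}

compare_dp "DP=7" "DP=45"
```

For DP=7 and DP=45 this prints "string: 0 numeric: 1": as strings, "7" sorts after "45", so only the numeric comparison gives the intended result.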

awk if statements and pattern matching
