Finding the positions of all occurrences of a pattern in a string when each line defines a different pattern in another column (UNIX)

Time: 2017-08-03 08:43:20

Tags: bash unix awk

I have a tab-separated list file that looks like this:

1 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK V

2 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK M

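And so on. The first column is a number, the second column is the corresponding protein sequence, and the third column is a single character: the pattern to be found in that row's sequence. So the desired output would look like this: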

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125

2:positions: 1 18 22 110 134

I tried using awk and its index function:

nawk -F'\t' -v p=$3 'index($2,p) {printf "%s:positions:", NR; s=$2; m=0; while((n=index(s, p))>0) {m+=n; printf "%s ", m; s=substr(s, n+1)} print ""}' "file.tsv"

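However, this only works when the -v variable is given as a literal character or string, not as $3. How can I do this in a UNIX environment? Thanks in advance.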

7 answers:

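Answer 0 (score: 0)

You can do it like this:

awk -F'\t' '{ len=split($2,arr,""); printf "%s:positions:",$1 ; for(i=1;i<=len;i++) { if(arr[i] == $3 ) { printf "%s ",i } }; print "" }' file.tsv

This first splits the subject $2 into an array of single characters, then loops over it from index 1 (awk arrays produced by split are 1-based), checking whether each character equals $3 and printing the array index when it does.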

Answer 1 (score: 0)

Perl to the rescue:

perl -wane '
    print "$F[0]:positions:";
    $i = 0;
    print " ", $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
    print "\n";
' -- file

If the space after the colon is a problem, the loop can be complicated into

$i = $f = 0;
$f = print " " x $f, $i while ($i = 1 + index $F[1], $F[2], $i) > 0;

Answer 2 (score: 0)

gawk solution:

awk -v FPAT="[[:digit:]]+|[[:alpha:]]" '{ 
       r=$1":positions:"; for(i=2;i<NF;i++) { if($i==$NF) r=r" "i-1 } print r 
    }' file.tsv
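  • FPAT="[[:digit:]]+|[[:alpha:]]" - a regular expression that defines what a field looks like: either a run of digits or a single letter (FPAT is a gawk extension)

  • for(i=2;i<NF;i++) - iterates over the fields that hold the letters of column 2; the last field, $NF, is the pattern

Output: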

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134

Answer 3 (score: 0)

awk '{
  str=$1":positions:";
  n=0;cnt=split($2,a,$3);          # use $3 as the delimiter to split $2 into array a
  for(i=1;i<cnt;i++){              # every piece except the last is followed by one $3
    n+=length(a[i])+1;str=str" "n  # position of that $3 = pieces and delimiters seen so far
  }
  print str
}' file.tsv

Answer 4 (score: 0)

$ awk '{out=$1 ":positions:"; for (i=1;i<=length($2);i++) { c=substr($2,i,1); if (c == $3) out = out " " i}; print out}' file
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134
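
The character-by-character comparison above assumes that column 3 holds a single character. In the question's attempt, the shell expands `$3` in `-v p=$3` before awk starts, so it never refers to the third field; reading `$3` inside the awk program avoids that. The following is a minimal sketch of the question's `index` loop with that change, which would also handle a multi-character pattern. The function name `find_positions` is hypothetical, and the columns are assumed to be separated by whitespace.

```shell
# Hypothetical helper: reads rows "id sequence pattern" from stdin and
# prints "id:positions: p1 p2 ..." for each non-empty row.
find_positions() {
  awk 'NF >= 3 {
    out = $1 ":positions:"
    s = $2; offset = 0
    # index() returns the 1-based start of the first match in s, or 0 if none
    while ((n = index(s, $3)) > 0) {
      offset += n               # translate back to a position in the full sequence
      out = out " " offset
      s = substr(s, n + 1)      # resume one character after the match start
    }
    print out
  }'
}
```

It would be called as `find_positions < file.tsv`. Because the search resumes one character after each match start, overlapping occurrences of a longer pattern are reported as well.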

Answer 5 (score: 0)

A simple perl solution:

use strict;
use warnings;

while( <DATA> ) {
    chomp;

    next if /^\s*$/;        # just in case if you have empty line

    my @data = split "\t";  # record is tabulated

    my %result;             # hash to store result
    my $c = 0;              # position in the string

    map { $c++; push @{$result{$data[0]}}, $c if $_ eq $data[2] } split '', $data[1];

    print "$data[0]:position:"
          . join(' ', @{$result{$data[0]}}) # assemble result to desired form
          . "\n";
}

__DATA__
1   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V

2   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M

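Answer 6 (score: -1)

I would use a small script that loops over each line of the file, takes the last field as search_string, and then uses grep to get the positions of search_string. All that is left to do is shift the results, because the offsets grep reports are off by 1. The sed command removes the newlines from the grep output.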

while read -r p; do
    search_string=$(echo "$p" | awk '{print $NF}')   # last field is the pattern
    echo "$p" | grep -aob "$search_string" | sed ':a;N;$!ba;s/\n/ /g'
done < file.tsv
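
The shift mentioned above could be sketched as follows. This is only an illustration, not the answerer's script: the function name `grep_positions` is hypothetical, GNU grep is assumed (with `-o` and `-b` it prints one `offset:match` line per hit), and grep is run on the sequence column alone so that the offsets are not skewed by the first column or by the pattern column matching itself.

```shell
# Hypothetical helper: reads rows "id sequence pattern" from stdin.
# Assumes GNU grep, where -o -b prints "byte_offset:match" for every hit.
grep_positions() {
  while read -r id seq pat; do
    [ -n "$pat" ] || continue              # skip blank or incomplete lines
    printf '%s:positions:' "$id"
    # Search only the sequence; -F treats the pattern as a fixed string.
    printf '%s\n' "$seq" | grep -obF -- "$pat" |
      awk -F: '{ printf " %d", $1 + 1 }'   # offsets are 0-based, so add 1
    printf '\n'
  done
}
```

It would be called as `grep_positions < file.tsv`. The offsets are byte offsets, which coincide with character positions for plain ASCII sequences like these, and `grep -o` reports non-overlapping matches only.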