Question

I have a tab-delimited file that looks like this:

1 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK V

2 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK M

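And so on. The first column is a number, the second column is the corresponding protein sequence, and the third column is a single character: the pattern to find in the sequence on that line. The desired output would therefore look like this: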

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125

2:positions: 1 18 22 110 134

I tried using awk and its index function.

nawk -F'\t' -v p=$3 'index($2,p) {printf "%s:positions:", NR; s=$2; m=0; while((n=index(s, p))>0) {m+=n; printf "%s ", m; s=substr(s, n+1)} print ""}' "file.tsv"

However, -v only lets me set the variable to a fixed character or string, not to $3. How can I do this in a Unix environment? Thanks in advance.

Answer 1

You could do it like this:

awk -F'\t' '{ len=split($2,arr,""); printf "%s:positions:",$1 ; for(i=1;i<=len;i++) { if(arr[i] == $3 ) { printf "%s ",i } }; print "" }' file.tsv

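This first splits the subject $2 into an array of single characters (split numbers the elements starting at 1), then loops over the array, compares each element with $3, and prints the array index whenever they match.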
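The index-based loop from the question can also be kept if the pattern is read from $3 inside the awk program instead of being passed in with -v, since -v is evaluated once before any line is read. What follows is a minimal sketch, not the answerer's method: it assumes tab-separated input, and the wrapper name find_positions is made up for illustration.

```shell
# Hypothetical wrapper: reads tab-separated "id<TAB>sequence<TAB>pattern"
# lines from the named files (or from stdin when no file is given).
find_positions() {
    awk -F'\t' '
    NF >= 3 && $3 != "" {
        out = $1 ":positions:"
        s = $2; p = $3; off = 0
        # index() returns the 1-based position of p in s, or 0 if absent
        while ((n = index(s, p)) > 0) {
            off += n                 # position relative to the full sequence
            out = out " " off
            s = substr(s, n + 1)     # continue searching after this hit
        }
        print out
    }' "$@"
}
```

It could be called as find_positions file.tsv. Blank lines and lines with an empty third column are skipped by the NF >= 3 && $3 != "" guard.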

Answer 2

Perl to the rescue:

perl -wane '
    print "$F[0]:positions:";
    $i = 0;
    print " ", $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
    print "\n";
' -- file

If the space after the : is a problem, you can make it slightly more complicated:

$i = $f = 0;
$f = print " " x $f, $i while ($i = 1 + index $F[1], $F[2], $i) > 0;

Answer 3

gawk solution:

awk -v FPAT="[[:digit:]]+|[[:alpha:]]" '{ 
       r=$1":positions:"; for(i=2;i<NF;i++) { if($i==$NF) r=r" "i-1 } print r 
    }' file.tsv

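FPAT="[[:digit:]]+|[[:alpha:]]" - a regular expression that defines what a field value looks like (a run of digits, or a single letter)
for(i=2;i<NF;i++) - iterates over the fields (the letters of column 2)

Output: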

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134

Answer 4

awk '{
  str=$1":positions:";
  n=0;cnt=split($2,a,$3);          # split $2 into array a, using $3 as the delimiter
  for(i=1;i<cnt;i++){              # every piece except the last is followed by a delimiter
    n+=length(a[i])+1;str=str" "n  # the delimiter $3 sits at n+length(a[i])+1
  }
  print str
}' file.tsv

Answer 5

$ awk '{out=$1 ":positions:"; for (i=1;i<=length($2);i++) { c=substr($2,i,1); if (c == $3) out = out " " i}; print out}' file
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134

Answer 6

A simple Perl solution:

use strict;
use warnings;

while( <DATA> ) {
    chomp;

    next if /^\s*$/;        # just in case if you have empty line

    my @data = split "\t";  # record is tabulated

    my %result;             # hash to store result
    my $c = 0;              # position in the string

    map { $c++; push @{$result{$data[0]}}, $c if $_ eq $data[2] } split '', $data[1];

    print "$data[0]:positions:"
          . join(' ', @{$result{$data[0]}}) # assemble result to desired form
          . "\n";
}

__DATA__
1   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V

2   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M

Answer 7

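I would use a small script that loops over every line of the file, takes the last field as search_string, and then uses grep to get the positions of search_string. All that is left to do is shift the results, because the offsets are off by 1. The sed command removes the newlines from the grep output.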

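while IFS= read -r p; do
    # the last whitespace-separated field of the line is the pattern
    search_string=$(printf '%s\n' "$p" | awk '{print $NF}')
    printf '%s\n' "$p" | grep -aob -- "$search_string" | sed ':a;N;$!ba;s/\n/ /g'
done < file.tsv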
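The shifting step mentioned above can be sketched as follows. grep -b reports 0-based byte offsets counted from the start of the text it is given, so this version passes only the sequence column to grep and adds 1 to each offset. It is a rough sketch under several assumptions: tab-separated input, single-byte characters, and a grep that supports -o and -b (GNU and BSD grep do, POSIX does not require them). The function name grep_positions is made up for illustration.

```shell
# Hypothetical helper: reads "id<TAB>sequence<TAB>pattern" lines on stdin.
grep_positions() {
    while IFS=$(printf '\t') read -r id seq pat; do
        [ -n "$id" ] || continue     # skip blank lines
        # grep -ob prints one "offset:match" line per hit (0-based offset);
        # cut keeps the offset, awk converts it to a 1-based position.
        pos=$(printf '%s\n' "$seq" | grep -oabF -- "$pat" | cut -d: -f1 |
              awk '{ printf " %d", $1 + 1 }')
        printf '%s:positions:%s\n' "$id" "$pos"
    done
}
```

It could be run as grep_positions < file.tsv. The -F flag makes grep treat the pattern as a literal string rather than a regular expression.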

Find the positions of all occurrences of a pattern in a string when each line defines a different pattern in another column (UNIX)