Question

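I have a list of queries and hit GIs in one file (file1). I have another file (file2) that contains the full hit names. I want to replace each hit GI in file1 with the full hit name from file2. Each hit GI should be replaced by the entry with the same GI, on the line of its respective Query.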

file1

 1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
 2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ 
 3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  
 4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ 
 5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_

file2

1  >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2  >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3  >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]

Desired output:

1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]

Answer 1

If I run:

file1=file1.txt; file2=$(cat file2.txt|sed -e "s/>gi/Query=gi/g"|sed -e "s/_ref_/ ref_/g");IFS='\n';echo $file2| awk  'NR==FNR { _[$2]=$2;  f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 } NR!=FNR { if(_[$2] != "") print $0" "f1_line[key]}'  - $file1

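To explain what it does, here it is as a script, with the usage described below. In the script I set the default files to file1.rasta and file2.rasta, so when run without arguments it asks for my input: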

./run.sh 
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following: 
 file1=file1.rasta
file2=file2.rasta
 does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------

Running it:

./run.sh ./file1.fasta ./file2.fasta 
1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_   hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]

The bash script run.sh, which is the one-liner above broken down with explanations:

#!/bin/bash

 function line() { 
  echo  -e "-------------------------------------------------------------------------------"
 }

 function usage() { 
  line;
  echo "usage:"
  echo $0 file1.fasta file2.fasta is the same as  line below
  echo $0 ./file1.fasta ./file2.fasta
  echo -- This is if files are elsewhere
  echo $0 /path/to/file1.fasta /path/to/file2.fasta
  line;
 } 


 file1=$1;
 file2=$2;

 if [ $# -lt 2 ]; then 
    # Set file1 variable as filename file1.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file1=/path/to/file1.fasta
    file1=file1.rasta; 


    # Set file2 variable as filename file2.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file2=/path/to/file2.fasta

    file2=file2.rasta;
    line;
    echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
    line;
fi
 # Check we have both files whether its variables or if not variables
 # matches defined files
 if  [ ! -f  $file1 ]  || [ ! -f  $file2 ]; then
  line;
  echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
  line
  usage
  exit 2;
 fi


 # Define file 2 variable which cats file2.fasta again like above ensure 
 # the file2.fasta can be catted from this path, it pipes it into sed and changes:
 # '>gi' to 'Query=gi' and also changes '_ref_'  to ' ref_'
 # this now matches the same pattern as file1

 cfile2=$(sed -e "s/>gi/Query=gi/g" -e "s/_ref_/ ref_/g" $file2);

 # Set the internal field separator to \n which is the output of variable file2 
 IFS='\n';

  # debug enable this if you now want to see manipulated file2
  # echo $cfile2

 # Echo out cfile2 which now with the above ifs makes it like the file 
 #  formatting making \n the separator - pipe into awk command which 
 # matches against both files
 # Set up a key whilst in one which contains pattern match after:
 # .{number}_{space}* where this is what separates file2's content where tag starts.
 # If the values from $2 match on both lines print out $0 which is everything from file1 
 # plus the key which contains the details
 # the echo $cfile2  is then represented as - before $file1  at the end in effect its the first file value which is the call to file1 

echo $cfile2| awk 'NR==FNR { 
  _[$2]=$2; 
  if( match($0, /\.[0-9]\_ /)) { 
    var1=substr($0, RSTART+3);  
   }
  } 
  NR!=FNR { 
     if(_[$2] != "") print $0" "var1
  }' - $file1

## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would 
## ensure it captures the entire tag from file2
 ##echo $cfile2| awk 'NR==FNR { 
 ## _[$2]=$2; 
 ## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 
 ## } 
 ## NR!=FNR { 
 ##    if(_[$2] != "") print $0" "f1_line[key]
 ## }' - $file1
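
In the run shown above, every line ends with the same description and "protein" comes out as "protei". Two things in the script would explain this: `IFS='\n'` in single quotes sets the separator to the two characters backslash and `n` rather than a newline, so the unquoted `echo $cfile2` drops every letter n, and `var1` is overwritten on each file2 line, so only the last description survives. A minimal Python sketch of the lookup the script is aiming for, keyed on each line's own Hit id, might look like this. The function name `annotate_hits` is hypothetical, and the sample lines are copied from the question with their leading numbers kept.

```python
import re

# Sample lines copied from the question (leading line numbers kept).
FILE1 = """\
1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_
"""
FILE2 = """\
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
"""


def annotate_hits(file1_text, file2_text):
    """Replace each Hit in file1 with the full name stored under the same id in file2."""
    names = {}
    for line in file2_text.splitlines():
        # The id is the first token after '>', the rest of the line is the name.
        m = re.search(r">(\S+)\s+(.*)", line)
        if m:
            names[m.group(1)] = m.group(2).strip()

    out = []
    for line in file1_text.splitlines():
        # file1 writes the id as "gi_... ref_..."; file2 joins the parts with '_'.
        m = re.search(r"Hit=(\S+)\s+(\S+)", line)
        if not m:
            continue
        hit_id = m.group(1) + "_" + m.group(2)
        if hit_id in names:
            out.append(line[:m.start()] + "Hit=" + hit_id + " " + names[hit_id])
    return out


if __name__ == "__main__":
    for row in annotate_hits(FILE1, FILE2):
        print(row)
```

Because the description is stored per id rather than in a single variable, each Hit receives its own name, which is what the desired output in the question shows.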

Answer 2

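One of the quickest (and simplest) solutions is to use the search method of Python's re module to match patterns in strings. I wrote an example of how to do it (you will have to do some checks to confirm the results are correct...):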

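import re

# Paths are placeholders; adjust them to your own files.
f1path = "file1.txt"   # Query/Hit list
f2path = "file2.txt"   # full hit names
f3path = "output.txt"  # result file (name assumed)
namesD = dict()

# Assumes each file2 line starts with '>' (no leading line number).
with open(f2path, "r") as file2:
    for line in file2:
        strH = re.search(" ", line)
        if strH is None:
            continue  # no description on this line
        # Skip the leading '>' to get the id; the rest is the name.
        idN = line[1:strH.start()]
        namesD[idN] = line[strH.end():].strip()

with open(f1path, "r") as file1, open(f3path, "w") as file3:
    for line in file1:
        strH = re.search("Hit=", line)
        if strH is None:
            continue
        # "gi_... ref_..." in file1 corresponds to "gi_..._ref_..." in file2.
        idN = line[strH.end():].strip().replace(' ', '_')
        if idN in namesD:
            file3.write("Hit=" + idN + " " + namesD[idN] + "\n")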

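The idea is to first extract the ids and their names from file2 and add them to a dict (the id is the key, the name is the value). Then you read the first file line by line, extract the id from the Hit, and try to match it in the dict. If it matches, you can write the result to a third file, or do whatever you like with it.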
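
The checks mentioned above could be sketched roughly as follows: list the hit ids from file1 that have no entry in file2, and the ids that appear more than once in file2 (where a later entry would silently overwrite an earlier one in the dict). The helper name `check_ids` and its return shape are hypothetical, and the second hit id in the sample is made up for illustration.

```python
from collections import Counter


def check_ids(hit_ids, name_ids):
    """Return (missing, duplicated): hits absent from file2, and ids repeated in file2."""
    known = set(name_ids)
    missing = sorted({h for h in hit_ids if h not in known})
    duplicated = sorted(i for i, n in Counter(name_ids).items() if n > 1)
    return missing, duplicated


# Ids as they would look after normalising file1 hits to the file2 form.
hits = ["gi_15607202_ref_NP_214574.1_", "gi_999_ref_XX_000000.1_"]  # second id is made up
names = ["gi_15607202_ref_NP_214574.1_", "gi_15607202_ref_NP_214574.1_"]

missing, duplicated = check_ids(hits, names)
print("missing:", missing)
print("duplicated:", duplicated)
```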

Answer 3

A step-by-step solution:

Extract only the hit GIs from file1:

cat file1 | awk '{print $3}' | sed 's/Hit=//g' > file1-gi

Remove the leading '# >' (the first four characters) from file2:
```
sed 's/^....//g' file2 > file2_1
```
Remove redundancy in file2 (if any):
```
sort -u file2_1 > file2_2
```

Use awk's system command to grep the name for each GI:

cat file1-gi | awk '{system ("grep "$1" file2_2")}' >> file1-gi-name

Print the first 3 columns of file1:
```
cut -d" " -f-3 file1 > file1_1
```
Paste the two files together:
```
paste file1_1 file1-gi-name > output
```
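
One thing to watch with the grep step: grep matches substrings, so a pattern such as gi_15607202 would also match a longer id that starts with the same digits, and the pasted files would then fall out of step. A minimal Python sketch of a stricter lookup, which requires the id to be followed by an underscore at the start of the line, is shown below. `find_name` is a hypothetical helper, and the second sample line is made up to show the collision.

```python
def find_name(gi, name_lines):
    """Return the lines whose id is exactly `gi`: `gi` followed by '_' at line start."""
    prefix = gi + "_"
    return [ln for ln in name_lines if ln.startswith(prefix)]


names = [
    "gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]",
    "gi_156072021_ref_XX_000000.1_ made-up entry to show the collision",
]

# Plain substring matching, which is what an unanchored grep does.
substring_hits = [ln for ln in names if "gi_15607202" in ln]
# Anchored matching returns only the intended entry.
exact_hits = find_name("gi_15607202", names)

print(len(substring_hits), len(exact_hits))
```

With grep itself, the same effect could be had by anchoring the pattern, for example by matching the id together with its trailing underscore at the beginning of the line.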
