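I have a list of queries and hit GIs in one file (file1). I have another file (file2) that contains the full hit names. I want to replace each hit GI in file1 with the full hit name from file2. Each hit must be replaced by the file2 entry with the same GI, next to its respective Query.
file1: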
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_
file2:
1 >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2 >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3 >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4 >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5 >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
Desired output:
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
Answer 0 (score: 0)
If I run:
file1=file1.txt; file2=$(cat file2.txt|sed -e "s/>gi/Query=gi/g"|sed -e "s/_ref_/ ref_/g");IFS='\n';echo $file2| awk 'NR==FNR { _[$2]=$2; f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 } NR!=FNR { if(_[$2] != "") print $0" "f1_line[key]}' - $file1
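To explain what it does, I turned it into a script; its usage is shown below. The script falls back to the default file name file1.rasta, so it needs the files passed as arguments: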
./run.sh
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following:
file1=file1.rasta
file2=file2.rasta
does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------
Running it:
./run.sh ./file1.fasta ./file2.fasta
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]
The bash script run.sh, which is the one-liner above broken down with explanations:
#!/bin/bash
function line() {
echo -e "-------------------------------------------------------------------------------"
}
function usage() {
line;
echo "usage:"
echo $0 file1.fasta file2.fasta is the same as line below
echo $0 ./file1.fasta ./file2.fasta
echo -- This is if files are elsewhere
echo $0 /path/to/file1.fasta /path/to/file2.fasta
line;
}
file1=$1;
file2=$2;
if [ $# -lt 2 ]; then
# Set file1 variable as filename file1.fasta
# ensure this file exists in current path
# otherwise:
# file1=/path/to/file1.fasta
file1=file1.rasta;
# Set file2 variable as filename file2.fasta
# ensure this file exists in current path
# otherwise:
# file2=/path/to/file1.fasta
file2=file2.rasta;
line;
echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
line;
fi
# Check we have both files whether its variables or if not variables
# matches defined files
if [ ! -f $file1 ] || [ ! -f $file2 ]; then
line;
echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
line
usage
exit 2;
fi
# Define file 2 variable which cats file2.fasta again like above ensure
# the file2.fasta can be catted from this path, it pipes it into sed and changes:
# '>gi' to 'Query=gi' and also changes '_ref_' to ' ref_'
# this now matches the same pattern as file1
cfile2=$(sed -e "s/>gi/Query=gi/g" -e "s/_ref_/ ref_/g" $file2);
# Set the internal field separator to \n which is the output of variable file2
IFS='\n';
# debug enable this if you now want to see manipulated file2
# echo $cfile2
# Echo cfile2 and pipe it into awk, which reads it as the first input ("-")
# and reads file1 as the second input.
# While reading the first input, record $2 and capture the text that follows
# the ".{digit}_{space}" pattern, which is where the description from file2 starts.
# While reading file1, if $2 was seen in the first input, print $0 (the whole
# file1 line) followed by the captured description.
echo $cfile2| awk 'NR==FNR {
_[$2]=$2;
if( match($0, /\.[0-9]\_ /)) {
var1=substr($0, RSTART+3);
}
}
NR!=FNR {
if(_[$2] != "") print $0" "var1
}' - $file1
## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would
## ensure it captures the entire tag from file2
##echo $cfile2| awk 'NR==FNR {
## _[$2]=$2;
## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10
## }
## NR!=FNR {
## if(_[$2] != "") print $0" "f1_line[key]
## }' - $file1
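The sample output above ends every row with the same description, because the awk program keeps one `var1` that is overwritten for each line of file2, so only the last value is printed. Storing the description per hit ID avoids that. Below is a minimal Python sketch of that idea, not a drop-in replacement for run.sh. It assumes each file2 line contains `>` followed by the ID and a space, and each file1 line ends with `Hit=<gi> <ref>`. The function names `load_descriptions` and `append_descriptions` are illustrative only.

```python
def load_descriptions(header_lines):
    """Map 'gi_X ref_Y_' (the spelling used in file1) to the file2 description."""
    table = {}
    for line in header_lines:
        _, marker, rest = line.strip().partition(">")
        if not marker:
            continue  # no '>' on this line, so it is not a header
        ident, _, desc = rest.partition(" ")
        # Same idea as the sed step: turn '_ref_' into ' ref_' so the key
        # is spelled the way file1 spells it.
        table[ident.replace("_ref_", " ref_", 1)] = desc
    return table


def append_descriptions(hit_lines, table):
    """Append to each line the description that belongs to its own Hit id."""
    out = []
    for line in hit_lines:
        line = line.rstrip("\n")
        _, marker, hit = line.partition("Hit=")
        desc = table.get(hit.strip()) if marker else None
        # Lines with an unknown hit are passed through unchanged.
        out.append(line if desc is None else f"{line} {desc}")
    return out


# Two rows taken from the question's sample data.
headers = [
    ">gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]",
    ">gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]",
]
hits = [
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_",
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_",
]
result = append_descriptions(hits, load_descriptions(headers))
for row in result:
    print(row)
```

With a dictionary, each row gets the description of its own hit, regardless of the order of the lines in file2.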
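Answer 1 (score: 0)
One of the quickest (and most naive) solutions is to use Python's re module to match patterns in the strings. I wrote an example of how to do it (you will need to do some checking to make sure the results are correct...):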
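import re

# Placeholder paths; adjust them to your own files.
f1path, f2path, f3path = "file1", "file2", "output"

names = {}
with open(f2path) as file2:
    for line in file2:
        # A header looks like ">gi_..._ref_..._ description"
        m = re.search(r">(\S+)\s+(.*)", line)
        if m:
            names[m.group(1)] = m.group(2).strip()

with open(f1path) as file1, open(f3path, "w") as file3:
    for line in file1:
        m = re.search(r"Hit=", line)
        if not m:
            continue
        # "gi_X ref_Y_" in file1 corresponds to "gi_X_ref_Y_" in file2
        hit_id = line[m.end():].strip().replace(" ", "_")
        if hit_id in names:
            # Keep the Query part and write the full hit name after it
            file3.write(line[:m.start()] + "Hit=" + hit_id + " " + names[hit_id] + "\n")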
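The idea is to first extract the IDs and their names from file2 and add them to a dict (the ID is the key, the name is the value). Then read the first file line by line, extract the ID from the hit, and look it up in the dict. If it matches, write the result to a third file, or do whatever else you like with it.
Answer 2 (score: 0)
A step-by-step solution:
Extract only the hit GIs from file1: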
cat file1 | awk '{print $3}' | sed 's/Hit=//g' > file1-gi
Remove the leading "# >" from file2:
sed 's/^....//g' file2 > file2_1
Remove redundancy in file2 (if any):
sort file2_1 | uniq > file2_2
Use awk's system command to grep the name for each GI:
cat file1-gi | awk '{system ("grep "$1" file2_2")}' >> file1-gi-name
Print the first 3 columns of file1:
cut -d" " -f-3 file1 > file1_1
Paste the two files together:
paste file1_1 file1-gi-name > output
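The grep and paste steps only line up if every GI matches exactly one line of file2_2. grep matches substrings, and paste pairs lines purely by position, so a GI with no match (or with several) shifts every later row. Below is a rough Python sketch of the same steps with an exact lookup and exactly one output row per input row. The function name `paste_names`, the tab separator, and the second header and second hit in the sample are made up for illustration.

```python
def paste_names(hit_lines, header_lines):
    """Pair each file1 row with its file2 entry, one output row per input row."""
    by_gi = {}
    for line in header_lines:
        _, marker, entry = line.strip().partition(">")
        if marker:
            gi = entry.split("_ref_", 1)[0]  # e.g. 'gi_15607202'
            by_gi.setdefault(gi, entry)      # first entry wins, duplicates are dropped
    rows = []
    for line in hit_lines:
        fields = line.split()
        # Position of the Hit= field; rows without one are passed through.
        idx = next((i for i, f in enumerate(fields) if f.startswith("Hit=")), None)
        if idx is None:
            rows.append(line.rstrip("\n"))
            continue
        gi = fields[idx][len("Hit="):]
        # Exact dictionary lookup: 'gi_15607202' cannot match 'gi_156072021'.
        # A missing GI yields an empty name, so later rows stay aligned.
        rows.append(" ".join(fields[: idx + 1]) + "\t" + by_gi.get(gi, ""))
    return rows


headers = [
    ">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]",
    ">gi_156072021_ref_XX_000000.1_ made-up entry with a longer GI",  # hypothetical
]
hits = [
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_",
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_999 ref_ZZ_",  # hypothetical, no match
]
rows = paste_names(hits, headers)
for row in rows:
    print(row)
```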