I need help with a file-processing task. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1(matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2(multiple_hits.list):
ID1
ID2
ID3
....
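What I want to do is match the second column (ID2, ID4, etc.) against the list of IDs stored in multiple_hits.list, then create a new matched_sequences file that looks like the original but excludes every ID found in multiple_hits.list (about 60 out of 1000). So far I have: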
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
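I get the following error: -bash: read: `matched_sequences.list': not a valid identifier. Many thanks in advance!
Expected output (new_matched_sequences.list):
Same as INPUT FILE 1, but with every entry whose ID appears in multiple_hits.list excluded.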
Answer 0 (score: 1)
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
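BEGIN {
    # The last command-line argument is the exclusion list. Decrementing
    # ARGC keeps awk from also reading it as a main input file.
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    # Paragraph mode: blank-line-separated records, one line per field.
    RS = ""
    FS = "\n"
    ORS = "\n\n"
}
{
    # Strip everything up to the last space of the header line, leaving the ID.
    id = chomp($1)
    sub(/^.* /, "", id)
}
# Print the record unless its ID is in the exclusion list.
!(id in a)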
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
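Because the script sets RS = "", it treats the sequence file as blank-line-separated records. If the file has no blank lines between entries, awk reads the whole file as a single record. Here is a minimal sketch of how paragraph mode splits input, using made-up sample entries (the IDs and sequences are hypothetical):

```shell
#!/bin/sh
# Hypothetical sample: two entries separated by one blank line.
records=$(printf '%s\n' '>P001 ID1' 'AAAA' '' '>P002 ID2' 'CCCC' |
    awk 'BEGIN { RS = ""; FS = "\n" }
         # One record per blank-line-separated block; $1 is its header line.
         { print NR ": header=" $1 ", lines=" NF }')

printf '%s\n' "$records"
```

Each block is reported as one record with two fields: the header line and the sequence line.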
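Answer 1 (score: 0)
A shorter awk answer is possible: a small script that first reads the file containing the IDs to exclude, then reads the file containing the sequences. The script is as follows (the comments make it long; in fact only three lines do the work):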
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
If you name this script exclude.awk, you would invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
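Below is a rough end-to-end sketch of the same idea, condensed to three lines of awk. It uses made-up sample data (the IDs, sequences, and temporary directory are hypothetical) and redirects the result to new_matched_sequences.list. The variable name keep is my own choice; it plays the role of grab_flag.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' '>P001 ID1' 'AAAA' '>P002 ID2' 'CCCC' '>P003 ID3' 'GGGG' \
    > "$workdir/matched_sequences.list"
printf '%s\n' 'ID2' > "$workdir/multiple_hits.list"

# First file: remember every ID to exclude.
# Second file: on each header line decide whether to keep the entry,
# then print lines only while the keep flag is set.
awk 'FNR == NR { hits[$1] = 1; next }
     /^>/      { keep = !($2 in hits) }
     keep' "$workdir/multiple_hits.list" "$workdir/matched_sequences.list" \
    > "$workdir/new_matched_sequences.list"

cat "$workdir/new_matched_sequences.list"
```

Testing `$2 in hits` instead of `hits[$2] == 1` avoids creating an empty array entry for every ID that is looked up. The behavior is otherwise the same.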