How do I match lists of strings in two different files using a loop?

Time: 2014-07-29 20:31:14

Tags: unix file-io awk

I need help with a file-processing task. I have two files (matched_sequences.list and multiple_hits.list).

INPUT FILE 1(matched_sequences.list):

>P001 ID
 ABCD .... (very long string of characters)

>P002 ID
 ABCD .... (very long string of characters)

>P003 ID
ABCD ... ( " " " " )

INPUT FILE 2(multiple_hits.list):

ID1
ID2
ID3
....

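What I want to do is match the second column (ID2, ID4, etc.) against the list of IDs stored in multiple_hits.list, and then create a new matched_sequences file that is like the original but excludes every ID found in multiple_hits.list (about 60 out of 1000). So far I have: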

#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')

while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done

I get the following error: -bash: read: `matched_sequences.list': not a valid identifier. Many thanks in advance!

Expected output (new_matched_sequences.list):

Same as INPUT FILE 1, with every ID that appears in multiple_hits.list excluded.

2 answers:

Answer 0: (score: 1)

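#!/usr/bin/awk -f
# Strip leading and trailing blanks (and a trailing carriage return) from s.
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    # Take the last command-line argument as the ID list and drop it from
    # the argument count, so awk does not also read it as an input file.
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    # Paragraph mode: records are blank-line-separated blocks, fields are lines.
    RS = ""
    FS = "\n"
    ORS = "\n\n"
}
{
    # The ID is the last space-separated word of the header line.
    id = chomp($1)
    sub(/^.* /, "", id)
}
# Print the record only if its ID is not in the exclusion list.
!(id in a)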

Usage:

awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
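
To see what setting RS and FS this way does, here is a small sketch of awk's paragraph mode on its own. The sample records and the names `workdir`, `sample_sequences.list` and `summary` are invented for illustration and are not part of the answer; it also assumes, as the script does, that records are separated by blank lines.

```shell
#!/bin/sh
# Hypothetical sample data: the IDs and sequences are made up for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample_sequences.list" <<'EOF'
>P001 ID1
ABCD

>P002 ID2
EFGH

>P003 ID3
IJKL
EOF

# RS=""   : blank-line-separated blocks become records (paragraph mode).
# FS="\n" : each line of a block becomes one field, so $1 is the header line.
summary=$(awk 'BEGIN { RS = ""; FS = "\n" }
               { print NR ": header=[" $1 "] fields=" NF }' "$workdir/sample_sequences.list")
printf '%s\n' "$summary"
rm -rf "$workdir"
```

Each header plus its sequence arrives as a single record, which is why the script can decide to keep or drop a whole entry with one test on the header's ID.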

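Answer 1: (score: 0)

A shorter awk answer is possible: a small script that first reads the file containing the IDs to exclude, then reads the file containing the sequences. The script is below (the comments make it look long; in fact only three lines do the work):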

BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')

# While reading the first file (multiple_hits.list), record each ID to exclude.
FNR == NR { hits[$1] = 1 ; next }
# Otherwise we are reading the second file, containing the sequences.
# On each header line, set the flag that says whether to output this sequence.
/^>/ { grab_flag = !($2 in hits) }
grab_flag == 1 { print }

If you name this script exclude.awk, you would invoke it like this:

awk -f exclude.awk multiple_hits.list matched_sequences.list
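
As for the error in the question: `read` expects the name of a shell variable, not a file name, and the file to read is supplied by redirecting it into the loop. The following is a rough pure-shell sketch of the same filtering, assuming each header has the form `>NAME ID` with a single space between the two columns. The sample data and the names `workdir`, `seqs`, `hits` and `out` are hypothetical.

```shell
#!/bin/sh
# Sketch only. The file contents below are hypothetical sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
seqs="$workdir/matched_sequences.list"
hits="$workdir/multiple_hits.list"
out="$workdir/new_matched_sequences.list"

printf '%s\n' '>P001 ID1' 'ABCD' '' '>P002 ID2' 'EFGH' '' '>P003 ID3' 'IJKL' > "$seqs"
printf '%s\n' 'ID2' > "$hits"

keep=1
# `read` is given a variable NAME (line); the file comes from the
# redirection after `done`, not from an argument to `read`.
while IFS= read -r line; do
    case $line in
        '>'*)
            # Assumes ">NAME ID" with a single space between the columns.
            id=${line#* }
            id=${id%% *}
            # grep -qxF: quiet, whole-line, fixed-string match against the ID list.
            if [ -n "$id" ] && grep -qxF -- "$id" "$hits"; then keep=0; else keep=1; fi
            ;;
    esac
    if [ "$keep" -eq 1 ]; then printf '%s\n' "$line"; fi
done < "$seqs" > "$out"

cat "$out"
```

This runs one `grep` per header line, so for large files the awk scripts above are likely the better choice; the loop is mainly useful for seeing how `read` and redirection fit together.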