Question

I need help with a file-processing task. I have two files (matched_sequences.list and multiple_hits.list).

INPUT FILE 1 (matched_sequences.list):

>P001 ID
 ABCD .... (very long string of characters)

>P002 ID
 ABCD .... (very long string of characters)

>P003 ID
ABCD ... ( " " " " )

INPUT FILE 2 (multiple_hits.list):

ID1
ID2
ID3
....

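What I want to do is match the second column (ID2, ID4, etc.) against the list of IDs stored in multiple_hits.list, then create a new file like the original matched_sequences file but without any of the IDs found in multiple_hits.list (about 60 out of 1000). So far I have: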

#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')

while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done

I get the following error:

-bash: read: `matched_sequences.list': not a valid identifier

Many thanks in advance!

Expected output (new_matched_sequences.list):

Same as INPUT FILE 1, with every ID listed in multiple_hits.list excluded.

Answer 1

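#!/usr/bin/awk -f
# Strip leading and trailing whitespace (including a trailing CR).
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    # Take the last argument (the ID list) off the command line and read it here.
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    # Paragraph mode: blank-line-separated blocks are records, lines are fields.
    RS = ""
    FS = "\n"
    ORS = "\n\n"
}
{
    # Reduce the header line to its last space-separated word, the ID.
    id = chomp($1)
    sub(/^.* /, "", id)
}
# Print the record unless its ID is in the exclusion list.
!(id in a)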

Usage:

awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
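
The script relies on awk's paragraph mode: with RS set to the empty string, each blank-line-separated block becomes one record, and with FS = "\n" each line of the block becomes a field, so $1 is the header line. A minimal sketch of just that step follows. The helper name list_record_ids and the sample IDs are invented for illustration and are not part of the answer.

```shell
# list_record_ids is a hypothetical helper, not part of the answer's script.
# It reads blank-line-separated records from stdin and prints, for each one,
# its record number, the last word of its header line, and its line count.
list_record_ids() {
    awk '
        BEGIN { RS = ""; FS = "\n" }     # paragraph mode: one record per block
        {
            n = split($1, words, " ")    # $1 is the header line, e.g. ">P001 ID1"
            print NR ": " words[n] " (" NF " lines)"
        }
    '
}

# Made-up sample input: three records separated by blank lines.
printf '>P001 ID1\nABCD\n\n>P002 ID2\nEFGH\n\n>P003 ID3\nIJKL\n' | list_record_ids
```

On this sample, each record should be reported with its ID and a count of two lines (header plus sequence), which is the information the answer's script uses to decide whether to print a record.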

Answer 2

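A shorter awk answer is possible: a small script that first reads the file containing the IDs to exclude, then reads the file containing the sequences. The script follows (the comments make it look long; in fact only three lines do real work):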

BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')

FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }

If you name this script exclude.awk, you would invoke it like this:

awk -f exclude.awk multiple_hits.list matched_sequences.list
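
For a quick check on small data, the same two-file idea can be wrapped in a shell function. This is a compact sketch, not the answer's exact script: it tests membership with `in` instead of comparing the stored value to 1, and the helper name exclude_hits, the temporary file names, and the sample IDs are all made up for illustration.

```shell
# exclude_hits HITS_FILE SEQ_FILE  (hypothetical helper name)
# Prints every record of SEQ_FILE whose header ID is not listed in HITS_FILE.
exclude_hits() {
    awk '
        FNR == NR { hits[$1]; next }        # first file: remember each ID to exclude
        /^>/      { keep = !($2 in hits) }  # header line: decide for the whole record
        keep                                # print lines of records being kept
    ' "$1" "$2"
}

# Made-up sample data in a temporary directory.
tmp=$(mktemp -d)
printf '>P001 ID1\nABCD\n\n>P002 ID2\nEFGH\n\n>P003 ID3\nIJKL\n' > "$tmp/seqs.list"
printf 'ID2\n' > "$tmp/hits.list"

exclude_hits "$tmp/hits.list" "$tmp/seqs.list"   # the P002 record is dropped
rm -rf "$tmp"
```

One caveat of the FNR == NR idiom: it assumes the first file is non-empty. With an empty ID list, awk would treat the sequences file as the ID list and print nothing.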

How do I match lists of strings in two different files using a loop structure?
