Question

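This is surely a trivial task for awk or something similar, but it has had me scratching my head this morning. I have a file with a format similar to this: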

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line for each distinct peptide value in column 2, meaning the input above would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I have tried so far, but clearly neither attempt does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file |sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: peptides that are substrings of other peptides need to be treated as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)

Answer 1

Just use sort:

sort -k 2,2 -u file

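-u removes duplicate entries (as you wanted), and -k 2,2 makes field 2 the only sort key, so the rest of the line is ignored when checking for duplicates.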
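As a quick check of the sort approach, here is a minimal sketch that runs it on the sample lines from the question. The temporary file and the variable name `deduped` are assumptions made for illustration, and `LC_ALL=C` is set only to make the ordering reproducible. Note that the result comes back ordered by the peptide column rather than in input order, and which of several duplicate lines survives depends on the sort implementation.

```shell
#!/bin/sh
# Hypothetical temporary file holding the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Keep one line per distinct value of field 2; output is sorted by that field.
deduped=$(LC_ALL=C sort -u -k 2,2 "$sample")

printf '%s\n' "$deduped"
rm -f "$sample"
```

On this sample, VSSILEDKILSR sorts ahead of VSSILEDKTT, so the order differs slightly from the desired output in the question.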

Answer 2

One way using awk:

awk '!array[$2]++' file.txt

Result:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
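For readers unfamiliar with the idiom, the one-liner above can be spelled out long-hand. This is only a sketch of what it is doing: the array name `seen`, the variable `first_seen`, and the temporary file are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical temporary file holding the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Long-hand equivalent of: awk '!array[$2]++'
first_seen=$(awk '
{
    if (seen[$2] == 0) {   # an unset array entry compares equal to 0
        print              # first time this peptide appears: print the line
    }
    seen[$2]++             # record that this peptide has now been seen
}' "$sample")

printf '%s\n' "$first_seen"
rm -f "$sample"
```

Because the whole of field 2 is used as the array key, the match is exact: VSSILED and VSSILEDKILSR would be tracked as separate entries. Input order is preserved, and the first line seen for each peptide is the one that is kept.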

Answer 3

I would use Perl:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The -n switch processes the input line by line, and the -a switch splits each line into the @F array.

Answer 4

awk '{if($2==temp){next;}else{print}temp=$2}' your_file

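Tested as follows. Note that this only compares each line with the one before it, so it removes duplicates only when they are adjacent; the AIQLTGK line with count 10 still appears in the output below: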

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
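One possible way to make the previous-line comparison work on unsorted input is to sort by the peptide column first, so that equal values end up next to each other. The sketch below is an illustration under a few assumptions: the temporary file and the names `prev` and `sorted_unique` are hypothetical, and the `-s` (stable) option is supported by GNU and BSD sort but is not required by POSIX.

```shell
#!/bin/sh
# Hypothetical temporary file holding the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Stable sort on field 2 groups equal peptides together while keeping their
# relative input order; awk then prints a line only when field 2 changes.
sorted_unique=$(LC_ALL=C sort -s -k 2,2 "$sample" |
    awk '$2 != prev { print } { prev = $2 }')

printf '%s\n' "$sorted_unique"
rm -f "$sample"
```

As with the plain sort answer, the output is ordered by peptide rather than by position in the input.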

Output the whole line once for each unique value of a column (Bash)
