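OK, I'm new to this kind of thing, so please bear with me.
I have two files:
search_results_accessions.txt
is a list of identifiers, one per line. It looks like this (note that not all identifiers start with "NP_"):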
$ more search_results_accessions.txt
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2
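prot.accession2taxid.txt
is a file that lists every identifier (plus many more that are not in my list) together with its corresponding taxid. It looks like this (the third column contains the taxids):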
$ more prot.accession2taxid
accession accession.version taxid gi
APZ74649 APZ74649.1 36984 1137646701
AQT41667 AQT41667.1 1686310 1150388099
WP_080502060 WP_080502060.1 95486 1169627919
ASF53620 ASF53620.1 492670 1211447116
ASF53621 ASF53621.1 492670 1211447117
ASF53622 ASF53622.1 492670 1211447118
ASF53623 ASF53623.1 492670 1211447119
ASF53624 ASF53624.1 492670 1211447120
ASF53625 ASF53625.1 492670 1211447121
ASF53626 ASF53626.1 492670 1211447122
ASF53627 ASF53627.1 492670 1211447123
ASF53628 ASF53628.1 492670 1211447124
ASF53629 ASF53629.1 492670 1211447125
ASF53630 ASF53630.1 492670 1211447126
ASF53631 ASF53631.1 492670 1211447127
ASF53632 ASF53632.1 492670 1211447128
ASF53633 ASF53633.1 492670 1211447129
APZ74650 APZ74650.1 36984 1137646703
APZ74651 APZ74651.1 36984 1137646705
APZ74652 APZ74652.1 36984 1137646707
APZ74653 APZ74653.1 36984 1137646709
APZ74654 APZ74654.1 36984 1137646711
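The fields are tab-separated.
I need to get the taxid for each accession listed in search_results_accessions.txt.
I'm on a Unix system, and I would prefer to use the command line or Python if possible.
Answer 0 (score: 1)
Here is a solution using Python and the pandas module.
I modified your files slightly to make it work (I added a column name at the top of the first file and replaced the runs of multiple tabs with a single tab in the second file). Suppose you have the following file1.txt: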
accession.version
NP_000020.1
NP_000026.2
NP_000027.2
NP_000029.2
NP_000034.1
NP_000042.3
NP_000056.2
NP_000063.2
NP_000065.1
NP_000068.1
NP_000088.3
NP_000112.1
NP_000117.1
NP_000147.1
NP_000156.1
NP_000167.1
NP_000205.1
NP_000228.1
NP_000241.1
NP_000305.3
NP_000347.2
NP_000354.4
NP_000370.2
and file2.txt:
accession accession.version taxid gi
APZ74649 APZ74649.1 36984 1137646701
AQT41667 AQT41667.1 1686310 1150388099
WP_080502060 WP_080502060.1 95486 1169627919
ASF53620 ASF53620.1 492670 1211447116
ASF53621 ASF53621.1 492670 1211447117
ASF53622 ASF53622.1 492670 1211447118
ASF53623 ASF53623.1 492670 1211447119
ASF53624 ASF53624.1 492670 1211447120
ASF53625 ASF53625.1 492670 1211447121
ASF53626 ASF53626.1 492670 1211447122
ASF53627 ASF53627.1 492670 1211447123
ASF53628 ASF53628.1 492670 1211447124
NP_000088 NP_000088.3 62163 3543665822
ASF53629 ASF53629.1 492670 1211447125
ASF53630 ASF53630.1 492670 1211447126
ASF53631 ASF53631.1 492670 1211447127
ASF53632 ASF53632.1 492670 1211447128
ASF53633 ASF53633.1 492670 1211447129
APZ74650 APZ74650.1 36984 1137646703
APZ74651 APZ74651.1 36984 1137646705
APZ74652 APZ74652.1 36984 1137646707
APZ74653 APZ74653.1 36984 1137646709
APZ74654 APZ74654.1 36984 1137646711
NP_000117 NP_000117.1 65683 3543634522
You can then do the following:
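import pandas as pd
# Both files are tab-separated and start with a header row
df1 = pd.read_csv('file1.txt', sep='\t')
df2 = pd.read_csv('file2.txt', sep='\t')
# Inner join on the shared 'accession.version' column
df = df1.merge(df2, on='accession.version')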
# accession.version accession taxid gi
# 0 NP_000088.3 NP_000088 62163 3543665822
# 1 NP_000117.1 NP_000117 65683 3543634522
If you are only interested in the taxids:
taxid = df.taxid
# 0 62163
# 1 65683
# Name: taxid, dtype: int64
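If you would rather not edit the input files, something along the following lines may work. This is only a sketch under a few assumptions: the accession list has no header row, the taxid file is tab-separated with the header shown in the question, and the taxid file may be too large to load in one go, so it is read in chunks. The function name lookup_taxids and the chunksize default are hypothetical choices, not part of the original answer.

```python
import io
import pandas as pd

def lookup_taxids(accessions_file, taxid_file, chunksize=1_000_000):
    """Return accession.version/taxid rows for the listed accessions."""
    # The accession list has no header, so name its single column ourselves
    # (the name is chosen to match the header of the taxid file).
    wanted = pd.read_csv(accessions_file, header=None,
                         names=["accession.version"])
    wanted_ids = set(wanted["accession.version"])
    pieces = []
    # Read the taxid table in chunks and keep only the two needed columns.
    reader = pd.read_csv(taxid_file, sep="\t",
                         usecols=["accession.version", "taxid"],
                         chunksize=chunksize)
    for chunk in reader:
        pieces.append(chunk[chunk["accession.version"].isin(wanted_ids)])
    hits = pd.concat(pieces, ignore_index=True)
    # Inner join so the result follows the order of the accession list.
    return wanted.merge(hits, on="accession.version", how="inner")

# Small in-memory demo; with real data, pass file paths instead.
accessions = io.StringIO("NP_000088.3\nNP_000117.1\nNP_000020.1\n")
taxids = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "APZ74649\tAPZ74649.1\t36984\t1137646701\n"
    "NP_000088\tNP_000088.3\t62163\t3543665822\n"
    "NP_000117\tNP_000117.1\t65683\t3543634522\n"
)
result = lookup_taxids(accessions, taxids, chunksize=2)
print(result)
```

Accessions that do not appear in the taxid file (NP_000020.1 in the demo) are simply dropped by the inner join; passing how="left" to merge would keep them with a missing taxid instead.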
Answer 1 (score: 0)
Here is an awk solution (you did say command line or Python):
awk 'NR==FNR {ids[$1]=1} NR>FNR && ($2 in ids) {print $2 "\t" $3}' accessions taxids
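Explanation:
We read the accessions file first, then the taxids file.
While reading the first file (NR==FNR), we add the value from the first column to the associative map ids.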
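The same two-pass idea can be sketched in plain Python, in case awk is not to your taste. This is a rough equivalent rather than a drop-in replacement: the function name lookup_taxids is hypothetical, and it assumes the identifiers in the list carry a version suffix (as in the question), so it matches them against the second column (accession.version) of the taxid file.

```python
import io

def lookup_taxids(accessions, taxids):
    """Yield (accession.version, taxid) pairs for the listed accessions.

    Both arguments are iterables of lines, for example open file objects.
    """
    # First pass: collect the wanted identifiers in a set
    # (this plays the role of the NR==FNR branch in the awk command).
    ids = {line.strip() for line in accessions if line.strip()}
    # Second pass: stream the taxid table and keep only matching rows.
    for line in taxids:
        fields = line.rstrip("\n").split("\t")
        # Index 1 holds accession.version, index 2 holds the taxid.
        if len(fields) >= 3 and fields[1] in ids:
            yield fields[1], fields[2]

# Small in-memory demo; with real data, pass open file objects instead.
accessions = io.StringIO("NP_000088.3\nNP_000117.1\nNP_000020.1\n")
taxids = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "APZ74649\tAPZ74649.1\t36984\t1137646701\n"
    "NP_000088\tNP_000088.3\t62163\t3543665822\n"
    "NP_000117\tNP_000117.1\t65683\t3543634522\n"
)
pairs = list(lookup_taxids(accessions, taxids))
for accession, taxid in pairs:
    print(accession, taxid, sep="\t")
```

Only the identifier list is held in memory; the taxid file is read line by line, which should keep memory use modest even if that file is very large.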