Question

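I'm looking for a programmatic way to get all the UniProt IDs (Swiss-Prot + TrEMBL) for a given species (e.g. all the UniProt IDs ending in _MOUSE).

One way to do this would be to unzip and parse the idmapping files that UniProt makes available for download.

Such files are available only for a very small subset of all the species represented in the UniProt DB. Therefore, this is not a general solution.

My question is: is there a general, and hopefully more efficient, way to do this? (By "more efficient" I basically mean that it does not require such unzipping and parsing.)

Basically I'm wondering whether uniprot.org supports a URL-based query in which I can specify some species identifier (e.g. MOUSE or 10090), and maybe also some field name like { {1}}, and whose response will be a list of all the UniProt IDs for that species.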

Answer 1

I haven't looked at the idmapping files you mention, but I use the following file to get the IDs for a given species: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speindex.txt

Then I parse it like this:

#!/usr/bin/env perl
use strict;
use warnings;

my $spec = shift;
my $re = quotemeta $spec;

my @ids =();
while (<>) {
  if (/$re/../^$/) {
    chomp;
    next if ($_ eq $spec);  # skip species line
    s/^\s+//;               # remove leading whitespace
    push @ids, split(/, ?/, $_);
  }
}

print $_."\n" foreach @ids;

Command line for 'Mus musculus (Mouse)':

script.pl "Mus musculus (Mouse)" speindex.txt
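For readers who prefer Python, the same idea can be sketched as below. This is only a rough port under the layout the Perl script expects: a line equal to the species string, followed by lines of comma-separated IDs, ended by a blank line. The function name ids_for_species is made up for illustration, and unlike the Perl version it requires an exact header match and stops after the first matching block.

```python
import re


def ids_for_species(lines, species):
    """Collect the IDs listed under an exact species header line."""
    ids = []
    in_block = False
    for raw in lines:
        line = raw.strip()
        if not in_block:
            # Assumption: the header line is exactly the species string.
            in_block = (line == species)
            continue
        if not line:
            break  # a blank line ends the species block
        # IDs are assumed to be separated by commas; drop empty pieces
        # left behind by a trailing comma.
        ids.extend(part for part in re.split(r",\s*", line) if part)
    return ids


# Typical use, assuming speindex.txt has been downloaded locally:
#   with open("speindex.txt") as handle:
#       for entry in ids_for_species(handle, "Mus musculus (Mouse)"):
#           print(entry)
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on a list of strings.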

I hope this helps... Paul

Answer 2

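You can do this with the REST API available at uniprot.org; see the FAQ on retrieving entries via queries.

Most of the time you will want to use the NCBI/UniProt taxonomy identifier rather than the species name, e.g. 10090 instead of "Mus musculus". Using the ID instead of a string is more likely to get you the right thing.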
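As a rough illustration of such a URL-based query, the sketch below only builds the request URL from a taxonomy ID; it does not contact the server. The base URL, the organism_id and reviewed:true query fields, and format=list are assumptions based on the current rest.uniprot.org service and differ from the older organism: / reviewed:yes syntax, so check them against the FAQ before relying on them. The helper name species_query_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL of the current UniProtKB streaming endpoint.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def species_query_url(taxon_id, reviewed_only=False):
    """Build a query URL for all entries of one taxonomy ID."""
    # int() rejects anything that is not a plain numeric taxonomy ID.
    query = f"organism_id:{int(taxon_id)}"
    if reviewed_only:
        # Assumed filter restricting the result to Swiss-Prot entries.
        query += " AND reviewed:true"
    # format=list is assumed to return one accession per line.
    return BASE_URL + "?" + urlencode({"query": query, "format": "list"})


print(species_query_url(10090))
```

The resulting URL can then be fetched with any HTTP client; building it with urlencode keeps the colon and spaces in the query properly escaped.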

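The species concept is getting more and more interesting these days, with more and more sequencing projects, so be aware of what you get and why.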

Answer 3

If you don't want to work with the flat files, you can use the BioServices Python package, which retrieves the information from the UniProt website:

from bioservices import UniProt
u = UniProt()
results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name", limit=2)
print(results)

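The results variable is a string that you need to parse. It contains the UniProt entry and the UniProt entry name for each hit. The command above retrieves only 2 entries, but if you remove the limit=2 argument you get all of them.

For example, to get all the entry names, type: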

results = u.search("organism:10090+and+reviewed:yes", columns="id,entry name")
entries = [x.split()[1] for x in results.strip().split("\n")[1:]]
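If you want both columns rather than just the entry names, a slightly more defensive parse might look like the sketch below. It assumes the search output is tab-separated with a single header row, as in the two-column request above; parse_uniprot_table is a hypothetical helper, and the sample rows are placeholders, not real UniProt records.

```python
import csv
import io


def parse_uniprot_table(text):
    """Turn tab-separated search output into (entry, entry_name) pairs."""
    reader = csv.reader(io.StringIO(text.strip()), delimiter="\t")
    next(reader, None)  # skip the header row, if there is one
    # Ignore malformed rows that do not have both columns.
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Placeholder rows standing in for the string returned by u.search(...).
sample = "Entry\tEntry name\nX00001\tABC1_MOUSE\nX00002\tABC2_MOUSE\n"
pairs = parse_uniprot_table(sample)

# Keep only entry names with the species suffix from the question.
mouse_names = [name for _, name in pairs if name.endswith("_MOUSE")]
print(mouse_names)
```

Splitting on tabs rather than on any whitespace keeps the parse correct if a requested column ever contains spaces.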

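Downloading the roughly 17,000 entries takes a few seconds. If you remove reviewed:yes, it takes about 30 seconds to a minute.

I hope this helps.

To install it with Python 2.7, just type: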

pip install bioservices

How can I query uniprot.org to get all UniProt IDs for a given species?
