urllib2.HTTPError Python

Time: 2013-02-12 06:57:19

Tags: python urllib2 biopython ncbi

I have a file with GI numbers and would like to get the FASTA sequences from NCBI.

from Bio import Entrez
import time
Entrez.email ="eigtw59tyjrt403@gmail.com"
f = open("C:\\bioinformatics\\gilist.txt")
for line in iter(f):
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1) # to make sure not many requests go per second to ncbi
f.close()

The script runs fine, but after a few sequences I suddenly get this error message:

Traceback (most recent call last):
  File "C:/Users/Ankur/PycharmProjects/ncbiseq/getncbiSeq.py", line 7, in <module>
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
  File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 139, in efetch
    return _open(cgi, variables)
  File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 455, in _open
    raise exception
urllib2.HTTPError: HTTP Error 500: Internal Server Error

Of course I could use http://www.ncbi.nlm.nih.gov/sites/batchentrez, but I am trying to build a pipeline and want this automated.

How do I keep NCBI from "kicking me off"?

3 Answers:

Answer 0 (score: 0):

I'm not familiar with the NCBI API, but my guess is that you are violating some kind of rate-limit rule (even with the "sleep(1)"), so your earlier requests work, but after a few requests the server notices you are hitting it too often and blocks you. This is a problem for you because your code has no error handling.

I suggest wrapping the data fetch in a try/except block so that the script waits longer and retries when it runs into a problem. If all else fails, write the id that caused the error to a file and continue (in case the id is somehow the culprit, perhaps causing the Entrez library to generate a bad URL).

Try changing your code to something like this (untested):

from urllib2 import HTTPError
from Bio import Entrez
import time

def get_record(_id):
    handle = Entrez.efetch(db="nucleotide", id=_id, retmode="xml")
    records = Entrez.read(handle)
GI "+line.rstrip()">
    print ">GI "+_id.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1) # to make sure not many requests go per second to ncbi

Entrez.email ="eigtw59tyjrt403@gmail.com"
f = open("C:\\bioinformatics\\gilist.txt")
for id in iter(f):
    try:
        get_record(id)
    except HTTPError:
        print "Error fetching", id
        time.sleep(5) # we have angered the API! Try waiting longer?
        try:
            get_record(id)
        except:
            with open('error_records.bad', 'a') as bad_file:  # avoid shadowing the input file handle f
                bad_file.write(str(id) + '\n')
            continue
f.close()
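The nested try/except above can be generalized into a retry loop with a growing delay between attempts. A minimal sketch, not from the answer; the `retry` helper name and its parameters are my own, and the flaky function only simulates an HTTP 500:

```python
import time

def retry(func, attempts=3, initial_delay=1, backoff=2):
    """Call func(); on exception, wait and try again with a growing delay.
    Re-raises the last exception if every attempt fails."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, propagate the error
            time.sleep(delay)
            delay *= backoff

# Example: a flaky fetch that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("simulated HTTP 500")
    return "record"

print(retry(flaky, attempts=5, initial_delay=0))  # prints "record" after 2 failures
```

In the answer's setting, `func` would be a closure around `get_record(id)`; the id is still logged to a file if all attempts fail.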

Answer 1 (score: 0):

There is a workaround with efetch: you can split your list into batches of 200 (gut feeling says that is a reasonable batch size) and send all of those IDs to efetch in a single request.

First, this is much faster than sending 200 separate queries. Second, it also effectively complies with the "3 queries per second" rule, because the processing time for each query is longer than 0.33 seconds, but not too long.

However, you do need a mechanism to catch the "bad apples". Even if just one of your 200 IDs is bad, NCBI returns 0 results. In other words, NCBI returns results if and only if all 200 IDs are valid.

In the bad-apple case, I iterate over the 200 IDs one by one and ignore the bad apple. This bad-apple scenario also tells you not to make the batches too large, precisely because of bad apples. If a batch is large then, first, the chance of it containing a bad apple is greater, i.e. you more often have to iterate over the whole batch, and second, the larger the batch, the more individual items you have to iterate over.

I use the following code to download CAZy proteins, and it works well:

import urllib2


prefix = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id="
id_per_request = 200


def getSeq (id_list):
    url = prefix + id_list[:len(id_list)-1]

    temp_content = ""
    try:
        temp_content += urllib2.urlopen(url).read()
    # if there is a bad apple, try the IDs one by one
    except:
        for id in id_list[:len(id_list)-1].split(","):
            url = prefix + id
            #print url
            try:
                temp_content += urllib2.urlopen(url).read()
            except:
                #print id
                pass
    return temp_content


content = ""
counter = 0
id_list = ""

#define your accession numbers first, here it is just an example!!

accs = ["ADL19140.1","ABW01768.1","CCQ33656.1"]
for acc in accs:

    id_list += acc + ","
    counter += 1

    if counter == id_per_request:
        counter = 0
        content += getSeq(id_list)
        id_list = ""

if id_list != "":
    content += getSeq(id_list)
    id_list = ""


print content
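The counter-based batching above can also be written with list slicing. A minimal sketch of just the batching step; the `batches` helper name is my own, and in a real run each joined string would be appended to the efetch URL exactly as in the code above:

```python
def batches(ids, size=200):
    """Yield comma-joined ID strings, at most `size` IDs per batch."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])

# Same example accessions as above; size=2 just to show the splitting.
accs = ["ADL19140.1", "ABW01768.1", "CCQ33656.1"]
for id_list in batches(accs, size=2):
    print(id_list)
# ADL19140.1,ABW01768.1
# CCQ33656.1
```

This avoids the trailing-comma bookkeeping (`id_list[:len(id_list)-1]`) in the original, since `",".join()` never emits a trailing separator.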

Answer 2 (score: 0):

This is a "normal" transient failure of the Entrez API, which can happen even when you have applied all the Entrez API rules. The Biopython documentation covers one way to handle it in this section:

Occasionally you will get intermittent errors from Entrez, HTTPError 5XX; we use a try/except pause-retry block to address this. For example,

# This assumes you have already run a search as shown above,
# and set the variables count, webenv, query_key

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print("Going to download record %i to %i" % (start+1, end))
    attempt = 0
    while attempt < 3:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db="nucleotide",
                                         rettype="fasta", retmode="text",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key,
                                         idtype="acc")
            break  # success, stop retrying
        except HTTPError as err:
            if 500 <= err.code <= 599:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

So you don't need to feel guilty about this error; just catch it.