Question

Recently I have been using Biopython to extract some abstracts from PubMed. My code is written in Python 3 and looks like this:

from Bio import Entrez

Entrez.email = "myemail@example.com"    # Always tell NCBI who you are


def get_number():    #Get the total number of abstract available in Pubmed
    handle = Entrez.egquery(term="allergic contact dermatitis ")
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"]=="pubmed":
            return int(row["Count"])


def get_id():    #Get all the ID of the abstract available in Pubmed
    handle = Entrez.esearch(db="pubmed", term="allergic contact dermatitis ", retmax=200)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    return idlist

idlist = get_id()

for ids in idlist:    #Download the abstract based on their ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")    # Retmode Can Be txt / json / xml / csv
    f = open("{}.txt".format(ids), "w")    # Create a TXT file with the name of ID
    f.write(handle.read())    #Write the abstract to the TXT file

I expected to get 200 abstracts, but only three or four are downloaded successfully. Then this error occurs:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xc5 in position 288: illegal multibyte sequence

It seems that handle.read() has a problem with abstracts that contain certain symbols or words. I tried using print to find out the class of handle:

handle = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text")
print(handle)

The result is:

<_io.TextIOWrapper encoding='cp950'>

I have searched many pages for a solution, but none of them worked. Can anyone help?

Answer 1

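Your code runs fine for me, so this is an encoding problem on your side. You can open the file in byte mode and encode the text as UTF-8. Try a workaround like this: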

for ids in idlist:    # Download each abstract by its ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")
    with open("{}.txt".format(ids), "wb") as f:    # Binary mode: the file is named after the ID and receives encoded bytes
        f.write(handle.read().encode("utf-8"))    # Encode the abstract explicitly as UTF-8
    handle.close()
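
Note that the traceback in the question is a decode error, so it is probably raised inside handle.read() itself, before anything is written. If the workaround above still fails, one option is to skip the handle's cp950 wrapper and decode the raw bytes yourself. The following is only a minimal sketch: read_as_utf8 and save_text are hypothetical helper names (not part of Biopython), the handle is simulated with io.TextIOWrapper so the example runs offline, and it assumes the server sends UTF-8.

```python
import io
import os
import tempfile


def read_as_utf8(handle):
    """Return the full content of a fresh handle decoded as UTF-8 (hypothetical helper)."""
    # A text wrapper exposes its underlying binary stream as .buffer;
    # reading from it bypasses the wrapper's own (possibly wrong) codec.
    raw = getattr(handle, "buffer", handle)
    data = raw.read()
    if isinstance(data, str):    # Already decoded text, nothing more to do
        return data
    return data.decode("utf-8")    # Assumption: the payload is UTF-8


def save_text(path, text):
    """Write text as UTF-8 regardless of the platform default encoding (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Stand-in for an efetch handle that was wrapped with the wrong codec.
# "\u0141" encodes to the bytes 0xC5 0x81 in UTF-8.
payload = "Case report by \u0141ukasz: allergic contact dermatitis".encode("utf-8")
wrong_handle = io.TextIOWrapper(io.BytesIO(payload), encoding="cp950")

text = read_as_utf8(wrong_handle)
out_path = os.path.join(tempfile.mkdtemp(), "12345.txt")    # Made-up ID for the file name
save_text(out_path, text)
```

With a real request, the same helpers would be used as text = read_as_utf8(Entrez.efetch(...)) followed by save_text("{}.txt".format(ids), text). The helper must be called before anything else has read from the handle.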
