I'm having trouble scraping data from the NCBI website with Beautiful Soup

Posted: 2021-04-23 17:43:55

Tags: web-scraping beautifulsoup bioinformatics ncbi

I cannot for the life of me figure out how to use Beautiful Soup to scrape the isolation source information from a page like this one: https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/

I have been trying to check whether the tag exists, but it keeps reporting that it doesn't, even though I know it does. If I can't even verify that it exists, I don't know how to scrape it.

Thanks!

2 Answers:

Answer 0 (score: 1)

You shouldn't scrape NCBI when the NCBI E-utilities web service is available.

wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JOKX00000000.2&rettype=gb&retmode=xml" | xmllint --xpath '//GBQualifier[GBQualifier_name="isolation_source"]/GBQualifier_value/text()' - && echo

Type II sourdough
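The same efetch XML can also be handled in Python with only the standard library. A minimal sketch that mirrors the xmllint XPath above; the inlined XML is a trimmed sample of the efetch response shape for illustration (the live response contains many more qualifiers):

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the efetch XML; the live response from efetch.fcgi
# uses the same GBQualifier_name / GBQualifier_value structure.
sample_xml = """
<GBSet>
  <GBSeq>
    <GBQualifier>
      <GBQualifier_name>strain</GBQualifier_name>
      <GBQualifier_value>TMW1.112</GBQualifier_value>
    </GBQualifier>
    <GBQualifier>
      <GBQualifier_name>isolation_source</GBQualifier_name>
      <GBQualifier_value>Type II sourdough</GBQualifier_value>
    </GBQualifier>
  </GBSeq>
</GBSet>
"""

def isolation_source(xml_text):
    # Find the GBQualifier whose name is "isolation_source" and
    # return its value, like the xmllint XPath in the shell command.
    root = ET.fromstring(xml_text)
    for qual in root.iter("GBQualifier"):
        if qual.findtext("GBQualifier_name") == "isolation_source":
            return qual.findtext("GBQualifier_value")
    return None

print(isolation_source(sample_xml))  # Type II sourdough
```

For the real record, fetch the URL shown in the wget command and pass the response body to the same function.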

Answer 1 (score: 0)

The data is loaded from an external URL. To get the isolation_source, you can use this example:

import re
import requests
from bs4 import BeautifulSoup

# The record itself is rendered by JavaScript, which is why checking the
# static HTML for the tag fails. The static page does carry a
# <meta name="ncbi_uidlist"> tag whose content is the sequence UID
# needed for the follow-up request.
url = "https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
ncbi_uidlist = soup.select_one('[name="ncbi_uidlist"]')["content"]

# Endpoint the page's JavaScript calls to load the GenBank report.
api_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"

params = {
    "id": ncbi_uidlist,
    "db": "nuccore",
    "report": "genbank",
    "extrafeat": "null",
    "conwithfeat": "on",
    "hide-cdd": "on",
    "retmode": "html",
    "withmarkup": "on",
    "tool": "portal",
    "log$": "seqview",
    "maxdownloadsize": "1000000",
}

# Request the GenBank report, extract the feature table, then pull the
# isolation_source qualifier out with a regex.
soup = BeautifulSoup(
    requests.get(api_url, params=params).content, "html.parser"
)
features = soup.select_one(".feature").text

isolation_source = re.search(r'isolation_source="([^"]+)"', features).group(1)
print(features)
print("-" * 80)
print(isolation_source)

Prints:

     source          1..12
                     /organism="Limosilactobacillus reuteri"
                     /mol_type="genomic DNA"
                     /strain="TMW1.112"
                     /isolation_source="Type II sourdough"
                     /db_xref="taxon:1598"
                     /country="Germany"
                     /collection_date="1998"

--------------------------------------------------------------------------------
Type II sourdough
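If you need more than one qualifier, the feature text can be parsed into a dict with a single regex instead of one search per field. A sketch, assuming the `/name="value"` qualifier layout shown in the output above:

```python
import re

# Excerpt of the feature table printed above.
features = '''
     source          1..12
                     /organism="Limosilactobacillus reuteri"
                     /mol_type="genomic DNA"
                     /strain="TMW1.112"
                     /isolation_source="Type II sourdough"
                     /db_xref="taxon:1598"
                     /country="Germany"
                     /collection_date="1998"
'''

# Every qualifier has the form /name="value"; capture both parts
# and build a name -> value mapping.
qualifiers = dict(re.findall(r'/(\w+)="([^"]+)"', features))

print(qualifiers["isolation_source"])  # Type II sourdough
print(qualifiers["country"])           # Germany
```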