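I can't for the life of me figure out how to use Beautiful Soup to scrape the isolation source information from a web page like this one: https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/
I keep trying to check whether the tag exists, but the check keeps reporting that it doesn't, even though I know for a fact that it does. If I can't even verify that it exists, I don't see how I can scrape it.
Thanks!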
Answer 0 (score: 1)
You shouldn't scrape NCBI when the NCBI E-utilities web service is available.
wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JOKX00000000.2&rettype=gb&retmode=xml" | xmllint --xpath '//GBQualifier[GBQualifier_name="isolation_source"]/GBQualifier_value/text()' - && echo
Type II sourdough
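If you'd rather stay in Python than shell out to wget and xmllint, the same request can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: the helper names `fetch_genbank_xml` and `extract_qualifier` are made up for illustration, while the URL and query parameters are the ones from the command above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Same efetch endpoint as in the wget command above.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_genbank_xml(accession):
    """Download the GenBank record for `accession` as XML (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=30) as response:
        return response.read()


def extract_qualifier(xml_text, name):
    """Return the values of every GBQualifier whose name equals `name`."""
    root = ET.fromstring(xml_text)
    values = []
    for qualifier in root.iter("GBQualifier"):
        if qualifier.findtext("GBQualifier_name") == name:
            value = qualifier.findtext("GBQualifier_value")
            if value is not None:
                values.append(value)
    return values
```

Calling `extract_qualifier(fetch_genbank_xml("JOKX00000000.2"), "isolation_source")` should then give the same value as the command above, as long as the service is reachable. It returns a list because a record may carry the qualifier more than once, or not at all.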
Answer 1 (score: 0)
The data is loaded from an external URL. To get isolation_source, you can use the following example:
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/"

# The record page itself only carries the numeric UID in a <meta> tag.
soup = BeautifulSoup(requests.get(url).content, "html.parser")
ncbi_uidlist = soup.select_one('[name="ncbi_uidlist"]')["content"]

# The GenBank text is served separately by the sequence viewer endpoint.
api_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
params = {
    "id": ncbi_uidlist,
    "db": "nuccore",
    "report": "genbank",
    "extrafeat": "null",
    "conwithfeat": "on",
    "hide-cdd": "on",
    "retmode": "html",
    "withmarkup": "on",
    "tool": "portal",
    "log$": "seqview",
    "maxdownloadsize": "1000000",
}
soup = BeautifulSoup(
    requests.get(api_url, params=params).content, "html.parser"
)

# Take the first feature block and pull the qualifier out of its text.
features = soup.select_one(".feature").text
isolation_source = re.search(r'isolation_source="([^"]+)"', features).group(1)

print(features)
print("-" * 80)
print(isolation_source)
Prints:
source 1..12
/organism="Limosilactobacillus reuteri"
/mol_type="genomic DNA"
/strain="TMW1.112"
/isolation_source="Type II sourdough"
/db_xref="taxon:1598"
/country="Germany"
/collection_date="1998"
--------------------------------------------------------------------------------
Type II sourdough
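The single `re.search(...).group(1)` call above raises an `AttributeError` when a record has no isolation_source. A slightly more forgiving variant is to collect every `/key="value"` pair of the feature text into a dict and look the key up afterwards. This is only a sketch: `parse_qualifiers` is a hypothetical helper name, and it assumes the feature text looks like the output shown above.

```python
import re

# Matches GenBank-style qualifiers such as /isolation_source="Type II sourdough".
QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(features_text):
    """Map each qualifier name to its value (hypothetical helper).

    If a name occurs more than once, the last value wins. Whitespace inside
    a value is collapsed so that values wrapped across lines stay readable.
    """
    qualifiers = {}
    for key, value in QUALIFIER_RE.findall(features_text):
        qualifiers[key] = " ".join(value.split())
    return qualifiers
```

With the `features` string from the example, `parse_qualifiers(features).get("isolation_source")` returns the value, or `None` if the record does not have one.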