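I need to access specific regions of DNA sequences, but there are too many of them to handle by hand. However, I noticed a URL pattern like this:
https://www.ncbi.nlm.nih.gov/nuccore/AF193276.1?report=fasta&log$=seqview&format=text&from=1311&to=4322
With this link I can access the HIV DNA sequence (ID: AF193276.1) from position 1311 to 4322, as shown below.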
>AF193276.1:1311-4322 HIV-1 CRF03_AB isolate KAL153 from Russia, complete genome
TTTTTTAGGGAGAATTTGGCCTTCCAGCAAAGGGAGGCCAGGAAATTTTCCTCAGAGCAGACCAGAGCCA
TCAGCCCCACCAGCAGAAAACTTTGGGATGGGGGAAGAGATAACCCCCTCCCTGAAACAGGAACAGAAGG
ACAGGGAACAGCATCCTCCTTCAATTTCCCTCAAATCACTCTTTGGCGACGACCCCTTGTCACAGTAAGA
ATAGGAGGACAGCTAAAAGAAGCTCTATTAGATACAGGAGCAGATGATACAGTATTAGAAGACATAAATT
TGCCAGGAAAATGGAAACCAAAAATGATAGGGGGGATTGGAGGTTTTATCAAGGTAAGACAGTATGATCA
GATACTTATAGAAATTTGTGGAAAAAAGGCTATAGGTACGGTATTAGTAGGACCTACCCCTGTCAACATA
ATTGGAAGAAATATGTTGACTCAGCTTGGTTGTACTTTAAATTTTCCAATAAGTCCTATTGAAACTGTAC
CAGTAACATTAAAGCCAGGAATGGATGGCCCAAAGGTTAAACAATGGCCATTAACAGAAGAGAAAATAAA
AGCATTAACAGACATTTGTAAGGAGATGGAAAAGGAAGGAAAAATTTCAAAAATTGGGCCTGAAAATCCA
TACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGGTTTCAGAG
AACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACACCCTGCAGGGTTAAAAAA
GAAAAAATCTGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATCAAGACTTCAGA
AAGTATACTGCATTTACCATACCTAGTACAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGC
TTCCACAGGGATGGAAAGGATCACCAGCAATTTTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAG
AAAACAAAATCCAGAGATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAGATA
GGGCAGCATAGAACAGAAATAGAGGAACTGAGAGAACATCTGCTGAGGTGGGGATTTACCACACCAGACA
AAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACTGTACA
GCCTATAGTGTTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGCTAGTGGGAAAATTGAAT
TGGGCAAGTCAGATTTATGCAGGGATTAAAGTAAGGCAATTATGTAAACTCCTTAGGGGAGCCAAAGCAC
TAACAGAAGTAATACCACTAACAGCAGAAGCAGAGCTAGAACTGGCAGAAAACAGGGAGATTCTAAAAGA
ACCAGTACATGGAGTGTATTATGACCCATCAAAAGACTTAGTAGCAGAAATACAGAAGCAGGGACAAGGC
CAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTGAAAACAGGAAAATATGCAAGACTGAGGG
GTGCCCACACTAATGACGTAAAACAGTTAACAGAGGCAGTGCAAAAAATAGCCACTGAAAGCATAGTAAT
ATGGGGAAAGACTCCTAAATTTAAACTACCCATACAAAAAGAAACATGGGAAACATGGTGGACAGAGTAT
TGGCAAGCCACCTGGATTCCTGAGTGGGAATTTGTCAATACCCCTCCCTTAGTAAAATTATGGTACCAGT
TAGAGAAAGAACCCATAGTAGGAGCAGAAACTTTCTATGTAGATGGAGCAGCTAATAGGGAGACTAAATC
AGGAAAAGCAGGATATGTTACTGACAGAGGAAGACAAAAGGTTGTCTCCCTAACTGACACAACAAATCAG
AAGACTGAGTTACAAGCAATTCATCTAGCTTTGCAGGATTCGGGATTAGAAGTAAACATAGTAACAGACT
CACAATATGCATTAGGAATCATTCAAGCACAACCAGATAAGAGTGAATCAGAGTTAGTCAGTCAAATAAT
AGAGCAGTTGATAAAAAAGGAAAAGGTCTACCTGGCATGGGTACCAGCACACAAAGGAATTGGAGGAAAT
GAACAAGTTGATAAATTAGTCAGTGCTGGAATCAGGGAAGTACTATTTTTAGATGGAATAGATAAGGCAC
AAGAAGAACATGAGAAATATCACGGTAATTGGAGAGCAATGGCTAGTGATTTTAACCTGCCACCTGTGGT
AGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAATTAAAAGGAGAAGCCATGCACGGACAAGTAGAC
TGTAGTCCAGGAATATGGCAACTAGATTGTACACATTTAGAAGGAAAAATTATCCTAGTAGCAGTTCATG
TAGCCAGTGGATATATAGAAGCAGAAGTTATTCCAGCAGAAACAGGACAGGAAACAGCATACTTTGTCTT
AAAATTAGCAGGAAGATGGCCAGTAAAAATAATACATACAGACAATGGCAGCAATTTCACCAGTACTGCG
GTTAAGGCTGCCTGTTGGTGGGCAGGGATCAAGCAGGAATTTGGCATTCCCTACAATCCCCAAAGTCAAG
GAGTAGTAGAATCTATGAATAAACAATTAAAGCAAACTATAGGACAGGTAAGAGATCAAGCTGAACATCT
TAAGACAGCAGTACAAATGGCAGTATTCATCCACAATTTTAAAAGAAAAGGGGGGATTGGGGGGTACAGT
GCAGGGGAAAGAATAATAGACATAATAGCAACAGACATACAAACTAAAGAATTACAAAAACAAATTATAA
AAATTCAAAATTTTCGGGTTTATTACAGAGACAGCAGAGATCCAATTTGGAAAGGACCAGCAAAACTACT
CTGGAAAGGTGAAGGGGCAGTGGTAATACAGGACAATAACGATATAAAAGTAGTACCAAGAAGAAAAGCA
AAGATCATTAGGGATTATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATT
AG
Apart from some strains, I need all of the information on this web page.
If I change the strain ID and the DNA positions in the URL, I think I can copy all of these DNA sequences.
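A minimal sketch of that idea, assuming the URL pattern above stays stable. The helper name build_fasta_url and the regions list are illustrative only; the two entries reuse the IDs and ranges that appear in this question. This only builds the links and does not download anything.

```python
from urllib.parse import urlencode

# Base of the URL pattern shown above.
BASE = "https://www.ncbi.nlm.nih.gov/nuccore/"


def build_fasta_url(accession, start, stop):
    """Build the FASTA viewer URL for one accession and position range.

    Hypothetical helper: it only mirrors the query string seen in the
    example link (report, log$, format, from, to).
    """
    query = urlencode(
        {
            "report": "fasta",
            "log$": "seqview",
            "format": "text",
            "from": start,
            "to": stop,
        },
        safe="$",  # keep the literal "$" in "log$" as in the original link
    )
    return f"{BASE}{accession}?{query}"


# (strain ID, from, to) triples; these two come from the question itself.
regions = [("AF193276.1", 1311, 4322), ("AF061641.1", 192, 1684)]
urls = [build_fasta_url(acc, start, stop) for acc, start, stop in regions]
for u in urls:
    print(u)
```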
I tried the approach from "Copy and paste text from webpage to txt file or csv file":
import requests
url = 'https://www.ncbi.nlm.nih.gov/nuccore/AF061641.1?report=fasta&log$=seqview&format=text&from=192&to=1684'
data = requests.get(url)
with open('file.txt','w') as out_f:
    out_f.write(str(data.text.encode('utf-8')))
But I got this:
b'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head xmlns:xi="http://www.w3.org/2001/XInclude"><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n <!-- meta -->\n <meta name="robots" content="index,nofollow,noarchive" />\n<meta name="ncbi_app" content="entrez" /><meta name="ncbi_db" content="nuccore" /><meta name="ncbi_report" content="fasta" /><meta name="ncbi_format" content="text" /><meta name="ncbi_pagesize" content="20" /><meta name="ncbi_sortorder" content="default" /><meta name="ncbi_pageno" content="1" /><meta name="ncbi_resultcount" content="1" /><meta name="ncbi_op" content="retrieve" /><meta name="ncbi_pdid" content="fasta" /><meta name="ncbi_sessionid" content="CE8C1A31EBE79081_2023SID" /><meta name="ncbi_uidlist" content="3403216" /><meta name="ncbi_filter" content="all" /><meta name="ncbi_stat" content="false" /><meta name="ncbi_hitstat" content="false" />\n\n \n <!-- title -->\n <title>HIV-1 isolate HH8793 clone 12.1 from Finland, complete genome - Nucleotide - NCBI</title>\n \n <!-- Common JS and CSS -->\n \n\t\t<script type="text/javascript">\n\t\t var ncbi_startTime = new Date();\n\t\t</script>\n\t\t<style>.async-hide { opacity: 0 !important} </style><script type="text/javascript" src="/core/assets/kis/dist/kis_ga_nuc_protein.js"></script><script type="text/javascript" src="https://static.pubmed.gov/core/jig/1.14.8/js/jig.min.js"></script>\n\t\t\t\t\n\t\t\t<script type="text/javascript" src="/core/ajax_loader/2.1/js/loadingbar.js"></script> \n \t<script type="text/javascript" src="/core/ajax_loader/2.1/js/contentLoader.js"></script>\n \t<link type="text/css" rel="stylesheet" href="/core/ajax_loader/2.1/css/loadingbar.css" />\n\t\t\t\n\t\t\t \n \n <link xmlns="http://www.w3.org/1999/xhtml" type="text/css" rel="stylesheet" 
href="//static.pubmed.gov/portal/portal3rc.fcgi/4187342/css/3881636/3579733.css" xml:base="http://127.0.0.1/sites/static/header_footer/" /> \n<link rel="shortcut icon" href="//www.ncbi.nlm.nih.gov/favicon.ico" /><meta name="ncbi_phid" content="CE8C1A31EBE6A9A10000000007E703EB.m_11" /><script type="text/javascript"><!--\nvar ScriptPath = \'/portal/\';\nvar objHierarchy = {"name":"EntrezSystem2","type":"Layout","realname":"EntrezSystem2",\n"children":[{"name":"EntrezSystem2.PEntrez","type":"Cluster","realname":"EntrezSystem2.PEntrez",\n"children":[{"name":"EntrezSystem2.PEntrez.DbConnector","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.DbConnector","shortname":"DbConnector"},\n{"name":"EntrezSystem2.PEntrez.ParamContainer","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.ParamContainer","shortname":"ParamContainer"},\n{"name":"EntrezSystem2.PEntrez.MyNcbi","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.MyNcbi","shortname":"MyNcbi"},\n{"name":"EntrezSystem2.PEntrez.UserPreferenceUrlParamContainer","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.UserPreferenceUrlParamContainer","shortname":"UserPreferenceUrlParamContainer"},\n{"name":"EntrezSystem2.PEntrez.GridProperty","type":"Portlet","realname":"EntrezSystem2.PEntrez.PEntrez.GridProperty","shortname":"GridProperty"},\n{"name":"EntrezSystem2.PEntrez.Nuccore","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NoPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NoPortlet","shortname":"NoPortlet"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_PageController","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_PageController","shortname":"Sequence_PageController"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_SearchBar","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_SearchBar","shortname":"Entrez_Sea
rchBar"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_BotRequest","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_BotRequest","shortname":"Entrez_BotRequest"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_LimitsTab","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_LimitsTab","shortname":"Sequence_LimitsTab"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.blankToolPanel","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.blankToolPanel","shortname":"blankToolPanel"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_ResultsController","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_ResultsController","shortname":"Sequence_ResultsController"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Filters","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Filters","shortname":"Entrez_Filters"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Pager","shortname":"Entrez_Pager"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_DisplayBar","shortname":"Sequence_DisplayBar"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.HelpFormAttributes","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.HelpFormAttributes","shortname":"HelpFormAttributes"},\n{"name":"EntrezS
ystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Collections","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Entrez_Collections","shortname":"Entrez_Collections"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SpellCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.SpellCheck","shortname":"SpellCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SearchEngineReferralCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.SearchEngineReferralCheck","shortname":"SearchEngineReferralCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.KnowledgePanel","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.KnowledgePanel","shortname":"KnowledgePanel"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.HistoryDisplay","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.HistoryDisplay","shortname":"HistoryDisplay"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Discovery_SearchDetails","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Entrez_ResultsPanel.Discovery_SearchDetails","shortname":"Discovery_SearchDetails"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.KISSensor","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.KISSensor","shortname":"KISSensor"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.MultiSensorPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.MultiSensorPortlet","shortname":"MultiSensorPortlet"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.WrongDbSensor","type":"Portlet","realname":"EntrezSystem2.
PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.WrongDbSensor","shortname":"WrongDbSensor"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DiscoveryExptChooser","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.Sequence_DiscoveryExptChooser","shortname":"Sequence_DiscoveryExptChooser"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerTitle","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerTitle","shortname":"Sequence_ViewerTitle"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport","shortname":"Sequence_ViewerReport"}]},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.EmptyPortlet","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_ResultsPanel.EmptyPortlet","shortname":"EmptyPortlet"}]},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_Facets","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence_Facets","shortname":"Sequence_Facets"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_Clipboard","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_Clipboard","shortname":"Entrez_Clipboard"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Sequence_StaticParts","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Sequence_StaticParts","shortname":"Sequence_StaticParts"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.Entrez_Messages","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.Entrez_Messages","sho
rtname":"Entrez_Messages"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NcbiJSCheck","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NcbiJSCheck","shortname":"NcbiJSCheck"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.Footer_ExtraData","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.Footer_ExtraData","shortname":"Footer_ExtraData"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic","type":"Cluster","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic",\n"children":[{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIBreadcrumbs","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIBreadcrumbs","shortname":"NCBIBreadcrumbs"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIHelpDesk","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIHelpDesk","shortname":"NCBIHelpDesk"},\n{"name":"EntrezSystem2.PEntrez.Nuccore.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIApplog_NoScript_Ping","type":"Portlet","realname":"EntrezSystem2.PEntrez.Nuccore.Sequence.Entrez_Database.NCBIFooter_dynamic.NCBIFooter_dynamic.NCBIApplog_NoScript_Ping","shortname":"NCBIApplog_NoScript_Ping"}]}]}]}]}]};\n--></script>\n<meta name=\'referrer\' content=\'origin-when-cross-origin\'/><link type="text/css" rel="stylesheet" href="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/css/3808861/3917732/3974050/3751656/3395415/4091728/3257261.css" /><link type="text/css" rel="stylesheet" 
href="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/css/3501913.css" media="print" /><script type="text/javascript">\n\nvar ObjectLinks=[{i:0, ename: "p$ExL", esid:"*", sname: "p$ExL", ssid:"*", dname:"p$el", dsid:"0",m:"CopyValue",p:[],f: function(src, dst) {fn_CopyValue(src, dst);}}]\n\n\nvar ActiveNames = {"p$ExL":1, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ExpandGaps":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.InUse":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ItemCount":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.db":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.display_type":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.fasta_text_params":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.maxdownloadsize":0, "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.report":0};\n</script></head>\n <body>\n <form enctype="application/x-www-form-urlencoded" name="EntrezForm" method="post" onsubmit="return false;" action="/nuccore" id="EntrezForm">\n <div id="maincontent" class="container">\n <div>\n <div id="viewercontent1" class="seq gbff" val="3403216" SequenceSize="14415" VirtualSequence=""></div>\n <div class="hidden">\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.db" sid="1" type="hidden" value="nuccore" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.report" sid="1" type="hidden" value="fasta_text" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.maxdownloadsize" sid="1" type="hidden" value="1000000" />\n <input 
name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.display_type" sid="1" type="hidden" value="single" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ItemCount" sid="1" type="hidden" value="1" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.InUse" sid="1" type="hidden" value="" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.ExpandGaps" sid="1" type="hidden" value="" />\n <input name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.SequenceViewer.Sequence_ViewerReport.fasta_text_params" sid="1" type="hidden" value="&from=192&to=1684" />\n </div>\n</div>\n\n </div>\n <input type="hidden" name="p$a" id="p$a" /><input type="hidden" name="p$l" id="p$l" value="EntrezSystem2" /><input type="hidden" name="p$st" id="p$st" value="nuccore" /><input name="SessionId" id="SessionId" value="CE8C1A31EBE79081_2023SID" disabled="disabled" type="hidden" /><input name="Snapshot" id="Snapshot" value="/projects/Sequences/SeqDbRelease@1.124" disabled="disabled" type="hidden" /></form>\n \n\n<!-- CE8C1A31EBE79081_2023SID /projects/Sequences/SeqDbRelease@1.124 portal105 v4.1.r585844 Mon, May 06 2019 02:53:16 -->\n\n\n<script type=\'text/javascript\' src=\'/portal/js/portal.js\'></script><script type="text/javascript" src="//static.pubmed.gov/portal/portal3rc.fcgi/4189877/js/4184195/3217400/4176568/4177091.js" snapshot="nuccore"></script></body>\n</html>'
I now have all the strain IDs and position ranges, so I need to copy these DNA sequences for analysis.
Thanks in advance.
Answer 0 (score: 0)
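The error you ran into can be fixed quickly.
What you are doing is essentially called "web scraping", and to get the output you want you need an additional package such as Selenium or BeautifulSoup.
I personally prefer BeautifulSoup over Selenium, but that is just my opinion. Here is a BeautifulSoup implementation.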
from bs4 import BeautifulSoup
import requests

# Same URL as in the question
url = 'https://www.ncbi.nlm.nih.gov/nuccore/AF061641.1?report=fasta&log$=seqview&format=text&from=192&to=1684'
data = requests.get(url)
# Parse the raw response bytes into a searchable tree
content = BeautifulSoup(data.content, "html.parser")
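This is roughly what part of the finished code looks like. The "content" variable now holds the full HTML structure of the web page. All that remains is to find the information you need in that HTML structure and extract it.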
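As a rough illustration of "finding the information in the HTML structure", the sketch below looks for the placeholder element that appears in the response quoted in the question, the div with id "viewercontent1". To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup; with BeautifulSoup the equivalent lookup would be content.find("div", id="viewercontent1"). The class and function names are made up for this example, and the sample string is copied from the quoted response. Note that in that response the div is empty, which suggests the sequence text itself is filled in by a script after the page loads, so parsing this HTML may only yield metadata.

```python
from html.parser import HTMLParser


class ViewerDivFinder(HTMLParser):
    """Record the attributes of <div id="viewercontent1"> if present."""

    def __init__(self):
        super().__init__()
        self.viewer_attrs = None  # stays None when the div is not found

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attribute names arrive lower-cased
        if tag == "div" and attr_map.get("id") == "viewercontent1":
            self.viewer_attrs = attr_map


def find_viewer_div(html):
    """Return the placeholder div's attributes as a dict, or None."""
    finder = ViewerDivFinder()
    finder.feed(html)
    return finder.viewer_attrs


# Fragment copied from the response shown in the question.
sample = (
    '<div id="viewercontent1" class="seq gbff" val="3403216" '
    'SequenceSize="14415" VirtualSequence=""></div>'
)
viewer = find_viewer_div(sample)
print(viewer["val"], viewer["sequencesize"])
```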
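Only a little work is left to finish the code, but I cannot complete it fully because I do not have all the required parameters. After about half an hour of reading up on web scraping, you should be able to do the rest yourself.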
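If the sequence text turns out not to be present in the fetched HTML, one alternative worth considering is NCBI's E-utilities efetch endpoint, which returns plain FASTA text and accepts seq_start and seq_stop for a sub-range. The sketch below is an assumption-laden outline, not a tested pipeline: the function names, the output file name, and the 0.4 second pause (chosen to stay under NCBI's guideline of about three requests per second without an API key) are choices made for this example. It uses only the standard library.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, start, stop):
    """Build an efetch URL for one nucleotide sub-range in FASTA text."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accession, start, stop):
    """Download one FASTA record (performs a network request)."""
    with urlopen(efetch_url(accession, start, stop), timeout=30) as resp:
        return resp.read().decode("utf-8")


def save_regions(regions, path):
    """Fetch every (accession, start, stop) triple and write them to one file."""
    with open(path, "w", encoding="utf-8") as out_f:
        for accession, start, stop in regions:
            out_f.write(fetch_fasta(accession, start, stop))
            time.sleep(0.4)  # pause between requests to be polite to the server
```

Calling save_regions([("AF061641.1", 192, 1684)], "file.txt") would then write the requested range to file.txt, assuming the service is reachable.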
Hope this helps! :)