Question

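I am new to web scraping. I am trying to get the FASTA file from here (the NCBI nuccore page used in the code below), but I have not been able to. The trouble starts with the tag I am targeting: I tried a few suggestions, but none of them worked for me. I suspect there may be a privacy restriction.

The FASTA data is in this class, but when I run this code I only see the FASTA title: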

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/nuccore/193211599?report=fasta"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
fasta_data = soup.find_all("div")
# Print the text of every div that carries the sequence-viewer classes
for link in soup.find_all("div", {"class": "seqrprt seqviewer"}):
    print(link.text)

# When I try to reach the spans directly, the output is empty.
div = soup.find("div", {"id": "viewercontent1"})
spans = div.find_all("span")
for span in spans:
    print(span.string)

Answer 1

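Every scraping job involves two phases:

1. Understand the page you want to scrape. (How does it work? Is the content loaded via Ajax? Are there redirects? POST? GET? iframes? Anti-scraping measures? ...)
2. Emulate the web page using your favourite framework.

Do not write a single line of code before you have worked through point 1. The network inspector in Google Chrome's developer tools is your friend, use it!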

As for your page, it seems the report is loaded into a viewer that fetches its data from this URL:

https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=193211599&db=nuccore&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000

Request that URL directly and you will get the report.
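
For illustration, here is a minimal sketch of requesting the viewer URL directly. It uses only the standard library so it runs without extra packages; `requests.get` would do the same job. The helper names (`build_viewer_url`, `strip_markup`, `parse_fasta`, `fetch_fasta`) are hypothetical, and I am assuming the response is HTML whose text content is the FASTA record. I have not checked the exact markup NCBI returns, so treat the parsing step as something to adapt.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

VIEWER_URL = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def build_viewer_url(seq_id):
    """Rebuild the viewer URL shown above for a given nuccore ID."""
    params = {
        "id": seq_id,
        "db": "nuccore",
        "report": "fasta",
        "extrafeat": 0,
        "fmt_mask": 0,
        "retmode": "html",
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",
        "maxdownloadsize": 1000000,
    }
    # safe="$" keeps the "log$" key unescaped, matching the URL above
    return VIEWER_URL + "?" + urlencode(params, safe="$")


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html):
    """Drop all tags and return the remaining text (entities decoded)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def fetch_fasta(seq_id, timeout=30):
    """Download the report and parse it (assumes an HTML-wrapped FASTA body)."""
    with urlopen(build_viewer_url(seq_id), timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return parse_fasta(strip_markup(html))
```

Calling `fetch_fasta("193211599")` should then give you a list of `(header, sequence)` tuples, provided the server still answers that URL the way it did when I looked at the page. Keeping the URL building and the parsing in separate functions means you can test both without touching the network.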
