Question

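I am trying to extract data from this page: "https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank#". When I fetch the content with urllib, I only get what the browser shows under "View Page Source". What I want is the actual sequence, 'atggctgaga tgaaaaacct gaaaattgag gtggtgcgct ataacccgga ....', which is visible when I right-click in the browser and choose "Inspect Element", but not under "View Page Source".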

The code I am using is:

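from urllib.request import urlopen  # Python 3; the original post used Python 2's urllib.urlopen

url = "https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank"
# Save the raw HTML, i.e. what "View Page Source" shows.
with urlopen(url) as response, open('out.html', 'wb') as f:
    f.write(response.read())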

Answer 1

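You should take the time to actually look at the page you are scraping. It is just a page that loads a JS application, and that application then loads the actual data from another location: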

https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&retmode=text
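To get from that plain-text response to the sequence itself, something like the following might work. It is only a sketch: it assumes the endpoint still returns a GenBank flat file, where the bases sit between the ORIGIN line and the closing "//" line, and the names extract_sequence and fetch_sequence are made up for illustration.

```python
import re
from urllib.request import urlopen

# URL taken from this answer; retmode=text asks for the plain-text record.
URL = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
       "?val=209750423&db=nuccore&dopt=genbank&retmode=text")


def extract_sequence(genbank_text):
    """Return the bases between the ORIGIN line and the closing '//' line.

    Assumes GenBank flat-file layout (hypothetical helper, not an NCBI API).
    """
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Drop the leading position number and the spaces between groups.
            bases.append(re.sub(r"[\d\s]", "", line))
    return "".join(bases)


def fetch_sequence(url=URL):
    """Download the record and return its sequence (needs network access)."""
    with urlopen(url) as response:
        return extract_sequence(response.read().decode("utf-8"))

# Usage (needs network access): print(fetch_sequence())
```

Keeping the parsing separate from the download means the parsing step can be checked against a saved copy of the record without hitting the server again.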

By the way, be sure to check for copyright issues before scraping online content.

Answer 2

The data is loaded by JS, so you can get it like this:

import requests
from pyquery import PyQuery

# Fetch the viewer endpoint that serves the record as HTML (retmode=html).
r = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxplex=3&maxdownloadsize=1000000")
pq = PyQuery(r.content)

# Collect the text of every element with class "ff_line".
data = [d.text for d in pq(".ff_line")]

print(data)

How to use Python
