I am trying to use Python to download results from the following website:
http://david.abcc.ncifcrf.gov/api.jsp?type=GENBANK_ACCESSION&ids=CP000010,CP000125,CP000124,CP000124,CP000124,CP000124&tool=chartReport&annot=KEGG_PATHWAY
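I was attempting to use mechanize before I realized that the Download File link is driven by JavaScript, which mechanize does not support. So far my code opens the web page, as shown below. I am stuck on how to access the download link on the page so that I can save the data to my machine.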
import urllib2

def downloadFile():
    url = 'http://david.abcc.ncifcrf.gov/api.jsp?type=GENBANK_ACCESSION&ids=CP000010,CP000125,CP000124,CP000124,CP000124,CP000124&tool=chartReport&annot=KEGG_PATHWAY'
    t = urllib2.urlopen(url)
    s = t.read()
    print s
The printed result is:
<html>
<head></head>
<body>
<form name="apiForm" method="POST">
<input type="hidden" name="rowids">
<input type="hidden" name="annot">
<script type="text/javascript">
document.apiForm.rowids.value="4791928,3403495,...."; //There are really about 500 values
document.apiForm.annot.value="48";
document.apiForm.action = "chartReport.jsp";
document.apiForm.submit();
</script>
</form>
</body>
</html>
Does anyone know how I can select and move to the Download File page and save that file to my computer?
Answer 0 (score: 2)
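After some more research on that link, I came up with this. You can definitely use mechanize to do it: the JavaScript only fills in two hidden form fields and submits the form, so the same steps can be done by hand.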
import mechanize

def getJSVariableValue(content, variable):
    # Return the first double-quoted string that follows `variable` in the page source.
    value_start_index = content.find(variable)
    value_start_index = content.find('"', value_start_index) + 1
    value_end_index = content.find('"', value_start_index)
    value = content[value_start_index:value_end_index]
    return value

def getChartReport(url):
    br = mechanize.Browser()
    resp = br.open(url)
    content = resp.read()
    # The page's JavaScript normally fills in and submits this form; do it by hand.
    br.select_form(name = 'apiForm')
    br.form.set_all_readonly(False)  # hidden inputs are read-only by default
    br.form['rowids'] = getJSVariableValue(content, 'document.apiForm.rowids.value')
    br.form['annot'] = getJSVariableValue(content, 'document.apiForm.annot.value')
    br.form.action = 'http://david.abcc.ncifcrf.gov/' + getJSVariableValue(content, 'document.apiForm.action')
    print br.form['rowids']
    print br.form['annot']
    br.submit()
    # Follow the "Download File" link on the resulting chart report page.
    resp = br.follow_link(text_regex=r'Download File')
    content = resp.read()
    # Save the report and also return it so the caller can use it directly.
    with open('output.txt', 'w') as f:
        f.write(content)
    return content

url = 'http://david.abcc.ncifcrf.gov/api.jsp?type=GENBANK_ACCESSION&ids=CP000010,CP000125,CP000124,CP000124,CP000124,CP000124&tool=chartReport&annot=KEGG_PATHWAY'
chart_output = getChartReport(url)
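
The string searching in getJSVariableValue could also be done with a regular expression that picks up every document.apiForm assignment in one pass. The following is only a sketch of that idea, not something I have run against the live site: parse_api_form and ASSIGNMENT are names I made up for illustration, the sample text is a shortened copy of the page source from the question, and the imports are written so the snippet runs on both Python 2 and Python 3.

```python
import re

try:  # Python 3
    from urllib.parse import urlencode, urljoin
except ImportError:  # Python 2, which the code above targets
    from urllib import urlencode
    from urlparse import urljoin

# Matches assignments such as  document.apiForm.rowids.value="1,2,3";
# and  document.apiForm.action = "chartReport.jsp";
# It does not match the final document.apiForm.submit(); call.
ASSIGNMENT = re.compile(r'document\.apiForm\.(\w+)(?:\.value)?\s*=\s*"([^"]*)"')


def parse_api_form(html, page_url):
    """Return (post_url, fields) recovered from the auto-submitting form.

    Hypothetical helper: it assumes the page keeps the structure shown
    in the question.
    """
    values = dict(ASSIGNMENT.findall(html))
    # The action is relative ("chartReport.jsp"), so resolve it against the page URL.
    post_url = urljoin(page_url, values.pop('action'))
    return post_url, values


# Shortened copy of the script from the question (only two row ids).
sample = '''
document.apiForm.rowids.value="4791928,3403495";
document.apiForm.annot.value="48";
document.apiForm.action = "chartReport.jsp";
document.apiForm.submit();
'''

post_url, fields = parse_api_form(
    sample, 'http://david.abcc.ncifcrf.gov/api.jsp?type=GENBANK_ACCESSION')
# URL-encoded body that the browser would POST when the script submits the form.
body = urlencode(sorted(fields.items()))
print(post_url)
print(body)
```

With post_url and body in hand, the POST could in principle be sent without mechanize, for example with urllib2.urlopen(post_url, body) on Python 2. I have not checked whether the site also expects cookies from the first request, which mechanize handles for you, so treat that part as an assumption.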