Python: extracting the .csv results after submitting data to a form with mechanize

Date: 2017-01-03 20:43:51

Tags: php python forms csv

I'm fairly comfortable extracting data from the web with Python. Thanks to some other posts and a webpage, I figured out how to submit data to a form using the module mechanize.

Now I'm stuck on how to extract the results. Submitting the form produces a lot of different output, but it would be perfect if I could access the .csv file it generates. I assume you have to use the module mechanize for this too, but how do you download the results through Python?

After the job runs, the csv file sits under: Summary => Results => Download Heavy Chain Table (you can just click "Load Example" to see how the webpage works).

My script so far:

import re
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this

url = 'http://circe.med.uniroma1.it/proABC/index.php'
response = br.open(url)

br.form = list(br.forms())[1]

# Controls can be found by name
control1 = br.form.find_control("light")

# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC"
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"

# To submit form
response = br.submit()
content = response.read()
# print content

result = re.findall(r"Prob_Heavy.csv", content)
print result

When I print content, the lines I'm interested in look like this:

<h2>Results</h2><br>
Predictions for Heavy Chain:
<a href='u17003I9f1/Prob_Heavy.csv'>Download Heavy Chain Table</a><br>
Predictions for Light Chain:
<a href='u17003I9f1/Prob_Light.csv'>Download Light Chain Table</a><br>

So the question is: how do I download/access these csv files?

4 Answers:

Answer 0 (score: 1)

Even though parsing HTML with regular expressions is a hack, if the format is always the same you can do:

result = re.findall("<a href='([^']*)'>", content)

Not sure it's the best/most fashionable solution, but I would use wget to download the files (note that wget here is the third-party Python package, installable with pip install wget, not the command-line tool):

import wget
for r in result:
    # compute full url
    csv_file = url.rpartition("/")[0]+"/"+r
    print("downloading {}".format(csv_file))
    # downloads and saves the .csv file in the current directory
    # "flattening" the path replacing slashes by underscores
    wget.download(csv_file,out=r.replace("/","_"))

Answer 1 (score: 1)

In Python 2, which you appear to be using, use urllib2:

>>> import urllib2
>>> URL = "http://circe.med.uniroma1.it/proABC/u17003I9f1/Prob_Heavy.csv"
>>> urllib2.urlopen(URL).read()

Or, if you are doing this dynamically based on the href, you can do something like:

>>> import urllib2
>>> href='u17003I9f1/Prob_Heavy.csv'
>>> URL = 'http://circe.med.uniroma1.it/proABC/' + href
>>> urllib2.urlopen(URL).read()

Answer 2 (score: 1)

Here's a quick and dirty example using BeautifulSoup and requests to avoid parsing the HTML with regular expressions. If you have pip but not BeautifulSoup installed, you can get it with sudo pip install bs4:

import re
import mechanize
from bs4 import BeautifulSoup as bs
import requests
import time


br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this

url_base = "http://circe.med.uniroma1.it/proABC/"
url_index = url_base + "index.php"

response = br.open(url_index)

br.form = list(br.forms())[1]

# Controls can be found by name
control1 = br.form.find_control("light")

# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC" 
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"

# To submit form
response = br.submit()
content = response.read()
# print content

soup = bs(content, "html.parser")
urls_csv = [x.get("href") for x in soup.findAll("a") if ".csv" in (x.get("href") or "")]
for file_path in urls_csv:
    status_code = 404
    retries = 0
    url_csv = url_base + file_path
    file_name = url_csv.split("/")[-1]
    while status_code == 404 and retries < 10:
        print "{} not ready yet".format(file_name)
        req = requests.get(url_csv)
        status_code = req.status_code
        retries += 1
        time.sleep(5)
    print "{} ready. Saving.".format(file_name)
    with open(file_name, "wb") as f:
        f.write(req.content)

Running the script from the REPL:

Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv ready. Saving.
Prob_Light.csv not ready yet
Prob_Light.csv ready. Saving.
>>> 
>>> 

Answer 3 (score: 0)

Both previous answers work fine once the webpage exists. However, while the job is running, the program needs some time to finish (about 30 seconds), so the files aren't there yet. I solved this by pausing the program with the time module:

from urllib2 import urlopen
import time

print "Job running..."
time.sleep(60)

csv_files = []

for href in result:
    URL = "http://circe.med.uniroma1.it/proABC/" + href + ".csv"    
    csv_files.append(urlopen(URL).read())
    print("downloading {}".format(URL))

print "Job finished"
print csv_files

I'm not sure it's the most elegant solution, but it did work in this case.