Web Scraping with Python

Time: 2013-04-03 17:52:26

Tags: python web-scraping mechanize

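I'm using this site (http://gasbuddy.com/) to collect gas prices. Basically, I want to write a Python script that enters a zip code into the search box at the top of the page and then scrapes the results from the next page. I'm stuck on the first step, which is entering the zip code I want into the form. Here is what I have so far: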

from mechanize import Browser

br = Browser()
baseURL = "http://www.gasbuddy.com/"
br.open(baseURL)

zipcode = "20010"

# Take the first form on the page and fill in the zip code field
forms = list(br.forms())
print(forms[0])
forms[0]["ctl00$Content$GBZS$txtZip"] = zipcode
br.form = forms[0]

# Submit the form and check which URL we ended up on
page = br.submit()
content = page.read()
print(br.geturl())

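Unfortunately, when I submit the form, br.geturl() tells me that I did not reach the page I want (the URL should look like "http://www.washingtondcgasprices.com/index.aspx?area=Washington%20-%20NE&area=Washington%20-%20NW&area=Washington%20-%20SE&area=Washington%20-%20SW").

I'd appreciate any guidance you can give. Thanks!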

1 Answer:

Answer 0: (score: 1)

You can use Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

baseURL = "http://www.gasbuddy.com/"

browser = webdriver.Firefox()
zipcode = "20010"

browser.get(baseURL)
# Type the zip code into the search box, then click the search button
browser.find_element(By.ID, "ctl00_Content_GBZS_txtZip").send_keys(zipcode)
browser.find_element(By.ID, "ctl00_Content_GBZS_btnSearch").click()

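If you want to stick with mechanize, you may want to tweak your browser settings a bit. But I still suspect that it's the JavaScript that is killing you there. In that case the solution is to "read the javascript yourself and simulate with mechanize what it would be doing".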
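To give a rough idea of what "simulating the JavaScript" could look like: the control names (ctl00$Content$...) suggest an ASP.NET WebForms page, where a postback script typically fills the hidden fields __EVENTTARGET and __EVENTARGUMENT and then submits the form together with the other hidden fields such as __VIEWSTATE. The sketch below only builds such a request body with the standard library. That the site actually works this way is an assumption, the button name ctl00$Content$GBZS$btnSearch is guessed from the element id used above, and the HTML fragment and helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, extra_fields, event_argument=""):
    """Return a URL-encoded form body resembling an ASP.NET postback."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)  # e.g. __VIEWSTATE, __EVENTVALIDATION
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(extra_fields)   # the visible fields the user would fill in
    return urlencode(payload)


# Hypothetical page fragment, for illustration only (not the real site).
sample_html = """
<form method="post" action="index.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$Content$GBZS$txtZip" />
</form>
"""

body = build_postback(
    sample_html,
    event_target="ctl00$Content$GBZS$btnSearch",  # assumed control name
    extra_fields={"ctl00$Content$GBZS$txtZip": "20010"},
)
print(body)
```

With mechanize the same idea would presumably mean making the hidden controls writable, setting them on the form, and then submitting. Whether that is enough depends on what the page's script really does, so read it first.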