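I want to scrape data from a website (its URL is in the code below). In the browser, I can get the tabular results (paginated) by entering "%" as the reference number.
Here is my code: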
import re
import mechanize
from bs4 import BeautifulSoup


class MineralDBScraper(object):
    def __init__(self):
        self.url = "http://dnre-mrne.gnb.ca/MineralOccurrence/default.aspx"
        self.br = mechanize.Browser()
        # Present ourselves as a regular desktop browser.
        self.br.addheaders = [('User-agent',
                               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]
        self.br.set_handle_redirect(True)
        self.br.set_handle_robots(False)

    def select_form(self, form):
        # Predicate for br.select_form: pick the form whose id is 'MainForm'.
        return form.attrs.get('id', None) == 'MainForm'

    def scrape_state_firms(self, state_item):
        self.br.open(self.url)
        s = BeautifulSoup(self.br.response().read())
        saved_form = s.find('form', id='MainForm').prettify()
        self.br.select_form(predicate=self.select_form)
        # Fill in the reference number field with the search value.
        self.br.form['ctl00$txtURN'] = state_item
        self.br.form.fixup()
        # Drop the reset control so it is not part of the submission.
        ctl = self.br.form.find_control('ctl00$reset1')
        self.br.form.controls.remove(ctl)
        self.br.submit()
        print(self.br.response().read())

    def scrape(self):
        print('Scraping all reference numbers by %')
        self.scrape_state_firms('%')


if __name__ == '__main__':
    scraper = MineralDBScraper()
    scraper.scrape()
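After running this code, I expected to get the tabular dataset for the search parameter '%'.
However, I get the previous page back instead, i.e. the landing page where the search parameter is entered.
Please help. Am I missing something here?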
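
As a debugging aid, here is a minimal sketch for listing the input controls of the form, so it is easier to see which submit buttons and hidden fields would go into the POST. It uses only the Python 3 standard library. The sample HTML is made up for illustration and is not the real page: `ctl00$txtURN` and `ctl00$reset1` come from the code above, while `ctl00$search1` and the `__VIEWSTATE` value are hypothetical placeholders. I am assuming the page is an ASP.NET WebForms page (it ends in `.aspx`), where such hidden fields are typical.

```python
from html.parser import HTMLParser


class FormControlLister(HTMLParser):
    """Collect (type, name) pairs for <input> tags inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside = False
        self.controls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.inside = True
        elif tag == "input" and self.inside:
            # An <input> without a type attribute defaults to a text field.
            self.controls.append((attrs.get("type", "text"), attrs.get("name")))

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside = False


def list_form_controls(html, form_id="MainForm"):
    """Return the (type, name) pairs of all inputs in the form with this id."""
    parser = FormControlLister(form_id)
    parser.feed(html)
    return parser.controls


# Hypothetical stand-in for the landing page; the real markup will differ.
SAMPLE_HTML = """
<form id="MainForm" method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc" />
  <input type="text" name="ctl00$txtURN" />
  <input type="submit" name="ctl00$search1" value="Search" />
  <input type="submit" name="ctl00$reset1" value="Reset" />
</form>
"""

controls = list_form_controls(SAMPLE_HTML)
submit_names = [name for kind, name in controls if kind == "submit"]
print(submit_names)
```

Running the same function on the HTML that `self.br.response().read()` returns (decoded to text) would show the real control names. If the form has more than one submit button, it may be worth checking which one is actually sent with the request.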