Getting data from the next page after supplying a search parameter with Mechanize and BeautifulSoup

Date: 2018-04-01 06:07:25

Tags: python beautifulsoup scrapy mechanize

I want to scrape data from

http://dnre-mrne.gnb.ca/MineralOccurrence/default.aspx

In a browser, entering "%" in the Reference Number field returns a table of results (paginated).
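
Since the results are paginated, the later pages will probably need their own requests. The page is an .aspx page, and ASP.NET grids commonly page through `javascript:__doPostBack(target, argument)` links. I have not confirmed that this site does so. A minimal sketch, assuming that pattern, of pulling the paging targets out of a results page; the sample markup and the `ctl00$gvResults` control name are hypothetical:

```python
import re
from html import unescape

# Hypothetical pager markup: the real control names and link format may differ.
SAMPLE_PAGER_HTML = """
<tr class="pager"><td>
  <span>1</span>
  <a href="javascript:__doPostBack(&#39;ctl00$gvResults&#39;,&#39;Page$2&#39;)">2</a>
  <a href="javascript:__doPostBack('ctl00$gvResults','Page$3')">3</a>
</td></tr>
"""

# Matches __doPostBack('target','argument') once HTML entities are decoded.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def find_postback_links(html):
    """Return (event_target, event_argument) pairs found in __doPostBack links."""
    return POSTBACK_RE.findall(unescape(html))


def paging_fields(target, argument):
    """Form fields a browser fills in before it submits a paging postback."""
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


links = find_postback_links(SAMPLE_PAGER_HTML)
for target, argument in links:
    print(paging_fields(target, argument))
```

If the site does work this way, each later page would be a POST of the same form with these two fields set, along with the hidden state fields from the page currently displayed.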

Here is my code snippet:

import re
import mechanize
from bs4 import BeautifulSoup

class MineralDBScraper(object):
    def __init__(self):
        self.url = "http://dnre-mrne.gnb.ca/MineralOccurrence/default.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent', 
                                   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

        self.br.set_handle_redirect(True)
        self.br.set_handle_robots(False)

    def select_form(self,form):
        return form.attrs.get('id', None) == 'MainForm'

    def scrape_state_firms(self, state_item):
        self.br.open(self.url)
        # Name the parser explicitly so BeautifulSoup does not guess one.
        s = BeautifulSoup(self.br.response().read(), 'html.parser')
        saved_form = s.find('form', id='MainForm').prettify()

        self.br.select_form(predicate=self.select_form)

        self.br.form['ctl00$txtURN'] = state_item
        self.br.form.fixup()

        ctl = self.br.form.find_control('ctl00$reset1')
        self.br.form.controls.remove(ctl)

        self.br.submit()
        print(self.br.response().read())

    def scrape(self):
        print('Scraping all reference numbers by %')
        self.scrape_state_firms('%')

if __name__ == '__main__':
    scraper = MineralDBScraper()
    scraper.scrape()

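After running this code, I expected to get the table of results for the search parameter '%'.

Instead, I get the previous page back, i.e. the landing page where the search parameter is entered.

Please help. Am I missing something here?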
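
One thing that may be worth checking is which fields the browser actually posts. This is a guess, not something verified against the site. ASP.NET pages usually expect the hidden state fields (`__VIEWSTATE`, `__EVENTVALIDATION`) together with the name of the button that was clicked, and a submit that omits the button can simply re-render the landing page. The sketch below builds such a payload from sample markup using only the standard library. Only `MainForm`, `ctl00$txtURN` and `ctl00$reset1` come from my code; the `ctl00$btnSearch` button name and the hidden-field values are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical landing-page markup. 'ctl00$btnSearch' and the hidden values
# are invented for illustration; the real page has to be inspected.
SAMPLE_FORM_HTML = """
<form id="MainForm" method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCd" />
  <input type="text" name="ctl00$txtURN" value="" />
  <input type="submit" name="ctl00$btnSearch" value="Search" />
  <input type="submit" name="ctl00$reset1" value="Reset" />
</form>
"""


class InputCollector(HTMLParser):
    """Collect (type, name, value) for every <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            input_type = (a.get("type") or "text").lower()
            self.inputs.append((input_type, a.get("name"), a.get("value") or ""))


def build_search_payload(html, search_value, submit_name):
    """Build the POST fields for a search: hidden fields plus one button."""
    parser = InputCollector()
    parser.feed(html)
    payload = {}
    for input_type, name, value in parser.inputs:
        if not name:
            continue
        if input_type == "hidden":
            payload[name] = value  # state fields are echoed back unchanged
        elif input_type == "submit" and name == submit_name:
            payload[name] = value  # only the clicked button is included
    payload["ctl00$txtURN"] = search_value
    return payload


payload = build_search_payload(SAMPLE_FORM_HTML, "%", "ctl00$btnSearch")
print(urlencode(payload))
```

With mechanize, the equivalent would presumably be to pass the search button's real name to `self.br.submit(name=...)` instead of calling `submit()` with no arguments, once that name has been read from the page source.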

0 answers:

No answers yet