How do I get data out of the web query tool of the Israeli Bureau of Statistics?

Date: 2011-06-21 17:43:32

Tags: python perl r screen-scraping security

The following URL:

http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7

serves an information/data generator provided by the Israeli government, which limits each extraction to at most 50 series. I would like to know whether it is possible (and if so, how) to write a web scraper (in your preferred language/software) that can follow each click so as to obtain all the series in a particular subject.

Thanks.

3 answers:

Answer 0 (score: 8)

Take a look at WWW::Mechanize and WWW::HtmlUnit.

#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $m = WWW::Mechanize->new;

#get page
$m->get("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7");

#submit the form on the first page
$m->submit_form(
    with_fields => {
        name_tatser => 2, #Orders for export
    }
);

#now that we have the second page, submit the form on it
$m->submit_form(
    with_fields => {
        name_ser => 1576, #Number of companies that answered
    }
);

#and so on...

#printing the source HTML is a good way
#to find out what you need to do next
print $m->content;

Answer 1 (score: 3)

To submit the forms, you can use Python's mechanize module:

import mechanize
import pprint
import lxml.etree as ET
import lxml.html as lh
import urllib
import urllib2

browser=mechanize.Browser()
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.select_form(nr=0)

Here we look at the available options:

pprint.pprint(browser.form.controls[-2].items)
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>,
#  <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>,
#  <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>,
#  <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>,
#  <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>,
#  <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>,
#  <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>]

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1', '2', '3', '4', '5', '6', '7']

browser.form['name_tatser']=['2']
browser.submit()

We can repeat this for each of the subsequent forms:

browser.select_form(nr=1)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593']

browser.form['name_ser']=['1576']
browser.submit()

browser.select_form(nr=2)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41']

browser.form['data_kind']=['33']
browser.submit()

browser.select_form(nr=3)
browser.form['ybegin']=['2010']
browser.form['mbegin']=['1']
browser.form['yend']=['2011']
browser.form['mend']=['5']
browser.submit()

At this point you have three options:

  1. Parse the data out of the HTML source
  2. Download a .xls file
  3. Download an XML file

I have no experience parsing .xls in Python, so I passed on that option.

Parsing the HTML with BeautifulSoup or lxml might have been the shortest solution, but finding the right XPath into the HTML was not clear to me, so I went with the XML.

To download the XML from the cbs.gov.il site, you would normally click a button that calls a javascript function. Uh oh, mechanize can't execute javascript functions. Thankfully, the javascript merely assembles a new url. Pulling out the parameters with lxml is easy:

    content=browser.response().read()
    doc=lh.fromstring(content)
    params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input'))
    params['king_format']=2
    url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml'
    params=urllib.urlencode(dict((p,params[p]) for p in [
        'king_format',
        'tod',
        'time_unit_list',
        'mend',
        'yend',
        'co_code_list',
        'name_tatser_list',
        'ybegin',
        'mbegin',
        'code_list',
        'co_name_tatser_list',
        'level_1',
        'level_2',
        'level_3']))
    
    browser.open(url+'?'+params)
    content=browser.response().read()
    

Now we reach another stumbling block: the XML is encoded in iso-8859-8-i, an encoding Python does not recognize. Not knowing what else to do, I simply replaced iso-8859-8-i with iso-8859-8. I don't know what bad side effects that might have.

    # A hack, since I do not know how to deal with iso-8859-8-i
    content=content.replace('iso-8859-8-i','iso-8859-8')
    doc=ET.fromstring(content)
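For reference, the string-replace hack above works, but Python's codec machinery can also be taught the missing name: iso-8859-8-i uses the same byte-to-character table as iso-8859-8 (the "-i" suffix only signals implicit bidirectional ordering), so registering an alias lets plain-Python decoding accept it. This is a sketch of the idea; note that lxml's parser does its encoding lookup in libxml2 rather than in Python's codecs registry, so this only helps where you decode the bytes on the Python side:

```python
import codecs

def hebrew_alias(name):
    # Search functions receive a normalized encoding name
    # (lowercased, with "-" turned into "_"), so normalize back.
    if name.replace("_", "-") == "iso-8859-8-i":
        return codecs.lookup("iso-8859-8")
    return None

codecs.register(hebrew_alias)

# The same bytes decode identically under both names now.
print(b"\xf9\xec\xe5\xed".decode("iso-8859-8-i"))  # שלום
```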
    

Once you've got that far, parsing the XML is easy:

    for series in doc.xpath('/series_ts/Data_Set/Series'):
        print(series.attrib)
        # {'calc_kind': 'Weighted',
        #  'name_ser': 'Number Of Companies That Answered',
        #  'get_time': '2011-06-21',
        #  'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers  - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export",
        #  'time_unit': 'Month',
        #  'code_series': '22978',
        #  'data_kind': '5-10 Employed Persons',
        #  'decimals': '0',
        #  'unit_kind': 'Number'}
    
        for elt in series.xpath('obs'):
            print(elt.attrib)
            # {'time_period': ' 2010-12', 'value': '40'}
            # {'time_period': ' 2011-01', 'value': '38'}
            # {'time_period': ' 2011-02', 'value': '40'}
            # {'time_period': ' 2011-03', 'value': '36'}
            # {'time_period': ' 2011-04', 'value': '30'}
            # {'time_period': ' 2011-05', 'value': '33'}
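As a closing note on the 50-series cap mentioned in the question: once you have collected the full list of series codes from the form's choice lists, splitting them into request-sized batches is a one-liner loop. The codes below are made up for illustration; only the batching logic matters:

```python
# Hypothetical: pretend we scraped 120 series codes from the choice lists.
series_codes = [str(code) for code in range(1000, 1120)]

def batches(seq, size=50):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

groups = list(batches(series_codes))
print([len(g) for g in groups])  # [50, 50, 20]
```

Each chunk can then be submitted as one request against the 50-series limit.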
    

Answer 2 (score: 1)

You should also check out Scrapy, a web-crawling framework for Python. For an introduction, see 'Scrapy at a glance': http://doc.scrapy.org/intro/overview.html