Getting data from a website: downloading links with an adjustable date range

Time: 2018-03-02 14:38:35

Tags: python-3.x web-scraping

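I'm really new to web scraping, so apologies for such an open-ended question. Mainly, I'd like to know whether this is feasible and how to go about it.

I only really use Python (which might cause some setbacks for web scraping?)

Annoyingly, this website only lets you download one month at a time, you have to define the date range manually, and you can only get one type of data at a time. This time I'll do it by hand, but in future, once I know more about web scraping, I'm sure there will be a neater way to handle it.

Any tips on where to start?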

https://www.regelleistung.net/ext/data/

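1 answer:

Answer 0: (score: 2)

Using Selenium is one option. I'm using Python 2.7. Be careful when configuring the webdriver; it can sometimes be a bit tricky. You can improve the script so that it only selects the months you need to download. This example downloads January through December.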

import os
import sys
import time
import calendar
import selenium
from selenium import webdriver
from selenium.webdriver.support.select import Select

# Open the website using the Chrome webdriver
browser = webdriver.Chrome()
try:
    browser.get('https://www.regelleistung.net/ext/data/')
except Exception:
    # Nothing below can work without the page, so report the problem and stop
    print('\n #### Cannot open the page, check your internet connection #### \n')
    browser.quit()
    sys.exit(1)

# Select the 'Herunterladen' (download) check button
browser.find_element_by_id('form-download').click()

list_years = ['2017', '2018']

for year in list_years:

    for month in range(1, 13):  # From January to December (the upper bound of range is exclusive)

        last_day = calendar.monthrange(int(year), int(month))[1] # Last day of a month

        start_date = '01' + '.' + str(month) + '.' + str(year)
        end_date = str(last_day) + '.' + str(month) + '.' + str(year)

        legend = ' \nRange ' + start_date + ' to ' + end_date + '\n'
        print(legend)

        # Fill the date range
        browser.find_element_by_id('form-from-date').clear()
        time.sleep(1)
        browser.find_element_by_id('form-from-date').send_keys(start_date)

        browser.find_element_by_id('form-to-date').clear()
        time.sleep(1)
        browser.find_element_by_id('form-to-date').send_keys(end_date)

        # Collect the text of every UNB option into a list
        unb = browser.find_element_by_id('form-tso')
        unb_list = [option.text for option in unb.find_elements_by_tag_name('option')]

        # Loop over every UNB option and download its data
        for unb_list_element in unb_list:

            time.sleep(1)
            select_unb = Select(browser.find_element_by_id('form-tso')) 
            select_unb.select_by_visible_text(unb_list_element)

            # Collect the text of every Datentyp option into a list
            datentyp = browser.find_element_by_id('form-type')
            datentyp_list = [option.text for option in datentyp.find_elements_by_tag_name('option')]

            # Loop over every Datentyp option and download its data
            for datentyp_list_element in datentyp_list:

                time.sleep(1)
                select_datentyp = Select(browser.find_element_by_id('form-type'))   
                select_datentyp.select_by_visible_text(datentyp_list_element)

                legend_buttons = '  Selecting ' + unb_list_element + ' and ' + datentyp_list_element + '...'
                print(legend_buttons)
                # Click to download the data
                time.sleep(1)
                browser.find_element_by_id('submit-button').click()
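
To download only the months you need, as suggested at the top of this answer, the date ranges can be generated separately and then used in place of the year and month loops. The following is a minimal sketch of that idea, not part of the script above: the helper name `month_ranges` and its `(year, month)` tuple arguments are hypothetical, and it assumes the form accepts dates written the same way as in the script (two-digit day, unpadded month, four-digit year).

```python
import calendar


def month_ranges(first, last):
    """Yield (start_date, end_date) strings, one pair per calendar month.

    `first` and `last` are (year, month) tuples, both inclusive.
    Dates are formatted like the script above, e.g. '01.1.2018'
    (assumption: the form accepts an unpadded month).
    """
    year, month = first
    while (year, month) <= last:
        # calendar.monthrange returns (weekday of day 1, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        start_date = '01.{}.{}'.format(month, year)
        end_date = '{}.{}.{}'.format(last_day, month, year)
        yield start_date, end_date
        # Advance to the next month, rolling over into the next year after December
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1


# Example: November 2017 through February 2018
for start_date, end_date in month_ranges((2017, 11), (2018, 2)):
    print(start_date + ' to ' + end_date)
```

Each pair could then be typed into the `form-from-date` and `form-to-date` fields exactly as the script does with `start_date` and `end_date`.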