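I'm really new to web scraping, so apologies for such an open-ended question. Mostly I'd like to know whether this is feasible and how to go about it.
I only really use Python (which might be a handicap for web scraping?).
Annoyingly, this website only lets you download one month at a time, you have to define the date range manually, and you can only get one type of data at a time. I'll do it by hand this time, but once I know more about web scraping I'm sure there will be a neater way to handle it in future.
Any tips on where to start?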
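Answer 0: (score: 2)
Using Selenium is one option. I'm using Python 2.7. Take care when configuring the webdriver; it can sometimes be a bit tricky. You could improve this to select only the months you need to download. This example downloads January through December.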
import sys
import time
import calendar
from selenium import webdriver
from selenium.webdriver.support.select import Select

# Open the website using the Chrome webdriver
browser = webdriver.Chrome()
try:
    browser.get('https://www.regelleistung.net/ext/data/')
except Exception:
    print('\n #### Cannot open the page, check your internet connection #### \n')
    browser.quit()
    sys.exit(1)

# Select the 'Herunterladen' (download) option
browser.find_element_by_id('form-download').click()

list_years = ['2017', '2018']
for year in list_years:
    for month in range(1, 13):  # From January to December (range() excludes its end value)
        last_day = calendar.monthrange(int(year), month)[1]  # Last day of the month
        start_date = '01' + '.' + str(month) + '.' + str(year)
        end_date = str(last_day) + '.' + str(month) + '.' + str(year)
        legend = ' \nRange ' + start_date + ' to ' + end_date + '\n'
        print(legend)

        # Fill in the date range
        browser.find_element_by_id('form-from-date').clear()
        time.sleep(1)
        browser.find_element_by_id('form-from-date').send_keys(start_date)
        browser.find_element_by_id('form-to-date').clear()
        time.sleep(1)
        browser.find_element_by_id('form-to-date').send_keys(end_date)

        # Collect the text of every UNB option into a list
        unb = browser.find_element_by_id('form-tso')
        unb_list = [option.text for option in unb.find_elements_by_tag_name('option')]

        # Download the data for each UNB option
        for unb_list_element in unb_list:
            time.sleep(1)
            select_unb = Select(browser.find_element_by_id('form-tso'))
            select_unb.select_by_visible_text(unb_list_element)

            # Collect the text of every Datentyp option into a list
            datentyp = browser.find_element_by_id('form-type')
            datentyp_list = [option.text for option in datentyp.find_elements_by_tag_name('option')]

            # Download the data for each Datentyp option
            for datentyp_list_element in datentyp_list:
                time.sleep(1)
                select_datentyp = Select(browser.find_element_by_id('form-type'))
                select_datentyp.select_by_visible_text(datentyp_list_element)
                legend_buttons = ' Selecting ' + unb_list_element + ' and ' + datentyp_list_element + '...'
                print(legend_buttons)

                # Click to download the data
                time.sleep(1)
                browser.find_element_by_id('submit-button').click()
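
If you only need some of the months, one possible way is to generate the date ranges up front and loop over those instead of the fixed year and month loops. This is just a sketch: `month_ranges` is a helper name I made up, and it builds the strings in the same day.month.year form as the script above, which I am assuming is what the form fields accept.

```python
import calendar


def month_ranges(start, end):
    """Yield (from_date, to_date) strings for every month from start to end.

    start and end are (year, month) tuples and are both inclusive.
    """
    year, month = start
    while (year, month) <= end:
        # monthrange() returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        yield ('01.%d.%d' % (month, year),
               '%d.%d.%d' % (last_day, month, year))
        # Move on to the next month, rolling over into the next year
        month += 1
        if month > 12:
            year, month = year + 1, 1


for from_date, to_date in month_ranges((2017, 11), (2018, 2)):
    print(from_date + ' to ' + to_date)
```

Each pair can then be passed to `send_keys` for the from and to date fields, so a range such as November 2017 to February 2018 needs four submissions per UNB and Datentyp combination instead of a full two years.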