Scraping CSV files from a URL with drop-down lists?

Time: 2020-03-04 14:37:57

Tags: python csv beautifulsoup web-crawler

I am trying to scrape monthly data (CSV files) from Weather Canada.

Normally, you have to pick the year/month/day from drop-down lists, click "GO", and then click the "Download Data" button to get the data for the selected month and year. I would like to download all of the CSV data files for every available month/year from Python (with Beautiful Soup 4).

I tried modifying some code from another question here, but without success. Please help.

from bs4 import BeautifulSoup  # Python 3.x
from urllib.request import urlopen, urlretrieve

# Removed the trailing / from the URL
urlJan2020 = '''https://climate.weather.gc.ca/climate_data/hourly_data_e.html?hlyRange=2004-09-24%7C2020-03-03&dlyRange=2018-05-14%7C2020-03-03&mlyRange=%7C&StationID=43403&Prov=NS&urlExtension=_e.html&searchType=stnProx&optLimit=yearRange&StartYear=1840&EndYear=2020&selRowPerPage=25&Line=0&txtRadius=50&optProxType=city&selCity=44%7C40%7C63%7C36%7CHalifax&selPark=&txtCentralLatDeg=&txtCentralLatMin=0&txtCentralLatSec=0&txtCentralLongDeg=&txtCentralLongMin=0&txtCentralLongSec=0&txtLatDecDeg=&txtLongDecDeg=&timeframe=1&Year=2020&Month=1&Day=1#'''
u = urlopen(urlJan2020)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements that have an href attribute, starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]

    # You don't need to join + quote as URLs in the HTML are absolute.
    # However, we need a https:// URL (in spite of what the link says: check request in your web browser's developer tools)
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
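One likely reason the scrape above finds nothing is that the "Download Data" control on the station page is a form submit rather than an `<a href>` pointing at a CSV file, so there are no matching anchors to collect. A minimal sketch of listing a page's form targets with BeautifulSoup (the HTML snippet is a made-up stand-in, not the real page markup):

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the station page's markup;
# the real page's fields and action may differ.
html = """
<form action="/climate_data/bulk_data_e.html" method="get">
  <input type="hidden" name="stationID" value="43403">
  <input type="submit" name="submit" value="Download Data">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
for form in soup.find_all("form"):
    # Collect the form's target URL and its input names/values,
    # which together describe the request the button would send.
    fields = {i.get("name"): i.get("value") for i in form.find_all("input")}
    print(form.get("action"), fields)
```

Inspecting forms this way (or watching the request in the browser's developer tools) is how you discover the direct download endpoint used in the answer below.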

1 Answer:

Answer 0 (score: 3)

import requests


def main():
    with requests.Session() as req:
        for year in range(2019, 2021):
            for month in range(1, 13):
                # The bulk_data_e.html endpoint returns the CSV directly,
                # so there is no need to scrape the drop-down page at all.
                r = req.post(
                    f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=43403&Year={year}&Month={month}&Day=1&timeframe=1&submit=Download+Data")
                # Take the file name from the Content-Disposition header:
                # split off the leading 'attachment; filename="en_climate_hourly_...'
                # prefix and strip the trailing quote.
                name = r.headers.get(
                    "Content-Disposition").split("_", 5)[-1][:-1]
                with open(name, 'w') as f:
                    f.write(r.text)
                print(f"Saved {name}")


main()
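The `split("_", 5)[-1][:-1]` slice only works if the Content-Disposition header has exactly the expected shape. A more robust sketch parses the header with the standard library's `email.message.Message` instead (the sample header below is hypothetical, modeled on what the bulk-data endpoint is reported to send):

```python
from email.message import Message


def filename_from_disposition(header: str) -> str:
    """Extract the filename parameter from a Content-Disposition header."""
    msg = Message()
    msg["Content-Disposition"] = header
    # get_filename() returns the unquoted filename parameter, or None
    # if the header carries no filename; fall back to a generic name.
    return msg.get_filename() or "download.csv"


# Hypothetical header in the shape the endpoint appears to send.
print(filename_from_disposition(
    'attachment; filename="en_climate_hourly_NS_8202251_01-2020_P1H.csv"'))
# → en_climate_hourly_NS_8202251_01-2020_P1H.csv
```

This avoids hard-coding the number of underscores in the file name, so it keeps working if the naming scheme changes.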