Web scraping with Selenium + Python

Date: 2020-10-04 15:41:08

Tags: python selenium-webdriver web-scraping

The goal is to scrape historical weather data from http://www.weather.gov.sg/climate-historical-daily/

To get the data for a specific month, you first have to select the city name, the month, and the year.

There are 63 cities, 12 months, and 41 years.

city = [el.text for el in driver.find_elements_by_xpath("/html/body/div/div/div[3]/div[1]/div[1]/div/div/ul/li/a")]
len(city)
Out[182]: 63

month = [el.text for el in driver.find_elements_by_xpath('//*[@id="monthDiv"]/ul/li')]
year = [el.text for el in driver.find_elements_by_xpath('//*[@id="yearDiv"]/ul/li')]

Clicking the Display button:

button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "display")))
button.click()

How do I select options from these Bootstrap dropdowns and copy the weather data out of the resulting table?

<table class="table table-calendar"><colgroup>
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
              </colgroup><thead><tr><th>Date</th><th>Daily Rainfall Total (mm)</th><th>Highest &nbsp;30-min Rainfall (mm)</th><th>Highest &nbsp;60-min Rainfall (mm)</th><th>Highest 120-min Rainfall (mm)</th><th>Mean Temperature (°C)</th><th>Maximum Temperature (°C)</th><th>Minimum Temperature (°C)</th><th>Mean Wind Speed (km/h)</th><th>Max Wind Speed (km/h)</th></tr></thead><tbody><tr><td>1 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.5</td><td align="center">30.4</td><td align="center">26.0</td><td align="center">12.3</td><td align="center">40.7</td></tr><tr><td>2 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.9</td><td align="center">31.7</td><td align="center">26.9</td><td align="center">10.3</td><td align="center">31.5</td></tr><tr><td>3 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.2</td><td align="center">31.7</td><td align="center">27.2</td><td align="center">12.0</td><td align="center">31.5</td></tr><tr><td>4 Aug</td><td align="center">4.8</td><td align="center">4.6</td><td align="center">4.8</td><td align="center">4.8</td><td align="center">27.9</td><td align="center">30.2</td><td align="center">24.1</td><td align="center">8.8</td><td align="center">44.4</td></tr><tr><td>5 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.8</td><td align="center">31.8</td><td align="center">26.7</td><td align="center">8.6</td><td align="center">25.9</td></tr><tr><td>6 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.2</td><td align="center">31.4</td><td align="center">27.6</td><td align="center">8.1</td><td 
align="center">27.8</td></tr><tr><td>7 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.4</td><td align="center">32.7</td><td align="center">27.3</td><td align="center">11.4</td><td align="center">29.6</td></tr><tr><td>8 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.7</td><td align="center">32.9</td><td align="center">27.6</td><td align="center">11.0</td><td align="center">27.8</td></tr><tr><td>9 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.6</td><td align="center">32.8</td><td align="center">27.7</td><td align="center">12.3</td><td align="center">31.5</td></tr><tr><td>10 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.7</td><td align="center">33.0</td><td align="center">27.8</td><td align="center">12.9</td><td align="center">33.3</td></tr><tr><td>11 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.5</td><td align="center">32.7</td><td align="center">28.2</td><td align="center">11.0</td><td align="center">31.5</td></tr><tr><td>12 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">27.9</td><td align="center">30.0</td><td align="center">26.8</td><td align="center">8.7</td><td align="center">31.5</td></tr><tr><td>13 Aug</td><td align="center">34.6</td><td align="center">22.2</td><td align="center">30.8</td><td align="center">33.4</td><td align="center">28.3</td><td align="center">32.2</td><td align="center">22.5</td><td align="center">6.4</td><td align="center">40.7</td></tr><tr><td>14 Aug</td><td align="center">13.8</td><td 
align="center">7.2</td><td align="center">12.2</td><td align="center">12.6</td><td align="center">25.9</td><td align="center">28.5</td><td align="center">23.4</td><td align="center">5.1</td><td align="center">35.2</td></tr><tr><td>15 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.0</td><td align="center">31.5</td><td align="center">24.6</td><td align="center">6.5</td><td align="center">25.9</td></tr><tr><td>16 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.0</td><td align="center">30.0</td><td align="center">26.4</td><td align="center">8.0</td><td align="center">27.8</td></tr><tr><td>17 Aug</td><td align="center">5.2</td><td align="center">4.0</td><td align="center">4.6</td><td align="center">4.6</td><td align="center">27.4</td><td align="center">31.4</td><td align="center">24.3</td><td align="center">6.2</td><td align="center">29.6</td></tr><tr><td>18 Aug</td><td align="center">2.0</td><td align="center">1.0</td><td align="center">1.0</td><td align="center">2.0</td><td align="center">27.1</td><td align="center">30.1</td><td align="center">25.3</td><td align="center">6.4</td><td align="center">48.2</td></tr><tr><td>19 Aug</td><td align="center">1.8</td><td align="center">1.4</td><td align="center">1.6</td><td align="center">1.8</td><td align="center">28.0</td><td align="center">31.3</td><td align="center">25.4</td><td align="center">5.7</td><td align="center">25.9</td></tr><tr><td>20 Aug</td><td align="center">2.2</td><td align="center">2.0</td><td align="center">2.0</td><td align="center">2.0</td><td align="center">28.1</td><td align="center">31.9</td><td align="center">25.5</td><td align="center">10.6</td><td align="center">37.0</td></tr><tr><td>21 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td 
align="center">29.6</td><td align="center">33.0</td><td align="center">27.7</td><td align="center">15.2</td><td align="center">31.5</td></tr><tr><td>22 Aug</td><td align="center">2.0</td><td align="center">1.4</td><td align="center">1.6</td><td align="center">1.6</td><td align="center">27.9</td><td align="center">32.1</td><td align="center">25.3</td><td align="center">9.3</td><td align="center">38.9</td></tr><tr><td>23 Aug</td><td align="center">24.4</td><td align="center">8.2</td><td align="center">11.2</td><td align="center">15.2</td><td align="center">25.6</td><td align="center">27.0</td><td align="center">23.0</td><td align="center">5.1</td><td align="center">48.2</td></tr><tr><td>24 Aug</td><td align="center">0.0</td><td align="center">0.2</td><td align="center">0.2</td><td align="center">0.2</td><td align="center">28.1</td><td align="center">32.4</td><td align="center">24.5</td><td align="center">9.0</td><td align="center">33.3</td></tr><tr><td>25 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">27.9</td><td align="center">31.9</td><td align="center">25.7</td><td align="center">8.6</td><td align="center">44.4</td></tr><tr><td>26 Aug</td><td align="center">4.6</td><td align="center">4.4</td><td align="center">4.6</td><td align="center">4.6</td><td align="center">27.0</td><td align="center">31.3</td><td align="center">24.0</td><td align="center">9.6</td><td align="center">51.9</td></tr><tr><td>27 Aug</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">27.8</td><td align="center">30.4</td><td align="center">25.6</td><td align="center">8.4</td><td align="center">27.8</td></tr><tr><td>28 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.9</td><td align="center">32.3</td><td align="center">26.2</td><td 
align="center">9.6</td><td align="center">33.3</td></tr><tr><td>29 Aug</td><td align="center">6.6</td><td align="center">2.8</td><td align="center">3.4</td><td align="center">4.8</td><td align="center">27.2</td><td align="center">30.8</td><td align="center">25.1</td><td align="center">8.0</td><td align="center">-</td></tr><tr><td>30 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.6</td><td align="center">32.1</td><td align="center">26.4</td><td align="center">11.2</td><td align="center">35.2</td></tr><tr><td>31 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.0</td><td align="center">32.2</td><td align="center">27.2</td><td align="center">11.7</td><td align="center">29.6</td></tr></tbody></table>

2 answers:

Answer 0 (score: 1)

Here is a different approach.

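Why not fetch the .csv files for every city and every date directly? The file links are static and contain the city code from the dropdown menu. You can parse the page, extract the codes, put each one into the URL, and download the .csv file. Oh, and you also have to loop over the years and months.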

By the way, not every city has data for the past 40 years.

import re
import requests
from bs4 import BeautifulSoup

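headers = {
    "User-Agent": "PostmanRuntime/7.26.5",
    "Accept": "*/*",
    # "br" is left out so the response decodes without the optional brotli package
    "Accept-Encoding": "gzip, deflate",
}

# Send the headers with the request (they were defined but never used before)
response = requests.get("http://www.weather.gov.sg/climate-historical-daily/", headers=headers)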

soup = BeautifulSoup(response.text, "html.parser").find("ul", {"class": "dropdown-menu long-dropdown"}).find_all("li")
cities_and_codes = {
    t.find("a").getText(strip=True): re.search(r'(S\d+)', t.find("a")['onclick']).group(1)
    for t in soup
}


def get_dates():
    yield from (
        [(y, f"0{m}" if m < 10 else m) for y in range(1980, 2021) for m in range(1, 13)]
    )


files_url = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"
for city, code in cities_and_codes.items():
    for date in get_dates():
        year, month = date
        csv_url = f"{files_url}{code}_{year}{month}.csv"
        response = requests.get(csv_url)
        if response.status_code == 200:
            print(f"Fetching data for {city} for {month}/{year}")
            print(f"Found data. Fetching {csv_url}")
            with open(f"{city.replace(' ', '_')}_{csv_url.split('/')[-1]}", "wb") as f:
                f.write(response.content)
        else:
            print(f"No data available for {city} for {month}/{year}...")
            continue

You can adapt this to fetch files only for the cities you need, or for all of them, although the latter may take a while.
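To limit the download to a few cities, one option is to build the list of URLs first and fetch them afterwards. The sketch below is only an illustration: `build_csv_urls` and `wanted` are hypothetical names, the code `S00` is a placeholder rather than a real code from the site, and the URL pattern is the one used in the code above.

```python
from itertools import product

FILES_URL = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"


def build_csv_urls(codes, years, months=range(1, 13)):
    """Return (city, url) pairs for every requested city, year and month."""
    return [
        (city, f"{FILES_URL}{code}_{year}{month:02d}.csv")
        for (city, code), year, month in product(codes.items(), years, months)
    ]


# Placeholder code for illustration; take the real ones from cities_and_codes.
wanted = {"Example City": "S00"}
urls = build_csv_urls(wanted, years=range(2019, 2021))
print(len(urls))    # number of (city, url) pairs
print(urls[0][1])   # first URL in the list
```

Each URL can then be passed to `requests.get` as in the loop above, skipping any that do not return status 200.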

Answer 1 (score: 0)

City, month, and year are not real dropdown lists. They are buttons, so they can be handled with simple click actions.

Try the code below to select a city, and use the same approach for the month and the year.

from selenium.webdriver.common.by import By

city_button = driver.find_element(By.ID, 'cityname')  # Locate the city button
city_button.click()  # Open the city list
bukit_timah = driver.find_element(By.XPATH, "//a[text()='Bukit Timah']")  # Locate 'Bukit Timah' in the list
bukit_timah.click()  # Select 'Bukit Timah'

Please refer to the screenshot to understand the DOM.
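Once the selections are made and the Display button is clicked, the table from the question still has to be copied out. The following is a minimal sketch using only the standard library, not part of the original answer; `TableParser` and `table_to_rows` are hypothetical names, and the sample HTML is a shortened excerpt of the table shown in the question.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace, including non-breaking spaces from &nbsp;
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    '<table class="table table-calendar"><thead><tr><th>Date</th>'
    '<th>Daily Rainfall Total (mm)</th><th>Highest &nbsp;30-min Rainfall (mm)</th></tr></thead>'
    '<tbody><tr><td>1 Aug</td><td align="center">0.0</td><td align="center">0.0</td></tr>'
    '<tr><td>4 Aug</td><td align="center">4.8</td><td align="center">4.6</td></tr></tbody></table>'
)
header, *data = table_to_rows(sample)
print(header)
print(data)
```

In a live session the HTML would presumably come from the page itself, for example from the `outerHTML` attribute of the element matching `table.table-calendar`; that selector is an assumption based on the class shown in the question. If pandas is available, `pandas.read_html` is a common alternative for the same step.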