Scraping weather data with Beautiful Soup 4 (site rendered with JavaScript)

Time: 2018-08-21 12:50:03

Tags: javascript python html web-scraping beautifulsoup

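I am trying to scrape some weather data from wunderground.com using BeautifulSoup 4. I found a tutorial on how to do this, but it works against the HTML source. When the tutorial was made, wunderground.com served the data as plain HTML, but the site now renders it with JavaScript.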

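I was able to take the code and adapt it to my specific data-retrieval needs, but I am stuck on how to extract the data now that it comes from JavaScript rather than HTML. Can anyone help?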

Below is the code, which I adapted from kiengiv of SAS Business Analytics on YouTube.

from bs4 import BeautifulSoup
import urllib3

# assumed output path; opened in binary mode because rows are written as bytes below
file = open("weather_data.csv", "wb")

for vYear in range(2016, 2019):
  for vMonth in range(1, 13):
    for vDay in range(1, 32):
        # go to the next month, if it is a leap year and greater than the 29th or if it is not a leap year
        # and greater than the 28th
        if vYear % 4 == 0:
            if vMonth == 2 and vDay > 29:
                break
        else:
            if vMonth == 2 and vDay > 28:
                break
        # go to the next month, if it is april, june, september or november and greater than the 30th
        if vMonth in [4, 6, 9, 11] and vDay > 30:
            break

        # defining the date string to export and go to the next day using the url
        theDate = str(vYear) + "/" + str(vMonth) + "/" + str(vDay)

        # the new url created after each day
        theurl = "https://www.wunderground.com/history/daily/us/ma/cambridge/KBOS/" + theDate + "date.html"
        # extract the source data for analysis
        http = urllib3.PoolManager()
        thepage = http.request('GET', theurl)
        soup = BeautifulSoup(thepage.data, "html.parser")
        # start every exported field at "N/A" so a missing row cannot leave a name undefined
        Mean = Max = Min = GrowingDegreeDays = HeatingDegreeDays = DewPoint = "N/A"
        Precipitation = SeaLevelPressure = MaxWindSpeed = Visibility = Events = "N/A"
        for temp in soup.find_all('tr'):
            rowText = temp.text.strip().replace('\n', '')
            if rowText[:6] == 'Actual' or rowText[-6:] == "Record":
                continue  # column-header rows carry no data
            if rowText[-7:] == "RiseSet":
                break  # astronomy table reached; the summary rows are done
            cells = temp.find_all('td')
            if len(cells) < 2:
                continue  # not a label/value row
            label = cells[0].text.strip()
            value = cells[1].text.strip()
            if value in ("-", ""):
                value = "N/A"  # the site shows "-" or nothing when a value is missing
            if label == "Day Average Temp":
                Mean = value
            elif label == "High Temp":
                Max = value
            elif label == "Low Temp":
                Min = value
            elif label == "Growing Degree Days":
                GrowingDegreeDays = value
            elif label == "Heating Degree Days":
                HeatingDegreeDays = value
            elif label == "Dew Point":
                DewPoint = value
            elif label == "Precipitation":
                Precipitation = value
            elif label == "Sea Level Pressure":
                SeaLevelPressure = value
            elif label == "Max Wind Speed":
                MaxWindSpeed = value
            elif label == "Visibility":
                Visibility = value
                break  # last row of interest

        # combining the values to be written to the CSV file
        CombinedString = theDate + "," + Mean + "," + Max + "," + Min + "," + HeatingDegreeDays + "," + DewPoint + "," + "," + Precipitation + "," + SeaLevelPressure + "," + MaxWindSpeed + "," + Visibility + "," + Events + "\n"
        file.write(bytes(CombinedString, encoding="ascii", errors='ignore'))

        # printing to help with any debugging and tracking progress
        print(CombinedString)

file.close()

1 answer:

Answer 0 (score: 1)

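Unless you use Selenium, you cannot scrape this data with BeautifulSoup. Instead, I found several JSON responses that contain the data you need (I am not sure about this, since I don't know exactly which data you want).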
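To illustrate why BeautifulSoup alone comes up empty on a JavaScript-rendered page, here is a minimal sketch. The `SHELL_HTML` string and the `count_table_rows` helper are made up for illustration; the real markup served by the site will differ, but the principle is the same: BeautifulSoup parses the markup it is given and never runs scripts.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what a server sends before any script runs:
# an empty application root plus a script tag, with no table in the markup.
SHELL_HTML = """
<html>
  <body>
    <app-root></app-root>
    <script src="main.js"></script>
  </body>
</html>
"""


def count_table_rows(html):
    """Return how many <tr> elements exist in the markup as delivered."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("tr"))


# BeautifulSoup only parses the markup; it never executes main.js,
# so rows that a browser would build at runtime are not present.
print(count_table_rows(SHELL_HTML))
```

A loop over `soup.find_all('tr')` on such a page simply never executes its body, which is why every field stays at its default.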

You can find all the JSON requests in the developer console (F12).

[Screenshot: developer console listing the JSON requests]

In particular, I found this one (highlighted in the screenshot): https://api.weather.com/v1/geocode/42.36416626/-71.00499725/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=20160810&endDate=20160810&units=e

You can iterate over it by changing startDate and endDate. You can also change the location by editing the coordinates after "geocode".
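A possible way to generate one URL per day is sketched below. The helper name `daily_urls` and the `YOUR_API_KEY` placeholder are assumptions for illustration; the endpoint pattern and the default coordinates are taken from the URL above.

```python
from datetime import date, timedelta

# Endpoint pattern copied from the URL above; the default lat/lon are the ones in that URL.
BASE = "https://api.weather.com/v1/geocode/{lat}/{lon}/observations/historical.json"


def daily_urls(start, end, api_key, lat="42.36416626", lon="-71.00499725"):
    """Yield one request URL per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")  # e.g. 20160810
        yield (BASE.format(lat=lat, lon=lon)
               + f"?apiKey={api_key}&startDate={stamp}&endDate={stamp}&units=e")
        day += timedelta(days=1)


# YOUR_API_KEY is a placeholder, not a working key.
for url in daily_urls(date(2016, 2, 27), date(2016, 3, 1), "YOUR_API_KEY"):
    print(url)
```

Stepping a `date` with `timedelta` handles month lengths and leap years, so no manual checks for day 29, 30 or 31 are needed.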

To fetch the JSON, you can use urllib3 and the json library.
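As a rough sketch of that approach, the helpers below fetch a URL with urllib3 and parse the body with `json`. The names `fetch_json` and `parse_json_body` are illustrative, and the layout of the returned payload is not documented here, so inspect it before relying on particular keys.

```python
import json

import urllib3


def parse_json_body(raw):
    """Decode a response body (bytes) and parse it as JSON."""
    return json.loads(raw.decode("utf-8"))


def fetch_json(url, http=None):
    """GET the URL with urllib3 and return the parsed JSON payload."""
    http = http or urllib3.PoolManager()
    response = http.request("GET", url)
    if response.status != 200:
        raise RuntimeError(f"request failed with HTTP status {response.status}")
    return parse_json_body(response.data)
```

Calling `fetch_json(url)` with one of the URLs above should return a Python object (most likely a dictionary); printing its keys is a reasonable first step to locate the fields you want. Passing a shared `urllib3.PoolManager()` as `http` lets repeated calls reuse connections.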
