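How do I scrape a website that has multiple "select" fields?

Time: 2019-08-09 12:20:49

Tags: python html web-scraping

I want to scrape the weather for all of December 2018 from https://www.timeanddate.com/weather/usa/new-york/historic?month=12&year=2018

This page has two select fields. I am completely new to HTML and POST requests. I have read the answers to "Filling out a select tag with requests Python". As I understand it, I need to include every field's id-value pair. My code is below.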

import requests
r = requests.post(
    "https://www.timeanddate.com/weather/usa/new-york/historic?month=12&year=2018",
    data={
        "month": r'2018-12',
        "wt-his-select": r"20181205",
    })

Given the id-value pairs I entered above, I expected to get the weather records for December 5, 2018, but I always get the weather for December 1.

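1 answer:

Answer 0: (score: 2)

The data is embedded in the page as JSON, so we can use BeautifulSoup to pull out the <script> tag that holds it, then load the JSON into a dictionary and convert it to a DataFrame: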

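import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.timeanddate.com/weather/usa/new-york/historic?month=12&year=2018")

soup = BeautifulSoup(r.text, 'html.parser')

# Find the <script> tag that assigns the weather JSON to `var data`
jsonData = None
for script in soup.find_all('script'):
    if 'var data=' in script.text:
        # Keep only the JSON object between 'var data=' and ';window.'
        jsonStr = script.text.split('var data=')[-1].split(';window.')[0]
        jsonData = json.loads(jsonStr)
        break

if jsonData is None:
    raise ValueError("no <script> tag containing 'var data=' was found")

# 'detail' is a list of dicts, one per observation. Build the DataFrame in one
# step (DataFrame.append was removed in pandas 2.0) and sort the columns by name.
weather = jsonData['detail']
results = pd.DataFrame(weather).sort_index(axis=1)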

Output:

print (results)
      baro          date                        desc  ...     ts wd wind
0    30.14  1.543622e+12                      Clear.  ...  12 am  0    0
1    30.21  1.543644e+12                      Sunny.  ...   6 am  0    0
2    30.17  1.543666e+12                      Sunny.  ...  12 pm  0    0
3    30.13  1.543687e+12       Light rain. Overcast.  ...   6 pm  0    0
4    29.96  1.543709e+12            Light rain. Fog.  ...  12 am  0    0
5    29.80  1.543730e+12            Light rain. Fog.  ...   6 am  0    0
6    29.65  1.543752e+12                        Fog.  ...  12 pm  0    0
7    29.62  1.543774e+12                        Fog.  ...   6 pm  0    0
8    29.58  1.543795e+12             Passing clouds.  ...  12 am  0    0
9    29.63  1.543817e+12                      Sunny.  ...   6 am  0    0
10   29.66  1.543838e+12                   Overcast.  ...  12 pm  0    0
11   29.72  1.543860e+12                      Clear.  ...   6 pm  0    0
12   29.80  1.543882e+12                   Overcast.  ...  12 am  0    0
13   29.93  1.543903e+12                   Overcast.  ...   6 am  0    0
14   29.96  1.543925e+12                      Sunny.  ...  12 pm  0    0
15   30.06  1.543946e+12                      Clear.  ...   6 pm  0    0
16   30.08  1.543968e+12                      Clear.  ...  12 am  0    0
17   30.09  1.543990e+12                      Sunny.  ...   6 am  0    0
18   30.03  1.544011e+12                      Sunny.  ...  12 pm  0    0
19   30.09  1.544033e+12                      Clear.  ...   6 pm  0    0
20   30.14  1.544054e+12                      Clear.  ...  12 am  0    0
21   30.19  1.544076e+12                      Sunny.  ...   6 am  0    0
22   30.15  1.544098e+12                      Sunny.  ...  12 pm  0    0
23   30.14  1.544119e+12              Mostly cloudy.  ...   6 pm  0    0
24   30.18  1.544141e+12             Passing clouds.  ...  12 am  0    0
25   30.32  1.544162e+12                      Sunny.  ...   6 am  0    0
26   30.34  1.544184e+12                      Sunny.  ...  12 pm  0    0
27   30.44  1.544206e+12                      Clear.  ...   6 pm  0    0
28   30.45  1.544227e+12                      Clear.  ...  12 am  0    0
29   30.48  1.544249e+12             Passing clouds.  ...   6 am  0    0
..     ...           ...                         ...  ...    ... ..  ...
94   30.03  1.545653e+12               Partly sunny.  ...  12 pm  0    0
95   30.09  1.545674e+12                      Clear.  ...   6 pm  0    0
96   30.17  1.545696e+12                      Clear.  ...  12 am  0    0
97   30.26  1.545718e+12                   Overcast.  ...   6 am  0    0
98   30.27  1.545739e+12                      Sunny.  ...  12 pm  0    0
99   30.34  1.545761e+12                      Clear.  ...   6 pm  0    0
100  30.40  1.545782e+12                      Clear.  ...  12 am  0    0
101  30.47  1.545804e+12                   Overcast.  ...   6 am  0    0
102  30.43  1.545826e+12               Partly sunny.  ...  12 pm  0    0
103  30.47  1.545847e+12                      Clear.  ...   6 pm  0    0
104  30.52  1.545869e+12                   Overcast.  ...  12 am  0    0
105  30.60  1.545890e+12                      Sunny.  ...   6 am  0    0
106  30.56  1.545912e+12                      Sunny.  ...  12 pm  0    0
107  30.51  1.545934e+12                   Overcast.  ...   6 pm  0    0
108  30.34  1.545955e+12            Light rain. Fog.  ...  12 am  0    0
109  30.14  1.545977e+12                  Rain. Fog.  ...   6 am  0    0
110  29.91  1.545998e+12            Light rain. Fog.  ...  12 pm  0    0
111  29.83  1.546020e+12                        Fog.  ...   6 pm  0    0
112  29.85  1.546042e+12              Mostly cloudy.  ...  12 am  0    0
113  29.97  1.546063e+12           Scattered clouds.  ...   6 am  0    0
114  30.07  1.546085e+12               Partly sunny.  ...  12 pm  0    0
115  30.16  1.546106e+12                   Overcast.  ...   6 pm  0    0
116  30.17  1.546128e+12                      Clear.  ...  12 am  0    0
117  30.23  1.546150e+12       Light snow. Overcast.  ...   6 am  0    0
118  30.21  1.546171e+12                   Overcast.  ...  12 pm  0    0
119  30.27  1.546193e+12              Mostly cloudy.  ...   6 pm  0    0
120  30.30  1.546214e+12                      Clear.  ...  12 am  0    0
121  30.34  1.546236e+12                   Overcast.  ...   6 am  0    0
122  30.23  1.546258e+12  Light rain. Mostly cloudy.  ...  12 pm  0    0
123  30.00  1.546279e+12            Heavy rain. Fog.  ...   6 pm  0    0

[124 rows x 14 columns]
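
The date values in the output above look like milliseconds since the Unix epoch, so the same detail list can be narrowed to one day, such as December 5. Here is a minimal sketch under that assumption. The sample_script string is a made-up stand-in modeled on the printed rows (the real page content may differ), and extract_detail and records_for_day are hypothetical helper names, not part of any library.

```python
import json
from datetime import date, datetime, timezone

# Hypothetical stand-in for the <script> text; values are modeled on the
# printed output, not copied from the live page.
sample_script = (
    'var data={"detail":['
    '{"date":1543622400000,"ts":"12 am","desc":"Clear.","baro":30.14},'
    '{"date":1543968000000,"ts":"12 am","desc":"Clear.","baro":30.08}'
    ']};window.month=12;'
)


def extract_detail(script_text):
    """Cut the JSON object out of the script text, using the same split as above."""
    json_str = script_text.split('var data=')[-1].split(';window.')[0]
    return json.loads(json_str)['detail']


def records_for_day(records, day):
    """Keep the records whose millisecond timestamp, read as UTC, falls on `day`."""
    selected = []
    for rec in records:
        stamp = datetime.fromtimestamp(rec['date'] / 1000, tz=timezone.utc)
        if stamp.date() == day:
            selected.append(rec)
    return selected


detail = extract_detail(sample_script)
dec5 = records_for_day(detail, date(2018, 12, 5))
print(dec5)
```

With the real page, the list in jsonData['detail'] would take the place of detail, and the filtered list can be passed to pd.DataFrame as before.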

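Additional:

You can also fetch the hourly observations for a single date by querying the site's AJAX endpoint directly. Just change the parameters in payload to get a specific date: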

from io import StringIO

import pandas as pd
import requests

url = 'https://www.timeanddate.com/scripts/cityajax.php'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

year = 2018
month = 12
day = 1

# 'hd' selects the day as YYYYMMDD; 'month' and 'year' mirror the two select fields
payload = {
    'n': 'usa/new-york',
    'mode': 'historic',
    'hd': '%d%02d%02d' % (year, month, day),
    'month': '%02d' % month,
    'year': '%d' % year,
}

data = requests.get(url, headers=headers, params=payload).text
# Wrap the response in <table> tags so read_html can parse it; [:-1] drops the last row
table = pd.read_html(StringIO('<table>' + data + '</table>'))[0][:-1]
# Drop any column that contains missing values
table = table.dropna(axis=1)

Output:

print (table.to_string())
    Unnamed: 0_level_0 Conditions                                        Comfort          Unnamed: 7_level_0 Unnamed: 8_level_0
                  Time       Temp                     Weather Unnamed: 5_level_1 Humidity          Barometer         Visibility
0   12:51 amSat, Dec 1      40 °F                   Overcast.                  ↑      80%          30.11 "Hg              10 mi
1              1:51 am      40 °F             Passing clouds.                  ↑      77%          30.12 "Hg              10 mi
2              2:51 am      39 °F                      Clear.                  ↑      79%          30.12 "Hg              10 mi
3              3:51 am      39 °F                      Clear.                  ↑      79%          30.13 "Hg              10 mi
4              4:51 am      38 °F             Passing clouds.                  ↑      79%          30.16 "Hg              10 mi
5              5:51 am      37 °F                      Clear.                  ↑      82%          30.17 "Hg               9 mi
6              6:51 am      37 °F                      Clear.                  ↑      86%          30.19 "Hg              10 mi
7              7:51 am      38 °F                      Sunny.                  ↑      79%          30.21 "Hg              10 mi
8              8:51 am      40 °F                      Sunny.                  ↑      73%          30.21 "Hg              10 mi
9              9:51 am      42 °F                      Sunny.                  ↑      68%          30.22 "Hg              10 mi
10            10:51 am      44 °F           Scattered clouds.                  ↑      63%          30.21 "Hg              10 mi
11            11:51 am      44 °F                      Sunny.                  ↑      60%          30.21 "Hg              10 mi
12            12:51 pm      45 °F                      Sunny.                  ↑      58%          30.18 "Hg              10 mi
13             1:51 pm      46 °F             Passing clouds.                  ↑      56%          30.17 "Hg              10 mi
14             2:51 pm      45 °F                      Sunny.                  ↑      58%          30.17 "Hg              10 mi
15             3:51 pm      45 °F                      Sunny.                  ↑      56%          30.17 "Hg              10 mi
16             4:51 pm      44 °F                      Clear.                  ↑      63%          30.17 "Hg              10 mi
17             5:51 pm      43 °F             Passing clouds.                  ↑      62%          30.16 "Hg              10 mi
18             6:51 pm      42 °F  Light rain. Mostly cloudy.                  ↑      82%          30.16 "Hg               7 mi
19             7:51 pm      42 °F       Light rain. Overcast.                  ↑      79%          30.15 "Hg               7 mi
20             8:51 pm      41 °F  Light rain. Mostly cloudy.                  ↑      86%          30.15 "Hg              10 mi
21             9:51 pm      42 °F              Mostly cloudy.                  ↑      82%          30.14 "Hg              10 mi
22            10:32 pm      42 °F       Light rain. Overcast.                  ↑      85%          30.15 "Hg               8 mi
23            10:51 pm      42 °F       Light rain. Overcast.                  ↑      89%          30.11 "Hg               8 mi
24            11:51 pm      42 °F                        Fog.                  ↑      92%          30.07 "Hg               4 mi
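
Since the question asks for the whole month, the payload above can be generated once per day. This is only a sketch of that idea: build_payload and month_payloads are hypothetical helper names, and the parameter names are taken from the payload in this answer, not from any documented API.

```python
import calendar


def build_payload(year, month, day, place='usa/new-york'):
    """Return the query parameters used above for one day."""
    return {
        'n': place,
        'mode': 'historic',
        'hd': '%d%02d%02d' % (year, month, day),  # day as YYYYMMDD
        'month': '%02d' % month,
        'year': '%d' % year,
    }


def month_payloads(year, month):
    """Build one payload per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [build_payload(year, month, d) for d in range(1, days_in_month + 1)]


payloads = month_payloads(2018, 12)
print(len(payloads))      # 31
print(payloads[4]['hd'])  # 20181205
```

Each payload can then be passed to requests.get(url, headers=headers, params=payload) as above, and the resulting tables combined with pd.concat. Pausing between requests is a reasonable courtesy to the site.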