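I'm trying to scrape weather data from the web to learn the basics of scraping, and I've run into a problem with the site's HTML structure.
I've worked through the nested structure of the HTML page and can display the first value by printing d["precip"], but I don't know why the following loop iterations fail to read it. print(i) shows that the loop itself is still iterating correctly.
Result of the first iteration: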
{'date': '19:30', 'hourly-date': 'Thu',
'hidden-cell-sm description': 'Mostly Cloudy',
'temp': '26°', 'feels': '30°', 'precip': '15%',
'humidity': '84%', 'wind': 'SSE 12 km/h '}
After the first iteration:
{'date': 'None', 'hourly-date': 'None',
'hidden-cell-sm description': 'None',
'temp': 'None', 'feels': 'None', 'precip': 'None',
'humidity': 'None', 'wind': 'None'}
HTML side: the values I want to scrape are "10" and "%". I get them in the first iteration, but I don't know why they become "None" in the second.
<td class="precip" headers="precip" data-track-string="ls_hourly_ls_hourly_toggle" classname="precip">
<div><span class="icon icon-font iconset-weather-data icon-drop-1" classname="icon icon-font iconset-weather-data icon-drop-1"></span>
<span class="">
<span>
10
<span class="Percentage__percentSymbol__2Q_AR">
%
</span>
</span>
</span>
</div>
</td>
Python code:
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
#all = soup.find("div", {"class": "locations-title hourly-page-title"}).find("h1").text
table = soup.find_all("table", {"class": "twc-table"})
for items in table:
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        try:
            d["date"] = items.find_all("span", {"class": "dsx-date"})[i].text
            d["hourly-date"] = items.find_all("div", {"class": "hourly-date"})[i].text
            d["hidden-cell-sm description"] = items.find_all("td", {"class": "hidden-cell-sm description"})[i].text
            d["temp"] = items.find_all("td", {"class": "temp"})[i].text
            d["feels"] = items.find_all("td", {"class": "feels"})[i].text
            #issue starts from here
            inclass = items.find_all("td", {"class": "precip"})[i]
            realtext = inclass.find_all("div", "")[i]
            d["precip"] = realtext.find_all("span", {"class": ""})[i].text
            #issue end
            d["humidity"] = items.find_all("td", {"class": "humidity"})[i].text
            d["wind"] = items.find_all("td", {"class": "wind"})[i].text
        except:
            d["date"] = "None"
            d["hourly-date"] = "None"
            d["hidden-cell-sm description"] = "None"
            d["temp"] = "None"
            d["precip"] = "None"
            d["feels"] = "None"
            d["precip"] = "None"
            d["humidity"] = "None"
            d["wind"] = "None"
        total.append(d)
df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])
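I expected to scrape all of the data, but as described above, the precip value is missing while the other fields are still there. For reference, this is the result: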
Date weekdays Description temp feels percip humidity wind
0 19:30 Thu Mostly Cloudy 26° 30° NaN 84% SSE 12 km/h
1 20:00 Thu Mostly Cloudy 26° 30° NaN 86% SSE 11 km/h
2 21:00 Thu Mostly Cloudy 26° 30° NaN 86% SSE 12 km/h
3 22:00 Thu Mostly Cloudy 26° 29° NaN 86% SSE 12 km/h
4 23:00 Thu Cloudy 26° 29° NaN 87% SSE 12 km/h
5 00:00 Fri Cloudy 26° 29° NaN 87% S 12 km/h
6 01:00 Fri Cloudy 26° 26° NaN 88% S 12 km/h
7 02:00 Fri Cloudy 26° 26° NaN 87% S 12 km/h
8 03:00 Fri Cloudy 29° 35° NaN 87% S 12 km/h
9 04:00 Fri Mostly Cloudy 29° 35° NaN 87% S 12 km/h
10 05:00 Fri Mostly Cloudy 28° 35° NaN 87% SSW 11 km/h
11 06:00 Fri Mostly Cloudy 28° 34° NaN 88% SSW 11 km/h
12 07:00 Fri Mostly Cloudy 29° 35° NaN 87% SSW 10 km/h
13 08:00 Fri Mostly Cloudy 29° 36° NaN 84% SSW 12 km/h
14 09:00 Fri Mostly Cloudy 29° 37° NaN 82% SSW 13 km/h
15 10:00 Fri Partly Cloudy 30° 37° NaN 81% SSW 14 km/h
I'm a beginner and want to learn, so please tell me how I can improve the structure of my code. Thanks a lot.
Answer 0 (score: 1)
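Your precip lookup finds nothing, which is why you get that result. To fix it, you can select the element with the class Percentage__percentSymbol__2Q_AR and then use its previous_sibling to extract the value you want. The code below covers only the part you were having trouble with.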
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
soup = BeautifulSoup(page.text, "html.parser")
total = []
for tr in soup.find("table",class_="twc-table").tbody.find_all("tr"):
    d = {}
    d["date"] = tr.find("span", class_="dsx-date").text
    d["precip"] = tr.find("span", class_="Percentage__percentSymbol__2Q_AR").previous_sibling
    total.append(d)
df = pandas.DataFrame(total,columns=['date','precip'])
print(df)
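To make the failure concrete, here is a minimal offline sketch that runs against a trimmed copy of the precip cell from the question rather than the live page. The HTML constant and the surrounding table and tr tags are assumptions added for illustration. It suggests that each cell holds only one div, so indexing the result of find_all with the row counter i presumably raises IndexError from the second row on, and the bare except in the question then turns the whole row into "None". It also shows the previous_sibling lookup on the same markup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the precip cell from the question. The <table>/<tr>
# wrapper is an assumption so that the snippet parses on its own.
HTML = """
<table><tr>
<td class="precip" headers="precip">
<div><span class="icon icon-font"></span>
<span class="">
<span>
10
<span class="Percentage__percentSymbol__2Q_AR">
%
</span>
</span>
</span>
</div>
</td>
</tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
td = soup.find("td", class_="precip")

# Only one <div> lives inside the cell, so any index above 0 fails.
divs = td.find_all("div")
try:
    divs[1]
    second_lookup = "found"
except IndexError:
    second_lookup = "IndexError"

# The number is the text node sitting right before the "%" span.
symbol = td.find("span", class_="Percentage__percentSymbol__2Q_AR")
precip_number = str(symbol.previous_sibling).strip()

# Alternative: take all text of the cell and strip the whitespace.
precip_text = td.get_text(strip=True)

print(len(divs), second_lookup, precip_number, precip_text)
```

Searching inside each row or cell, instead of indexing table-wide lists with i, avoids this kind of off-by-position error, and catching a specific exception instead of using a bare except would have surfaced the IndexError right away.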
Answer 1 (score: 0)
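The find_all function always returns a list, and strip() removes whitespace from the beginning and end of a string. Also, percip in df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind']) is the wrong label, because the key you defined in the dictionary is d["precip"].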
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
tables = soup.find_all("table", {"class": "twc-table"})
for table in tables:
    for tr in table.find("tbody").find_all("tr"):
        # Start every row with placeholder values.
        d = {"date":"None","hourly-date":"None","hidden-cell-sm description":"None","temp":"None",
             "feels":"None","precip":"None","humidity":"None","wind":"None"}
        for td in tr.find_all("td"):
            try:
                _class = td.get("class")
                # Skip cells that carry a "cell-hide" class.
                if len(_class) > 1:
                    temp = 0
                    for cc in _class:
                        if "cell-hide" in cc:
                            temp+=1
                            break
                    if temp > 0:
                        continue
                if len(_class)>1 and "description" in _class[1]:
                    d["hidden-cell-sm description"] = td.text.strip()
                elif "temp" in _class[0]:
                    d["temp"] = td.text.strip()
                elif "feels" in _class[0]:
                    d["feels"] = td.text.strip()
                elif "precip" in _class[0]:
                    d["precip"] = td.text.strip()
                elif "humidity" in _class[0]:
                    d["humidity"] = td.text.strip()
                elif "wind" in _class[0]:
                    d["wind"] = td.text.strip()
                else:
                    d["date"] = td.find("span", {"class": "dsx-date"}).text.strip()
                    d["hourly-date"] = td.find("div", {"class": "hourly-date"}).text.strip()
            except:
                pass
        total.append(d)
df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'precip', 'humidity', 'wind'])
print(df)
Output:
Date weekdays Description temp feels precip humidity wind
0 20:30 Thu Mostly Cloudy 26° 30° 10% 85% SSE 12 km/h
1 21:00 Thu Mostly Cloudy 26° 30° 5% 85% SSE 12 km/h
2 22:00 Thu Mostly Cloudy 26° 30° 0% 85% SSE 12 km/h
3 23:00 Thu Mostly Cloudy 26° 29° 0% 87% SSE 12 km/h
4 00:00 Fri Cloudy 26° 29° 0% 87% S 12 km/h
5 01:00 Fri Cloudy 26° 26° 5% 88% S 12 km/h
6 02:00 Fri Cloudy 26° 26° 15% 88% S 12 km/h
7 03:00 Fri Mostly Cloudy 25° 25° 20% 88% S 10 km/h
8 04:00 Fri Mostly Cloudy 25° 29° 25% 88% S 10 km/h
9 05:00 Fri Mostly Cloudy 25° 28° 25% 88% SSW 10 km/h
10 06:00 Fri Mostly Cloudy 25° 28° 25% 89% SSW 10 km/h
11 07:00 Fri Mostly Cloudy 26° 29° 25% 88% SSW 10 km/h
12 08:00 Fri Mostly Cloudy 26° 29° 25% 84% SSW 11 km/h
13 09:00 Fri Partly Cloudy 27° 30° 25% 82% SSW 12 km/h
14 10:00 Fri Partly Cloudy 27° 30° 25% 81% SSW 14 km/h
15 11:00 Fri Partly Cloudy 27° 31° 15% 78% SSW 15 km/h
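The NaN column in the question's output can be reproduced without any scraping. This is a small sketch with made-up sample rows (the rows list is an assumption, not scraped data): reindex creates a new, empty column for any label it does not find, so the misspelled percip comes back as NaN while precip keeps its values.

```python
import pandas

# Hypothetical sample rows standing in for the scraped dictionaries.
rows = [
    {"date": "19:30", "precip": "15%"},
    {"date": "20:00", "precip": "10%"},
]
df = pandas.DataFrame(rows)

# "percip" does not exist in df, so reindex fills that column with NaN.
misspelled = df.reindex(columns=["date", "percip"])

# With the matching label the scraped values are kept.
correct = df.reindex(columns=["date", "precip"])

print(misspelled)
print(correct)
```

Because reindex does not raise on unknown labels, a quick look at df.columns before reindexing is one way to catch this kind of typo.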