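I am using the following code (a slightly modified version of an early example from Nathan Yau's "Visualize This") to scrape weather data from the WUnderGround site. As you can see, Python grabs the numeric data from the elements with the class name 'wx-data'.
However, I would also like to grab the average humidity from DailyHistory.html. The problem is that not all of the 'span' elements have a class name, and that is the case for the average humidity cell. How can I select this particular cell using BeautifulSoup and the code below?
(Here is an example of the page being scraped. Open your browser's developer tools and search for 'wx-data' to see the 'span' elements being referenced:
http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html)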
import urllib2
from BeautifulSoup import BeautifulSoup

year = 2004

# Create comma-delimited file
f = open(str(year) + '_LAXwunder_data.txt', 'w')

# Iterate through month and day
for m in range(1, 13):
    for d in range(1, 32):
        # Check if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4, 6, 9, 11]) and d > 30:
            break

        # Open wug url
        timestamp = str(year) + '0' + str(m) + '0' + str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/' + str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)

        # Get temp from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs={'class': 'wx-data'})[5].span.string

        # Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)

        # Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        # Build timestamp
        timestamp = str(year) + mStamp + dStamp

        # Write timestamp and temp to file
        f.write(timestamp + ',' + dayTemp + '\n')

# Done - close
f.close()
Answer 0 (score: 1)
You can search for the cell containing the text, then move up to its parent cell and across to the next cell:
humidity = soup.find(text='Average Humidity')
next_cell = humidity.find_parent('td').find_next_sibling('td')
humidity_value = next_cell.string
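I am using BeautifulSoup version 4 here, not 3; you really want to upgrade, as version 3 was mothballed 2 years ago.
BeautifulSoup 3 can do this particular trick too; use findParent() and findNextSibling() there instead, though.
Demo: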
>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get('http://www.wunderground.com/history/airport/LAX/2002/1/1/DailyHistory.html')
>>> soup = BeautifulSoup(response.content)
>>> humidity = soup.find(text='Average Humidity')
>>> next_cell = humidity.find_parent('td').find_next_sibling('td')
>>> next_cell.string
u'88'
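For anyone who wants to try the lookup without hitting the site, here is a minimal offline sketch. The HTML snippet is made up for illustration (an assumption about the table shape, not a copy of the real page), and the helper name value_next_to_label is hypothetical. It uses string=, which is the newer spelling of text= in bs4, and returns None when any step of the walk fails instead of raising AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the history table; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span>Mean Temperature</span></td>
    <td><span class="wx-data"><span>56</span></span></td>
  </tr>
  <tr>
    <td><span>Average Humidity</span></td>
    <td>88</td>
  </tr>
</table>
"""


def value_next_to_label(soup, label):
    """Return the text of the <td> that follows the <td> holding `label`.

    Hypothetical helper: returns None if the label, its parent cell,
    or the following cell cannot be found.
    """
    label_node = soup.find(string=label)  # exact match on the text node
    if label_node is None:
        return None
    label_cell = label_node.find_parent('td')
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling('td')
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
humidity_value = value_next_to_label(soup, 'Average Humidity')
missing_value = value_next_to_label(soup, 'No Such Label')
```

Passing 'html.parser' explicitly keeps the sketch independent of whichever parser happens to be installed. Inside a scraping loop, a None result is a cue to skip or log that day rather than crash halfway through the year.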
Answer 1 (score: 0)
Many thanks to @Martijn_Pieters for helping put together this final script:
import urllib2
from bs4 import BeautifulSoup

year = 2003

# Create comma-delimited file
f = open(str(year) + '_LAXwunder_data.txt', 'w')

# Change the year above, then run
# Iterate through month and day
for m in range(1, 13):
    for d in range(1, 32):  # could step 2 days at a time using range(1,32,2)
        # Check if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4, 6, 9, 11]) and d > 30:
            break

        # Open wug url
        timestamp = str(year) + '.' + str(m) + '.' + str(d)
        print 'Getting data for ' + timestamp
        url = 'http://www.wunderground.com/history/airport/LAX/' + str(year) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html'
        page = urllib2.urlopen(url)

        # Get temp and average humidity from page
        soup = BeautifulSoup(page)
        #dayTemp = soup.body.wx-data.b.string
        dayTemp = soup.findAll(attrs={'class': 'wx-data'})[5].span.string
        humidity = soup.find(text='Average Humidity')
        next_cell = humidity.find_parent('td').find_next_sibling('td')
        avg_humidity = next_cell.string

        # Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)

        # Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        # Build timestamp
        timestamp = str(year) + mStamp + dStamp

        # Write timestamp, temp and humidity to file
        f.write(timestamp + ',' + dayTemp + ',' + avg_humidity + '\n')
        print dayTemp, avg_humidity

# Done - close
f.close()
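One possible simplification, offered as a sketch rather than a drop-in replacement: let datetime walk the calendar instead of checking month lengths by hand. That also picks up Feb 29 in leap years, which the d > 28 check skips, and strftime handles the zero padding. The helper names days_in_year, timestamp_for and url_for are hypothetical; the URL pattern is the one used in the script above.

```python
from datetime import date, timedelta

# URL pattern taken from the script above; month and day are not zero-padded.
BASE_URL = ('http://www.wunderground.com/history/airport/LAX/'
            '{y}/{m}/{d}/DailyHistory.html')


def days_in_year(year):
    """Yield every calendar date in `year`, including Feb 29 in leap years."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day
        day += timedelta(days=1)


def timestamp_for(day):
    """Zero-padded YYYYMMDD stamp, as written to the output file."""
    return day.strftime('%Y%m%d')


def url_for(day):
    """History page URL for one day."""
    return BASE_URL.format(y=day.year, m=day.month, d=day.day)


# Small offline check of the helpers; no pages are fetched here.
first_day = next(days_in_year(2004))
first_stamp = timestamp_for(first_day)
first_url = url_for(first_day)
```

In the scraping loop this would become a single `for day in days_in_year(year):`, with url_for(day) passed to the fetch call and timestamp_for(day) written to the file.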