I've been working on a restaurant food hygiene scraper. I've been able to get the scraper to pull each restaurant's name, address and hygiene rating based on a postcode. Since the food hygiene rating is displayed on the site as an image, I set the scraper up to read the "alt=" attribute, which contains the numeric food hygiene score.
The div containing the img alt tag that I'm targeting for the food hygiene rating looks like this:
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
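For reference, pulling that alt value out of the markup above in isolation looks roughly like this (a minimal standalone sketch, separate from the full script further down):

from bs4 import BeautifulSoup

# Standalone example using only the snippet of markup shown above
html = '''
<div class="rating-image" style="clear: right;">
  <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
    <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
  </a>
</div>
'''

soup = BeautifulSoup(html, "lxml")
img = soup.select_one("div.rating-image img[alt]")
print(img["alt"])  # prints: 5 (Very Good)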
I'm able to output the food hygiene score next to each restaurant.
My problem is that I've noticed some restaurants have an incorrect value shown next to them, e.g. a 3 instead of the food hygiene rating of 4 (which is what is stored in the img alt tag).
The link the scraper connects to and initially scrapes is the search URL shown in the code below.
I think it may have something to do with the positioning of the rating loop inside the "g_data for loop".
I've found that if I move the piece of code below,
appendhygiene(scrape=[name,address,bleh])
outside of the loop
for rating in ratings:
    bleh = rating['alt']
the data is scraped with the correct hygiene scores; the only problem is that not all records are scraped, and in that case it only outputs the first 9 restaurants.
I'd appreciate anyone who can look over the code below and help me resolve the issue.
PS: I'm using the postcode BT367NG to scrape the restaurants (if you test the script, you can use it to see the restaurants that don't show the correct hygiene value, e.g. Lins Garden is a 4 on the site, but the scraped data shows a 3).
My full code is below:
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
hygiene = []

def deletelist():
    hygiene.clear()

def savefile():
    filename = input("Please input name of file to be saved")
    with open(filename + '.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerow(['Address', 'Town', 'Price', 'Period'])
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")

def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page = requests.get(url)
    print(url + " scraped successfully")
    return BeautifulSoup(page.text, "lxml")

def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = (item.find_all("a", {"class": "name"})[0].text)
        except:
            pass
        try:
            address = (item.find_all("span", {"class": "address"})[0].text)
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass
        appendhygiene(scrape=[name, address, bleh])

def hygieneratings():
    search = input("Please enter postcode")
    soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)  # delay time requests are sent so we don't get kicked by server
        soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
        button_next = soup.find("a", {"rel": "next"}, href=True)

def menu():
    strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n')
    choice = input(strs)
    return int(choice)

while True:  # use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break
Answer 0 (score: 1)
It looks like your problem is here:
try:
    for rating in ratings:
        bleh = rating['alt']
except:
    pass
appendhygiene(scrape=[name,address,bleh])
Doing that ends up appending the last value on each page every time. That's why, if the last value is "Exempt", all of the values come out as Exempt. If the rating is 3, all of the values on that page will be 3, and so on.
What you want to write is something like this:
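As a quick illustration of why that happens (a toy example with made-up alt values, not code from the scraper):

# The loop only reassigns bleh, so only the final value survives the loop
ratings_alts = ['5 (Very Good)', '4', '3']
for alt_text in ratings_alts:
    bleh = alt_text
print(bleh)  # prints: 3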
try:
    bleh = item.find_all('img', {'alt': True})[0]['alt']
    appendhygiene(scrape=[name,address,bleh])
except:
    pass
That way each rating is appended individually rather than simply appending the last one. I just tested it and it seems to work :)
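If you want to be a bit stricter about missing fields, a possible variation on the same idea (a sketch only, not tested against the live site) is to skip a search result entirely when any of the three pieces can't be found, rather than reusing values left over from the previous result:

def hygienescrape(g_data, ratings=None):
    # ratings is no longer needed here (each rating is read from its own result block),
    # but the parameter is kept so the existing call sites don't have to change.
    for item in g_data:
        try:
            name = item.find("a", {"class": "name"}).text
            address = item.find("span", {"class": "address"}).text
            bleh = item.find("img", alt=True)["alt"]
        except (AttributeError, TypeError):
            continue  # this result is missing its name, address or rating image
        appendhygiene(scrape=[name, address, bleh])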