Output img alt value is incorrect (Python 3, Beautiful Soup 4)

Asked: 2017-07-16 01:58:29

Tags: python beautifulsoup screen-scraping python-3.6 scrape

I have been working on a food hygiene scraper for restaurants. I have been able to get the scraper to pull the name, address and hygiene rating of restaurants based on a postcode. Since the food hygiene rating is shown online as an image, I set the scraper up to read the "alt=" attribute, which contains the numeric value of the food hygiene score.

The div containing the img alt attribute that I am targeting for the food hygiene rating looks like this:

<div class="rating-image" style="clear: right;">
            <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
                <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
            </a>
        </div>
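For what it's worth, reading the alt text out of that block on its own works fine; here is a minimal standalone check (using just the HTML pasted above):

from bs4 import BeautifulSoup

html = '''
<div class="rating-image" style="clear: right;">
    <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
        <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
    </a>
</div>
'''

soup = BeautifulSoup(html, "lxml")
img = soup.select_one("div.rating-image img[alt]")
print(img["alt"])  # prints: 5 (Very Good)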

I am able to output the food hygiene score next to each restaurant.

My problem is that I have noticed the reading displayed next to some restaurants is incorrect, for example 3 instead of the food hygiene rating of 4 (which is what is stored in the img alt attribute).

The link the scraper connects to and initially scrapes is:

https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=BT367NG&distance=1&search.x=16&search.y=21&gbt_id=0

I think this may have something to do with where the ratings loop sits inside the "for item in g_data" loop.

I have found that if I move the

appendhygiene(scrape=[name,address,bleh])

line of code outside of the loop below:

for rating in ratings:
    bleh = rating['alt']

the data is scraped correctly with the right hygiene scores; the only problem is that not all of the records are scraped, and in this case it only outputs the first 9 restaurants.

I would be grateful to anyone who can look over the code below and help resolve the issue.

P.S. I used the postcode BT367NG to scrape the restaurants (if you test the script you can use it to see restaurants that do not show the correct hygiene value, e.g. Lins Garden is a 4 on the site but the scraped data shows 3).

My full code is below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []

def deletelist():
    hygiene.clear()


def savefile():
    filename = input("Please input name of file to be saved")        
    with open (filename + '.csv','w') as file:
       writer=csv.writer(file)
       writer.writerow(['Address','Town', 'Price', 'Period'])
       for row in hygiene:
          writer.writerow(row)
    print("File Saved Successfully")


def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page=requests.get(url)
    print(url + "  scraped successfully")
    return BeautifulSoup(page.text,"lxml")


def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = (item.find_all("a", {"class": "name"})[0].text)
        except:
            pass
        try:
            address = (item.find_all("span", {"class": "address"})[0].text)
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']

        except:
            pass

        appendhygiene(scrape=[name,address,bleh])


def hygieneratings():

    search = input("Please enter postcode")
    soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))

    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)#delay time requests are sent so we don't get kicked by server
        soup=makesoup(url = "https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data = soup.findAll("div", {"class": "search-result"}), ratings = soup.select('div.rating-image img[alt]'))

        button_next = soup.find("a", {"rel" : "next"}, href=True)


def menu():
        strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n' )
        choice = input(strs)
        return int(choice) 

while True:          #use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break

1 Answer

Answer 0 (score: 1)

It looks like your problem is here:

try:
    for rating in ratings:
        bleh = rating['alt']

except:
    pass

appendhygiene(scrape=[name,address,bleh])

Doing it that way ends up appending the last value on each page. That is why, if the last value is "Exempt", all of the values will be "Exempt". If the rating is 3, all of the values on that page will be 3, and so on.
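A quick toy version of that loop (with made-up alt values, not taken from the site) shows the effect:

ratings = ["5 (Very Good)", "4 (Good)", "Exempt"]

for rating in ratings:
    bleh = rating  # bleh is reassigned on every pass through the loop

print(bleh)  # prints: Exempt (only the last value is left once the loop finishes)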

What you want to write is something like this:

try:
    bleh = item.find_all('img', {'alt': True})[0]['alt']
    appendhygiene(scrape=[name,address,bleh])

except:
    pass

That way each rating is appended individually, rather than simply appending the last one. I just tested it and it seems to work :)
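As a variation on the same idea, here is a sketch (assuming each search-result div contains at most one rating image, as in the HTML you posted) that scopes the lookup to each result with select_one and drops the separate ratings argument entirely:

def hygienescrape(g_data):
    for item in g_data:
        name = item.find("a", {"class": "name"})
        address = item.find("span", {"class": "address"})
        rating_img = item.select_one("div.rating-image img[alt]")

        # Fall back to an empty string if any piece is missing for this result
        appendhygiene(scrape=[
            name.text if name else "",
            address.text if address else "",
            rating_img["alt"] if rating_img else "",
        ])

With that change, the two hygienescrape(...) calls in hygieneratings would no longer pass a ratings= argument.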