Python - Unable to scrape menu data from a Zomato page

Time: 2018-10-03 02:50:35

Tags: python selenium-webdriver web-scraping beautifulsoup geckodriver

I want to build a scraper that scrapes the order page of a restaurant on Zomato, extracts the food menu, and writes it to a JSON file. Here is my code:

import re
import urllib
from urllib import parse
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import urllib.request
from selenium import webdriver
from bs4 import NavigableString
import sys
import json

browser = None
try:
    browser = webdriver.Firefox()
except Exception as error:
    print(error)


class ZomatoRestaurant:
    def __init__(self, url):
        self.url = url
        # print("opening")
        self.html_text = None
        # Set before the early return below so scrap() can always check it.
        self.soup = None
        try:
            browser.get(self.url)
            self.html_text = browser.page_source
            # self.html_text = urllib.request.urlopen(url).read().decode('utf-8')
            # self.html_text = requests.get(url).text
        except Exception as err:
            print(str(err))
            return
        else:
            print('Access successful.')

        if self.html_text is not None:
            self.soup = BeautifulSoup(self.html_text, 'lxml')

    def scrap(self):
        if self.soup is None:
            return {}
        soup = self.soup
        menu_details = dict()

        name_anchor = soup.find("a", attrs={"class": "o2header-title"})
        if name_anchor:
            menu_details['restaurant_name'] = name_anchor.text.strip()
        else:
            menu_details['restaurant_name'] = ''




        menu_details['dish_mappings'] = []
        for div in soup.find_all("div", attrs={'class': 'ui item item-view'}):
            child_div_dish_name = div.find("div", attrs={'class': 'header'})
            child_div_dish_price = div.find("div", attrs={'class': 'description'})
            if child_div_dish_name:
                # Record one entry per dish; the price block may be missing.
                menu_details['dish_mappings'].append({
                    'dish_name': child_div_dish_name.get_text().strip(),
                    'dish_price': child_div_dish_price.get_text().strip()
                    if child_div_dish_price else '',
                })
        return menu_details


if __name__ == '__main__':
    if browser is None:
        sys.exit()
    out_file = open("zomato_menu.json", "a")
    with open("order_online_menu.txt", "r", encoding="utf-8") as f:
        for line in f:
            # Strip the trailing newline so the URL is passed cleanly.
            zr = ZomatoRestaurant(line.strip())
            json.dump(zr.scrap(), out_file)
            out_file.write('\n')
    out_file.close()
    browser.close()

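Zomato's web responses are dynamic, and the site filters out requests that look like they come from a bot. For that reason I can't use Python libraries such as urllib or requests to fetch the HTML. Instead, I send the request to Zomato through a browser. All I need is a browser (Mozilla Firefox in my case) to drive from the script, and I use the Gecko driver in Selenium to interact with Firefox. Unlike most other packages, Selenium does not emulate a browser session; it is an actual browser session. Writing for Selenium basically means writing a set of actions and feeding them to the browser.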
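Because the page is rendered in a real browser session, `browser.page_source` can be read before the page's scripts have inserted the menu. That is only one possible cause of empty results, not a confirmed diagnosis. Selenium provides `WebDriverWait` for this kind of waiting; the sketch below uses a plain polling loop instead so that it runs without a browser. `wait_for_html` and `FakeBrowser` are hypothetical names introduced for illustration, and the `item-view` marker is taken from the class string used in the code above.

```python
import time


def wait_for_html(get_html, marker, timeout=10.0, poll=0.5):
    """Poll get_html() until `marker` appears in it or `timeout` seconds pass.

    Returns the HTML that contains the marker, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


class FakeBrowser:
    """Hypothetical stand-in for a WebDriver whose page fills in after a few reads."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<html><body>loading</body></html>"
        return '<html><body><div class="item-view">Dal</div></body></html>'


fake = FakeBrowser()
html = wait_for_html(lambda: fake.page_source, "item-view", timeout=2.0, poll=0.01)
print(html is not None, fake.reads)
```

With a real driver the call would look like `wait_for_html(lambda: browser.page_source, "item-view")`, and a `None` result would suggest that the marker never appears in the rendered page at all.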

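The problem I am facing is that the site has a complex structure and the code I wrote fails to fetch the menu from it, but I can't find the flaw. Could someone help identify the error, or what I have written wrong, that causes my output JSON file to contain empty attributes like these: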

{"restaurant_name": "", "dish_mappings": []}
{"restaurant_name": "", "dish_mappings": []}
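Empty values like these mean that neither `soup.find` nor `soup.find_all` matched anything in the HTML that was parsed. One way to check whether the expected class names are present at all is to count the class tokens in the page source. The sketch below does that with the standard library only; `ClassCounter` and `count_classes` are hypothetical helpers, and the sample HTML is invented for illustration rather than taken from Zomato.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class token appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated tokens.
                self.counts.update(value.split())


def count_classes(html):
    parser = ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample; in practice pass browser.page_source instead.
sample = (
    '<a class="o2header-title">Some Place</a>'
    '<div class="ui item item-view"><div class="header">Dal</div></div>'
)
counts = count_classes(sample)
for wanted in ("o2header-title", "item-view", "header", "description"):
    print(wanted, counts[wanted])
```

A count of zero for a class the scraper relies on would indicate that the selector, rather than the JSON writing, is where the data is lost.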

The sample links I used for this code are: Order Page of a restaurant in Indore

Order page of another restaurant

0 answers:

No answers yet