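Extracting cuisine type from TripAdvisor restaurants using Python

Time: 2016-10-13 17:34:29

Tags: python python-2.7 web-scraping beautifulsoup

I'm currently trying to pull data from TripAdvisor restaurants in different countries. The fields I want are name, address, and cuisine type (Chinese, steakhouse, etc.). I've successfully pulled the name and address with my script; however, pulling the cuisine type is proving very difficult. Below you'll find a screenshot of what I want to pull from TripAdvisor, along with my code.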

What I want to pull from TripAdvisor is circled in red.

When I run my code it keeps printing 'Asian' even though the second one should be 'Steakhouse'.

#import libraries
import requests
from bs4 import BeautifulSoup
import csv

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    #need this here for when you want more than 30 entries pulled
    while i <= range:
        i = str(i)
        #url format offsets the restaurants in increments of 30 after the oa
        url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + i + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        for link in soup1.findAll('a', {'property_title'}):
            #print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
            #print link.string
            account_name = link.string.strip()
            #cuisine type pull
            for link in soup1.findAll('a', {'cuisine'}):
                cuisinetype = link.string.strip()
            r_address = requests.get(restaurant_url)
            r_addresstext = r_address.text
            soup2 = BeautifulSoup(r_addresstext, "html.parser")
            for restaurant_url in soup2.findAll('span', {'street-address'})[0]:
                #print(restaurant_url.string)
                rest_address = restaurant_url.string
                rest_array = [account_name, rest_address, cuisinetype]
                print rest_array
                #with open('ListingsPull-HongKong.csv', 'a') as file:
                    #writer = csv.writer(file)
                    #writer.writerow([account_name, rest_address])
        break

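1 Answer:

Answer 0 (score: 0)

This approach isn't particularly elegant, but it may be acceptable to you. I noticed that the information you want also appears to be repeated under 'Cuisine' on the 'Details' tab of each restaurant's page. I found it easier to access it that way.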

>>> import requests
>>> from bs4 import BeautifulSoup
>>> restaurant_url='https://www.tripadvisor.ca/Restaurant_Review-g294217-d2399904-Reviews-Tin_Lung_Heen-Hong_Kong.html'
>>> soup2 = BeautifulSoup(requests.get(restaurant_url).text, "html.parser")
>>> street_address=soup2.find('span',{'class': 'street-address'})
>>> street_address
<span class="street-address" property="streetAddress">International Commerce Centre, 1 Austin Road West, Kowloon</span>
>>> street_address.contents[0]
'International Commerce Centre, 1 Austin Road West, Kowloon'
>>> for item in soup2.findAll('div', attrs={'class': 'title'}):
...     if 'Cuisine' in item.text:
...         item.text.strip()
...         break
... 
'Cuisine'
>>> content=item.findNext('div', attrs={'class': 'content'})
>>> content
<div class="content">
Chinese, Asian
</div>
>>> content.text
'\nChinese,\xa0Asian\n'
>>> content.text.strip().split('\xa0')
['Chinese,', 'Asian']
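
Note that the last split leaves a trailing comma on 'Chinese,'. Also, the inner cuisine loop in the question walks every cuisine link on the listing page and overwrites cuisinetype each time, so only the last one on the page survives, which would explain the repeated 'Asian'. A minimal sketch of one way around both, assuming the detail page keeps the title/content div layout seen in the session above: the function name extract_cuisines and the sample_html snippet are made up for illustration and are not TripAdvisor's real markup.

```python
from bs4 import BeautifulSoup


def extract_cuisines(html):
    """Return the cuisines listed under the 'Cuisine' title, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.find_all('div', attrs={'class': 'title'}):
        if 'Cuisine' in title.get_text():
            # The matching content div follows the title div in document order.
            content = title.find_next('div', attrs={'class': 'content'})
            if content is None:
                return []
            # Turn non-breaking spaces into plain spaces, then split on commas.
            text = content.get_text().replace(u'\xa0', u' ')
            return [part.strip() for part in text.split(',') if part.strip()]
    return []


# Hypothetical markup shaped like the divs printed in the session above.
sample_html = u"""
<div class="title">Cuisine</div>
<div class="content">
Chinese,\xa0Asian
</div>
"""

cuisines = extract_cuisines(sample_html)
print(cuisines)
```

In the question's script, this could be called once per restaurant on the text fetched from restaurant_url (the same page already requested for the address), in place of the inner cuisine loop, so each restaurant gets its own cuisine list.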