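I'm currently trying to extract data on TripAdvisor restaurants in different countries. The fields I want are name, address, and cuisine type (Chinese, steakhouse, etc.). I've successfully extracted the name and address with my script; however, pulling the cuisine type is proving very difficult. Below are a screenshot of what I want to pull from TripAdvisor and my code.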
What I want to pull from TripAdvisor is circled in red.
When I run my code, it keeps printing 'Asian' even though the second one should be 'Steakhouse'.
# import libraries
import requests
from bs4 import BeautifulSoup
import csv
# loop over the result pages (needed to pull more than 30 entries); entries come in increments of 30 per page
for i in range(0, 120, 30):
    # URL format offsets the restaurants in increments of 30 after the "oa"
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        #print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        #print(link.string)
        account_name = link.string.strip()
        # cuisine type pull
        for link in soup1.findAll('a', {'cuisine'}):
            cuisinetype = link.string.strip()
        r_address = requests.get(restaurant_url)
        r_addresstext = r_address.text
        soup2 = BeautifulSoup(r_addresstext, "html.parser")
        for restaurant_url in soup2.findAll('span', {'street-address'})[0]:
            #print(restaurant_url.string)
            rest_address = restaurant_url.string
            rest_array = [account_name, rest_address, cuisinetype]
            print(rest_array)
            #with open('ListingsPull-HongKong.csv', 'a') as file:
            #    writer = csv.writer(file)
            #    writer.writerow([account_name, rest_address])
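Answer 0 (score: 0)
This approach isn't particularly elegant, but it may be acceptable to you. I noticed that the information you want also appears under 'Cuisine' on the 'Details' tab of each restaurant's page. I found it easier to access it that way.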
>>> import requests
>>> from bs4 import BeautifulSoup
>>> restaurant_url='https://www.tripadvisor.ca/Restaurant_Review-g294217-d2399904-Reviews-Tin_Lung_Heen-Hong_Kong.html'
>>> soup2 = BeautifulSoup(requests.get(restaurant_url).text, "html.parser")
>>> street_address=soup2.find('span',{'street-address'})
>>> street_address
<span class="street-address" property="streetAddress">International Commerce Centre, 1 Austin Road West, Kowloon</span>
>>> street_address.contents[0]
'International Commerce Centre, 1 Austin Road West, Kowloon'
>>> for item in soup2.findAll('div', attrs={'class': 'title'}):
...     if 'Cuisine' in item.text:
...         item.text.strip()
...         break
...
'Cuisine'
>>> content=item.findNext('div', attrs={'class': 'content'})
>>> content
<div class="content">
Chinese, Asian
</div>
>>> content.text
'\nChinese,\xa0Asian\n'
>>> content.text.strip().split('\xa0')
['Chinese,', 'Asian']
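Note that splitting on '\xa0' leaves the comma attached to 'Chinese,'. If you want clean values, one option is to split on the comma and strip each piece. This is only a sketch: `split_cuisines` is a helper name I made up, and it assumes the cuisine text is always a comma-separated list like the one above.

```python
def split_cuisines(text):
    """Split a comma-separated cuisine string into clean names.

    Assumes the input looks like content.text above, e.g.
    '\nChinese,\xa0Asian\n'. str.strip() also removes the
    non-breaking space (\xa0) and the surrounding newlines.
    """
    parts = (part.strip() for part in text.split(','))
    # Drop empty pieces, e.g. from a trailing comma or empty input.
    return [part for part in parts if part]


print(split_cuisines('\nChinese,\xa0Asian\n'))  # ['Chinese', 'Asian']
```

Splitting on the comma rather than on whitespace should also keep multi-word cuisine names such as 'Middle Eastern' in one piece.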