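I'm currently trying to scrape the top 500 restaurants in Singapore from TripAdvisor. However, my current code only pulls the first 30 and keeps looping, printing those same 30 again and again until it reaches 500 records. I'd like it to print the first 30 from the first page, then the next 30 from the following page, and so on. Could someone look at my code and tell me why it behaves this way?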

    # loop to move into the next pages. entries are in increments of 30 per page
    for i in range(0, 500, 30):
        # url format offsets the restaurants in increments of 30 after the oa
        # change key and geography here
        url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        for link in soup1.findAll('a', {'property_title'}):
            # change key here
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
            print restaurant_url

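Answer 0 (score: 2)

I think you are building the URL incorrectly here:

    url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'

The correct URL format should be:

    url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(i)

Note the dash after the page offset. Without it, the offset runs straight into "Singapore" (for example `oa30Singapore`), which would explain why every request returns the same first page.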
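
To make the difference concrete, here is a small sketch that only builds the page URLs and makes no requests. The names `PAGE_URL_TEMPLATE` and `build_page_url` are mine, for illustration only, and it assumes TripAdvisor still uses the `-oa<offset>-` pattern shown above:

```python
# Hypothetical helper names; the URL pattern itself is the one from this answer.
PAGE_URL_TEMPLATE = (
    'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html'
    '#EATERY_LIST_CONTENTS'
)


def build_page_url(offset):
    """Return the listing-page URL for the given result offset."""
    return PAGE_URL_TEMPLATE.format(offset)


# One URL per page of 30 results: offsets 0, 30, ..., 480.
urls = [build_page_url(offset) for offset in range(0, 500, 30)]

for url in urls[:3]:
    print(url)
```

Printing the first few URLs before scraping is a cheap way to check that the offset and the dash end up where you expect.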
I would also maintain a web-scraping session and improve the variable naming:

    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:
        for offset in range(0, 500, 30):
            url = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(offset)
            soup = BeautifulSoup(session.get(url).content, "html.parser")
            for link in soup.select('a.property_title'):
                restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
                print(restaurant_url)

Also, consider adding a delay between subsequent requests to be a better web-scraping citizen.
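
One possible way to add that delay, as a rough sketch: wrap the offsets in a small generator that sleeps between consecutive items. The helper name `iter_with_delay` and the delay value are my own assumptions, not anything TripAdvisor documents; pick an interval you consider reasonable.

```python
import time


def iter_with_delay(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield each item, pausing between consecutive items.

    No pause happens before the first item. The `sleep` parameter is
    injectable so the pause can be replaced in tests.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


# Usage sketch: in the scraping loop, replace
#     for offset in range(0, 500, 30):
# with
#     for offset in iter_with_delay(range(0, 500, 30), delay_seconds=2.0):
```

Calling `time.sleep(...)` directly at the end of the loop body works just as well; the generator only keeps the waiting logic out of the parsing code.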