Loop to scrape multiple web pages is not looping

Asked: 2016-12-06 17:08:06

Tags: python-2.7 web-scraping beautifulsoup

I am currently trying to scrape the top 500 restaurants in Singapore from TripAdvisor; however, my current code only pulls the first 30 results and keeps looping over those same 30, printing them repeatedly until it reaches 500 records. I want it to print the first 30, then the next 30, and so on. I was wondering if someone could look at my code and see why it is doing this.

import requests
from bs4 import BeautifulSoup

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 500, 30):
    #url format offsets the restaurants in increments of 30 after the oa
    #change key and geography here
    url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        #change key here
        restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
        print restaurant_url

1 Answer:

Answer (score: 2):

I think you are constructing the wrong URL here:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa' + str(i) + 'Singapore.html#EATERY_LIST_CONTENTS'

The correct URL format should be:

url1 = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(i)

Note the dash after the "page offset".
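
For example, with an offset of 30 the two formats produce the following URLs. The first one, missing the dash, presumably does not resolve to the second results page, which would explain why the scraper kept returning the same first 30 entries:

https://www.tripadvisor.com/Restaurants-g294265-oa30Singapore.html#EATERY_LIST_CONTENTS    (wrong)
https://www.tripadvisor.com/Restaurants-g294265-oa30-Singapore.html#EATERY_LIST_CONTENTS   (correct)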

I would also maintain a single web-scraping session and improve the variable naming:

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    for offset in range(0, 500, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(offset)

        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
            print(restaurant_url)

Also, consider adding a delay between consecutive requests to be a better web-scraping citizen.
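
A minimal sketch of how such a delay could fit into the loop above, using time.sleep with an assumed two-second pause (the exact interval, and whether a fixed pause is acceptable, are assumptions rather than anything TripAdvisor specifies):

import time

import requests
from bs4 import BeautifulSoup

DELAY_SECONDS = 2  # assumed pause between page requests; tune as needed

with requests.Session() as session:
    for offset in range(0, 500, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g294265-oa{0}-Singapore.html#EATERY_LIST_CONTENTS'.format(offset)

        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            restaurant_url = 'https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href')
            print(restaurant_url)

        # pause before requesting the next results page
        time.sleep(DELAY_SECONDS)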