How do I get the last offset and loop through every page?

Time: 2017-04-17 09:55:58

Tags: python web-scraping beautifulsoup tripadvisor

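I want to get a list of all the things to do in 'Canberra' from 'tripadvisor.com'.
I previously used the same approach for hotels and it worked well, but I have now tried everything to make it work for things to do in 'Canberra', and it fails. Why? Here is my code for the Canberra things-to-do list: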

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen



offset = 0
url = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
urls = []
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")


for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)


for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')
    url = 'https://www.tripadvisor.com/Attractions-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.find_all('a', {'property_title'}):
        iurl='https://www.tripadvisor.com/Attraction_Review-g255057' + link.get('href')
        print(iurl)
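
Two things stand out in the snippet above. The first request uses a URL containing `-Activities-oa`, while the URL inside the loop omits `Activities-`. Also, `last_offset` is only assigned inside the first `for` loop, so if no `a` tag with class `last` is found, the second loop raises a `NameError`. Whether either point is the actual cause of the failure is not confirmed here. The following is a minimal sketch, assuming 30 items per page as in the question, that keeps the URL in one place and handles a missing pagination link. The names `URL_TEMPLATE`, `page_offsets`, and `page_urls` are hypothetical, and the template is copied from the question's first request without checking that it is the URL TripAdvisor expects.

```python
PAGE_SIZE = 30  # items per listing page, as assumed in the question's code

# Hypothetical single template (copied from the question's first request),
# so the first request and the loop cannot drift apart.
URL_TEMPLATE = (
    'https://www.tripadvisor.com/Attractions-g255057-Activities-oa{offset}'
    '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
)


def page_offsets(last_page_number, page_size=PAGE_SIZE):
    """Return the offset of every listing page, starting at 0."""
    if last_page_number is None:
        # No 'last' pagination link was found: treat the listing as one page.
        return [0]
    return list(range(0, int(last_page_number) * page_size, page_size))


def page_urls(last_page_number, template=URL_TEMPLATE):
    """Build one URL per listing page from the shared template."""
    return [template.format(offset=o) for o in page_offsets(last_page_number)]


# 'data-page-number' is read as a string in the question, so a string is used here.
for url in page_urls('3'):
    print(url)
```

In the question's code, the value passed in would be `link.get('data-page-number')` from the `a` tag with class `last`, or `None` when no such tag exists.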

Previously I used this code for hotels:

import requests
import re
import time
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen

offset = 0
url = 'https://www.tripadvisor.com/Hotels-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#EATERY_LIST_CONTENTS'

urls = []
r=requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")


for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)


for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')
    url = 'https://www.tripadvisor.com/Hotels-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#EATERY_LIST_CONTENTS'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.find_all('a', {'property_title'}):
        iurl='https://www.tripadvisor.com/' + link.get('href')
        print(iurl)       
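
The two snippets also build the item links differently: the hotel code prepends only the site root to `href`, while the attractions code prepends `'https://www.tripadvisor.com/Attraction_Review-g255057'`. If `href` is already a site-relative path (an assumption, since the page's markup is not shown here), that longer prefix would produce a malformed URL. A minimal sketch using the standard library's `urljoin`, which resolves relative and absolute `href` values alike; `BASE_URL`, `absolute_link`, and the sample `href` are made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.tripadvisor.com/'


def absolute_link(href, base=BASE_URL):
    """Resolve an href taken from a listing page against the site root."""
    return urljoin(base, href)


# Hypothetical href, made up for illustration only.
example_href = '/Attraction_Review-g255057-d0000000-Reviews-Example.html'
print(absolute_link(example_href))
```

Inside the question's inner loop, this would replace the string concatenation, for example `iurl = absolute_link(link.get('href'))`.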

0 answers:

No answers