Question

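I am using scrapy and BeautifulSoup to scrape all the hotels in different US cities.

When I reach a page titled "Hotels in San Francisco", it lists only 30 of the city's 250 hotels. Clicking "next 30 in list" does not change the URL or the sort parameters. My question: how can I get to the full list of 250 hotels, or choose which ranks to fetch? Thanks.

My code so far: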

import os
import re

import requests
from bs4 import BeautifulSoup


def writeHotels(soup, location):
    # create the Hotels directory if it does not exist yet
    hotelDir = 'Hotels/'
    if not os.path.exists(hotelDir):
        os.makedirs(hotelDir)

    hotels = soup.find_all("a", {"class": "Y"})
    name = location + '.txt'
    # append the hotel names to the file for this location
    if os.path.exists(hotelDir + name):
        print('opening file ' + name + "\n")
    else:
        print('creating file ' + name + "\n")
    with open(hotelDir + name, 'a') as f:
        for hotel in hotels:
            f.write(hotel.text + "\n")


# url is the listing page for one city, set elsewhere
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
headers = soup.find_all("h1", {"class": "X"})

for header in headers:
    headerText = header.text
    match = re.search('(.+ Hotels)', headerText)
    if match:
        writeHotels(soup, match.group(0))

Answer 1

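If you look at the page source for the page numbers at the bottom of the page, each page has its own unique URL. If you print out the soup, you will see that you can grab that URL. When there are many pages, not all of them are shown; the middle pages are replaced by "...". However, you can compute those URLs from the first and last values (I have not done that below). Here is the code I used: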

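import urllib.request

from bs4 import BeautifulSoup

url = "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html"
page = urllib.request.urlopen(url)

soup = BeautifulSoup(page.read(), 'html.parser')
# print(soup)
# each pagination link carries its page number and its own URL
for myValue3 in soup.findAll("a", attrs={"class": "pageNum"}):
    try:
        print("the value of page " + myValue3.get("data-page-number") + " is: " + myValue3.get("href").split("#ACCOM_OVERVIEW")[0])
    except (AttributeError, TypeError):
        # a link without data-page-number or href
        print("error")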

Here is the output:

the value of page 2 is: /Hotels-g60713-oa30-San_Francisco_California-Hotels.html
the value of page 3 is: /Hotels-g60713-oa60-San_Francisco_California-Hotels.html
the value of page 4 is: /Hotels-g60713-oa90-San_Francisco_California-Hotels.html
the value of page 5 is: /Hotels-g60713-oa120-San_Francisco_California-Hotels.html
the value of page 6 is: /Hotels-g60713-oa150-San_Francisco_California-Hotels.html
the value of page 8 is: /Hotels-g60713-oa210-San_Francisco_California-Hotels.html

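Note the -oa###- in the URL. In the output above the number increases by 30 per page (page 2 is oa30, page 8 is oa210), so you can change it to reach all the subsequent pages.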
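
As a rough sketch of that idea, the missing middle pages could be generated directly from the offset pattern instead of being read from the pagination links. The function name listing_page_urls, the default of 30 results per page, and the assumption that the offset always sits right after the -g<id> segment are illustrative guesses based only on the output above; the site may use a different layout today.

```python
import re


def listing_page_urls(first_page_url, total_hotels, per_page=30):
    """Build the URL of every listing page, assuming the -oa<offset>- pattern.

    Hypothetical helper: it only rewrites the URL string and makes no requests.
    """
    # Assumption: the offset goes right after the location id, e.g. "-g60713".
    match = re.search(r"-g\d+", first_page_url)
    if match is None:
        raise ValueError("no -g<id> segment found in URL")
    head = first_page_url[:match.end()]
    tail = first_page_url[match.end():]

    # The first page has no offset; later pages start at per_page, 2*per_page, ...
    urls = [first_page_url]
    for offset in range(per_page, total_hotels, per_page):
        urls.append("{}-oa{}{}".format(head, offset, tail))
    return urls


urls = listing_page_urls(
    "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html",
    250,
)
for page_url in urls:
    print(page_url)
```

Each generated URL can then be fetched and parsed the same way as the first page. The total of 250 comes from the question; in practice it would have to be read from the page itself.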
