Looping over a list - getting the results (subpages) for each page

Date: 2017-07-20 07:19:05

Tags: python loops for-loop web-scraping beautifulsoup

I am trying to get the number of pages for each URL in a list of URLs. My code works as long as I only have a single URL, but as soon as I try it with a list of URLs, I only get results from one URL. I guess the problem has to do with my loop. Since I am new to Python and BeautifulSoup, I can't spot the mistake myself.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'
urls=[]

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


for i in range (1,last_page):
    page = main_page.format(i)
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
       urls = base_url + link.find('a').get('href')+"/-/p/{}"
       print(urls)

Up to this point everything works: I get the maximum page number and collect all the URLs from each page. The problem is in the code below (I believe):

for url in urls: #to loop through the list of urls
    r = requests.get(url.format(0)) 
    soup = BeautifulSoup(r.text, 'lxml')
    daten = soup.find_all('a', {'class':'link'})

    tes = []
    for z in daten:
        tes.append(z.text) #writing each value to res list
    print(tes)

    tes_int = []
    for z in tes:
        try:
            tes_int.append(int(z))
        except:
            print("current value is not a number")
    anzahl=max(tes_int)
    print(anzahl)

I am trying to apply the same approach as in the code above to each URL in the list urls - but instead of getting the maximum page number for each URL, I get 241 every time, as if I were stuck in a loop...

Any ideas? Help is much appreciated.

1 Answer:

Answer 0 (score: 1):

urls ends up equal to the last link generated by the loop. To build a valid list of URLs, you need to replace the = assignment with append():

urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try:
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)
print(urls)
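
As a minimal sketch of the difference (using a made-up list of hrefs, not data from the site), reassignment with = rebinds the name on every pass and keeps only the last value, while append() collects all of them:

links = ['/hotel/a', '/hotel/b', '/hotel/c'] #hypothetical hrefs for illustration

wrong = []
for href in links:
    wrong = 'https://www.holidaycheck.de' + href #rebinds wrong on every pass
print(wrong) #only the last link survives

right = []
for href in links:
    right.append('https://www.holidaycheck.de' + href) #grows the list
print(right) #all three full links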

EDIT: Well, as far as I can see, there are several problems in your code. Along with my initial fix, here is an outline of how I understand your code should work:

import requests
from bs4 import BeautifulSoup
base_url = 'https://www.holidaycheck.de'
main_page = 'https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1?p={}'

##Change URL into object (soup)
r = requests.get(main_page.format(0))  
soup = BeautifulSoup(r.text, "html5lib")

#get max page number
soup = BeautifulSoup(r.text, 'lxml')
data = soup.find_all('a', {'class':'link'})

res = []
for i in data:
    res.append(i.text) #writing each value to res list

res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except:
        print("current value is not a number")
last_page=max(res_int)
#print(last_page)


urls = []
for i in range (1,last_page):
    page = main_page.format(i)
    r = requests.get(page) #these 2 rows added
    soup = BeautifulSoup(r.text, 'lxml') #these 2 rows added
    for link in soup.find_all('div', {'class':'hotel-reviews-bar'}):
        try: #also adding try-except for escaping broken/unavailable links
            urls.append(base_url + link.find('a').get('href')+"/-/p/{}")
        except:
            print('no link available', i)

urls = list(set(urls)) #check and drop duplicated in links list

for url in urls: #to loop through the list of urls
    try:
        r = requests.get(url.format(0))
        print(url.format(0))
        soup = BeautifulSoup(r.text, 'lxml')
        daten = soup.find_all('a', {'class':'link'})
    except:
        print('broken link')
        continue #skip this url so daten from a previous iteration is not reused

    tes = []
    for z in daten:
        tes.append(z.text) #writing each value to res list
#    print(tes)

    tes_int = []
    for z in tes:
        try:
            tes_int.append(int(z))
        except:
            print("current value is not a number")
    try:
        anzahl=max(tes_int)
        print(anzahl)
    except:
        print('maximum cannot be calculated')
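
As a side note (my own suggestion, not part of the original answer): on Python 3.4+ the final try/except can be replaced by the default argument of max(), which avoids the ValueError on an empty list. A minimal sketch:

tes_int = [] #e.g. no numeric link texts were found on the page
anzahl = max(tes_int, default=0) #0 is an assumed fallback page count
print(anzahl) #prints 0 instead of raising ValueError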