Scrape"下一页"在Python中

时间:2017-11-07 02:30:10

标签: python html beautifulsoup scrape

I am trying to scrape the next pages of a website. There are 20 pages in total. I want to start from the URL of the first page and follow the next-page links.

Code:

import requests
from bs4 import BeautifulSoup

b=[]
url="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
res=requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
b.append(url)
while True:   
    try:
        dct = {"data-icon":"k"}
        url=soup.find('',dct)
        url=(url['href'])
        print(url)
    except TypeError:   
        break
    if url:
        url=("https://abcde.com"+url)
        print(url)  
        b.append(url) 
print(b)

HTML of the next-page link:

<li class="next"><a href="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/?p=2" data-icon="k">next page</a></li>

HTML of the same link on the last page:

<li class="next disabled"><a href="" data-icon="k">next page</a></li>

It only prints out the URL from the first page.

1 Answer:

Answer 0 (score: 0)

What did you expect to happen? You only call requests.get(url) once, before you enter the while True loop. You need to move res=requests.get(url) and everything after it inside the while loop so that your code actually fetches the subsequent pages. For example:

# The following are used for the debugging output in this example:
import sys
import traceback

# ... Your other code...

b=[]
url="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
b.append(url)
while True:
    try:
        res=requests.get(url)
    except:
        print("Failed while fetching " + str(url))
        print("Stack trace:")
        traceback.print_exc()
        break
    # end try
    try:
        soup = BeautifulSoup(res.text,"lxml")
    except:
        print("Failed setting up beautiful soup parser object.")
        print("Response from request for '" + str(url) + "' was: \n\t" + str(res).replace("\n", "\n\t"), file=sys.stderr) # Avoids polluting STDOUT
        traceback.print_exc()
        break
    # end try

    # The following line is not needed here because the new URL is added in the IF statement at the bottom of loop:
    # b.append(url)

    try:
        dct = {"data-icon":"k"}
        url=soup.find('',dct)
        url=(url['href'])
        print(url)
    except TypeError:
        print("Leaving loop after Parsing of URL from page failed.")
        break
    if url:
        url=("https://abcde.com"+url)
        print(url)
        b.append(url)
    else:
        # An empty href means the last page was reached - nothing more to fetch.
        print("Leaving loop: the next-page link has an empty href (last page).")
        break
# end while True

# Debug statement:
print("Outside of loop.")

# Print output
print(b)

Now the page for the new URL is requested each time, because requests.get(url) is inside the loop and therefore executes on every iteration.
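
For completeness, here is a more compact sketch of the same idea. It assumes the site structure from the question (the abcde.com base URL and the data-icon="k" next-page link), stops as soon as that link has an empty href as on the last page, and handles both absolute and relative href values:

# A compact sketch of the same loop, assuming the page structure from the question.
import requests
from bs4 import BeautifulSoup

base = "https://abcde.com"
url = base + "/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
pages = [url]

while True:
    res = requests.get(url)                     # fetch the current page on every iteration
    soup = BeautifulSoup(res.text, "lxml")
    link = soup.find("a", {"data-icon": "k"})   # the "next page" anchor
    if link is None or not link.get("href"):    # missing or empty href means the last page
        break
    href = link["href"]
    url = href if href.startswith("http") else base + href
    pages.append(url)

print(pages)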