How to scrape multiple pages from one website

Time: 2018-03-05 07:57:02

Tags: python scrape

I want to scrape multiple pages from one website. They follow this pattern:

https://www.example.com/S1-3-1.html
https://www.example.com/S1-3-2.html
https://www.example.com/S1-3-3.html
https://www.example.com/S1-3-4.html
https://www.example.com/S1-3-5.html

I tried three methods to scrape all of these pages, but each one only scrapes the first page. The code is shown below; I would greatly appreciate it if anyone could look it over and tell me what is wrong.

 ===============method 1====================
    import requests  
    for i in range(5):      # Number of pages plus one 
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
    from bs4 import BeautifulSoup  
    soup = BeautifulSoup(r.text, 'html.parser')  
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    ===============method 2=============
    import urllib2,sys
    from bs4 import BeautifulSoup
    for numb in ('1', '5'):
        address = ('https://www.example.com/S1-3-' + numb + '.html')
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html,'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    =============method 3==============
    import requests 
    from bs4 import BeautifulSoup  
    url = 'https://www.example.com/S1-3-1.html'
    for round in range(5):
        res = requests.get(url)
        soup = BeautifulSoup(res.text,'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
        paging = soup.select('div.paging a')
        next_url = 'https://www.example.com/'+paging[-1]['href'] # paging[-1]['href'] is next page button on the page 
        url = next_url

I looked at some related answers and tried them, but this is not a loop problem. Please see the screenshots (omitted here): both show only first-page results. This has been driving me crazy for days.

3 answers:

Answer 0 (score: 2)

Your indentation is off.

Try this (method 1):

from bs4 import BeautifulSoup 
import requests

for i in range(1, 6):      # pages 1 through 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')  
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
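
One more thing to watch beyond the indentation: `results` is reassigned on every pass, so after the loop it only holds the last page's matches. A minimal self-contained sketch (using inline HTML strings in place of the live pages, purely for illustration) that collects every page with `extend`:

```python
from bs4 import BeautifulSoup

# Stand-ins for the five downloaded pages; in the real script each
# string would be requests.get(url).text.
pages = [
    '<div class="product-item item-template-0 alternative">item {}</div>'.format(i)
    for i in range(1, 6)
]

all_results = []
for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    # extend, not assign: assigning would keep only the last page's divs
    all_results.extend(soup.find_all(
        'div', attrs={'class': 'product-item item-template-0 alternative'}))

print(len(all_results))  # 5
```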

Answer 1 (score: 1)

Your page parsing should be inside the loop, like this; otherwise only one page is processed:

.......
    from bs4 import BeautifulSoup
    for i in range(1, 6):      # pages 1 through 5
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
........

Answer 2 (score: 1)

First, you have to put all of the statements inside the loop; otherwise only the last iteration is used.

Second, you can try closing the request session at the end of each iteration:

import requests
from bs4 import BeautifulSoup

for i in range(1, 6):      # pages 1 through 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    r.close()
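
Building on the session idea, here is a sketch that shares one `requests.Session` across all pages and closes it once via a `with` block, instead of calling `r.close()` after every request. The function wrapper and the injectable `get` callable are illustrative assumptions, not part of the original answer:

```python
import requests
from bs4 import BeautifulSoup

SELECTOR = {'class': 'product-item item-template-0 alternative'}

def scrape_pages(get, page_count):
    """Fetch every page with `get(url)` and collect the matching divs.

    `get` is any callable returning an object with a `.text` attribute,
    e.g. the bound method session.get, so one session serves all pages.
    """
    results = []
    for i in range(1, page_count + 1):
        url = "https://www.example.com/S1-3-{}.html".format(i)
        soup = BeautifulSoup(get(url).text, 'html.parser')
        results.extend(soup.find_all('div', attrs=SELECTOR))
    return results

# With a shared session (closed automatically when the with-block exits):
# with requests.Session() as session:
#     results = scrape_pages(session.get, 5)
```

Reusing one session also lets requests pool the underlying TCP connection across the five fetches rather than reconnecting each time.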