I want to scrape multiple pages from a website. They follow this pattern:
https://www.example.com/S1-3-1.html https://www.example.com/S1-3-2.html https://www.example.com/S1-3-3.html https://www.example.com/S1-3-4.html https://www.example.com/S1-3-5.html
I tried three methods to scrape all of these pages, but each one only scrapes the first page. The code for each method is shown below; anyone who can look it over and tell me what is wrong would be highly appreciated.
===============method 1====================
import requests

for i in range(5): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
===============method 2=============
import urllib2, sys
from bs4 import BeautifulSoup

for numb in ('1', '5'):
    address = ('https://www.example.com/S1-3-' + numb + '.html')
    html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
=============method 3==============
import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com/S1-3-1.html'
for round in range(5):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    paging = soup.select('div.paging a')
    next_url = 'https://www.example.com/' + paging[-1]['href'] # paging[-1]['href'] is the next-page button on the page
    url = next_url
I checked some existing answers against my code, but this is not a loop problem. The output contains only first-page results (screenshots showing only first-page results omitted here). This has been driving me crazy for days.
Answer 0 (score: 2)
Your indentation is off.
Try this for method 1:
from bs4 import BeautifulSoup
import requests

for i in range(1, 6): # pages 1 through 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
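One thing to watch even with that fix: results is reassigned on every pass, so after the loop it only holds the last page. If the goal is the items from all five pages, a minimal sketch that accumulates them into a list (selector copied from the question) could look like this:

from bs4 import BeautifulSoup
import requests

all_results = []  # collects items from every page
for i in range(1, 6):  # pages 1 through 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # extend with this page's matches instead of overwriting
    all_results.extend(soup.find_all('div', attrs={'class': 'product-item item-template-0 alternative'}))
print(len(all_results))  # total items found across all pages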
Answer 1 (score: 1)
Your page parsing should be inside the loop, like this; otherwise only one page is ever parsed:
.......
from bs4 import BeautifulSoup

for i in range(5): # Number of pages plus one
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
........
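To see why the placement matters, here is a hypothetical toy example: statements after the loop run once and only see the values from the final iteration.

for i in range(3):
    x = i      # inside the loop: runs on every pass
print(x)       # outside the loop: runs once and prints only 2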
Answer 2 (score: 1)
First, you have to put all of those statements inside the loop; otherwise the parsing only runs against the last iteration.
Second, you can try closing the request at the end of each iteration:
import requests
from bs4 import BeautifulSoup

for i in range(1, 6): # pages 1 through 5
    url = "https://www.example.com/S1-3-{}.html".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'product-item item-template-0 alternative'})
    r.close()
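As an alternative to closing each request, a single session can be reused for every page and closed once at the end; a sketch using requests.Session (the selector is again copied from the question):

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:  # the session is closed automatically on exit
    for i in range(1, 6):  # pages 1 through 5
        url = "https://www.example.com/S1-3-{}.html".format(i)
        r = s.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', attrs={'class': 'product-item item-template-0 alternative'})
        # process results here before the next page overwrites them

Reusing one session also lets requests keep the underlying connection alive between pages, which is usually faster than opening a fresh connection per request.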