I'm trying to scrape the following pages of a website. There are 20 pages in total, and I want to start from the first page's URL and follow the next-page links.
Code:
import requests
from bs4 import BeautifulSoup

b = []
url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
b.append(url)
while True:
    try:
        dct = {"data-icon": "k"}
        url = soup.find('', dct)
        url = url['href']
        print(url)
    except TypeError:
        break
    if url:
        url = "https://abcde.com" + url
        print(url)
        b.append(url)
print(b)
HTML of the next-page link:
<li class="next"><a href="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/?p=2" data-icon="k">next page</a></li>
HTML of the same link on the last page:
<li class="next disabled"><a href="" data-icon="k">next page</a></li>
It only prints the first page's URL.
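For context, the two HTML snippets above already give a clean stopping condition: on the last page the href is empty and the parent li gains a disabled class. Below is a minimal, self-contained sketch of telling the two cases apart; the variable names next_page_html and last_page_html are illustrative, and the markup is exactly what is quoted above.

from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two snippets quoted above:
next_page_html = '<li class="next"><a href="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/?p=2" data-icon="k">next page</a></li>'
last_page_html = '<li class="next disabled"><a href="" data-icon="k">next page</a></li>'

for html in (next_page_html, last_page_html):
    soup = BeautifulSoup(html, "lxml")
    link = soup.find("a", {"data-icon": "k"})
    # On the last page the href attribute is an empty string, which is falsy:
    if link is not None and link.get("href"):
        print("next page ->", link["href"])
    else:
        print("last page reached")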
Answer 0 (score: 0)
What did you expect to happen? You only call requests.get(url) once, before you enter the while True loop. You need to put res=requests.get(url) and all subsequent lines inside the while loop so that your code actually fetches the subsequent pages. For example:
# The following are used for debugging output in this example:
import sys
import traceback
# ... Your other code...

b = []
url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
b.append(url)
while True:
    try:
        res = requests.get(url)
    except Exception:
        print("Failed while fetching " + str(url))
        print("Stack trace:")
        traceback.print_exc()
        break
    # end try
    try:
        soup = BeautifulSoup(res.text, "lxml")
    except Exception:
        print("Failed setting up Beautiful Soup parser object.")
        # Write the response to STDERR to avoid polluting STDOUT:
        print("Response from request for '" + str(url) + "' was: \n\t" + str(res).replace("\n", "\n\t"), file=sys.stderr)
        traceback.print_exc()
        break
    # end try
    # The following line is not needed here because the new URL is appended
    # in the IF statement at the bottom of the loop:
    # b.append(url)
    try:
        dct = {"data-icon": "k"}
        url = soup.find('', dct)
        url = url['href']
        print(url)
    except TypeError:
        print("Leaving loop after parsing of URL from page failed.")
        break
    if url:
        url = "https://abcde.com" + url
        print(url)
        b.append(url)
# end while True

# Debug statement:
print("Outside of loop.")

# Print output:
print(b)
This requests the page at the new URL every time, because requests.get(url) is inside the loop and therefore executes on each iteration.
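As a side note, the same stop condition can make the loop shorter. The sketch below is one possible tightening, not the only fix: it assumes the next link is the a element with data-icon="k" shown in the question, and it uses urllib.parse.urljoin so it works whether the href is absolute (as in the question's next-page snippet) or relative (as the original "https://abcde.com" + url concatenation implies).

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

b = []
url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
while url:
    b.append(url)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    link = soup.find("a", {"data-icon": "k"})
    href = link.get("href") if link is not None else None
    # An empty or missing href marks the last page, per the HTML in the question:
    url = urljoin(url, href) if href else None
print(b)

With 20 pages, b should end up holding 20 URLs, one per page.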