Web scraping - Selenium / BeautifulSoup - looping through pagination

Posted: 2019-09-27 14:10:43

Tags: python web-scraping beautifulsoup

I'm experimenting with Selenium (just learning a few things; I asked some questions about BeautifulSoup earlier and got some good suggestions).

Anyway, I'm simply trying to loop through the pages, grab `div.details`, and print how many were found (as an initial test). The problem is that it seems to get stuck in the loop, just sitting on the first page and reloading it.

How would I change it so that it loops through page 1 and page 2 and then stops?

from bs4 import BeautifulSoup
import requests
import csv
import pandas
from pandas import DataFrame
import re
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"


from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

page = 1

driver = webdriver.Chrome(ChromeDriverManager().install())
url="https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={page}"

# grab all div.details elements on each page

with requests.Session() as session:
  while True:
    res=session.get(url.format(page=page))
    soup=BeautifulSoup(res.content,'html.parser')
    gun_details = soup.select('div.details')
    if soup.select("nav_next") is None:
        break
    page += 1
    driver.get(url) #navigate to the page
print(len(gun_details))

1 Answer:

Answer 0 (score: 1)

You don't need Selenium to navigate; this can be done with requests alone. Also note that `soup.select(...)` always returns a list (possibly empty), never `None`, so your `if soup.select("nav_next") is None` check can never be true; test the list's length instead, and use `.nav_next` (with the leading dot) to match by class.

from bs4 import BeautifulSoup
import requests
import os
os.environ["PYTHONIOENCODING"] = "utf-8"

page = 1
url = "https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, "html.parser")
        gun_details = soup.select("div.details")
        print(len(gun_details))
        # an empty list means there is no "next" link, i.e. this is the last page
        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

I've added print statements so you can follow what it's doing. The output:

https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=1
10
https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=2
4
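To see why the termination check matters without hitting the live site, here is a minimal offline sketch. The two inline HTML strings are made-up stand-ins for the real pages: page 1 has a `nav_next` link, page 2 (the last page) does not, so the `len(...) == 0` test ends the loop:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for two result pages: page 1 has a
# "next" link, page 2 (the last page) does not.
pages = {
    1: '<div class="details">A</div><a class="nav_next">Next</a>',
    2: '<div class="details">B</div>',
}

page = 1
counts = []
while True:
    soup = BeautifulSoup(pages[page], "html.parser")
    counts.append(len(soup.select("div.details")))
    # select() returns a list, never None, so test its length
    if len(soup.select(".nav_next")) == 0:
        break
    page += 1

print(counts)  # prints [1, 1]
```

An `is None` check on `select()`'s result would loop forever here (the list is just empty on the last page), which is exactly the behavior described in the question.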