Question

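Although I have some experience with Selenium, I am new to Python and am using Beautiful Soup for the first time. I am trying to scrape a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx") to get all the affiliation numbers.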

The problem is that they are spread across multiple pages (20 results per page, 21,000 results in total).

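So I want to scrape them in some kind of loop that iterates by clicking the next-page button. The catch is that the page URL does not change, so there is no URL pattern to follow.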

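For this I first tried the Google Sheets IMPORTHTML / IMPORTXML functions, but because of the size of the problem it just hung. Next I turned to Python and started reading about scraping with Python (this is my first time doing it :)). Someone on this platform suggested an approach: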

（Python Requests/BeautifulSoup access to pagination）

I tried to do the same, but with little success.

Also, to get any results we first have to type the keyword "a" into the search bar and then click "Search". Only then does the site show results.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("headless")
# executable_path is accepted by Selenium 3 and early 4.x; newer releases expect a Service object
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)

driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")
# Click the radio button
driver.find_element(By.ID, 'optlist_0').click()

time.sleep(2)

# Type the letter "a" into the search box and click the Search button
driver.find_element(By.ID, 'keytext').send_keys("a")
driver.find_element(By.ID, 'search').click()

time.sleep(2)

data = []
try:
    next_button = driver.find_element(By.ID, "Button1")
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'T1'})  # main table
        table_body = table.find('tbody')  # get inside the body
        rows = table_body.find_all('tr')  # look for all table rows
        for row in rows:
            cols = row.find_all('td')  # in every table row, look for table data
            for col in cols:
                # table -> tbody -> tr -> td -> <b>, then exit the loop
                # (only the first one is the data I need, so print it)
                bold = col.find('b')
                if bold:
                    print(bold.text)
                    data.append(bold.text)
                    break
        break  # stuck here: I don't know how to move on to the next page
except NoSuchElementException:
    print("Next button not found")

The final result I expect is a list of all affiliation numbers across the multiple pages.

Answer 1

A small addition to your code inside the while loop:

next_button = 1 #Initialise the variable for the first instance of while loop

while next_button:
    #First scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    #Now locate the button & click on it
    next_button = driver.find_element(By.ID,'Button1')
    next_button.click()
    ###
    ###Beautiful Soup Code : Fetch the page source now & do your thing###
    ###
    #Adjust the timing as per your requirement
    time.sleep(2)
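
The Beautiful Soup placeholder in the loop above could be filled in roughly as follows. This is only a sketch: the function name `first_bold_per_row` is made up for illustration, and it assumes (going by the comment in the question) that the wanted number is the first `<b>` element in each row of the table with id `T1`. The sample markup is invented, so check it against the real page source.

```python
from bs4 import BeautifulSoup


def first_bold_per_row(html, table_id="T1"):
    """Return the text of the first <b> tag found in each row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # results table not rendered (yet)
        return []
    values = []
    for row in table.find_all("tr"):
        bold = row.find("b")  # first <b> anywhere inside this row
        if bold is not None:
            values.append(bold.get_text(strip=True))
    return values


# Made-up markup for illustration only; the real page will differ.
SAMPLE = """
<table id="T1"><tbody>
  <tr><td><b>1234567</b></td><td>Some School</td></tr>
  <tr><td><b>7654321</b></td><td>Another School</td></tr>
</tbody></table>
"""
print(first_bold_per_row(SAMPLE))
```

Inside the loop you would call it as `first_bold_per_row(driver.page_source)` and extend your `data` list with the result.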

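Note that scrolling to the bottom of the page matters: otherwise an error is raised saying the "Button1" element is hidden beneath the footer. The script at the start of the loop therefore scrolls the browser to the bottom of the page, where the "Button1" element is clearly visible. Then locate the element, click it, and let your Beautiful Soup code take over.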
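
One thing the loop does not cover is when to stop. I don't know how the site behaves on the last page, so the sketch below guesses: it stops when "Button1" is no longer found, when a click no longer changes the parsed rows, or after a safety cap. The names `scrape_all_pages`, `parse_page`, `max_pages` and `pause` are hypothetical, and the cap of 2000 is just a number comfortably above 21,000 / 20 pages.

```python
import time

# Same tuple as (By.ID, "Button1") in Selenium, where By.ID is the string "id".
NEXT_BUTTON = ("id", "Button1")


def scrape_all_pages(driver, parse_page, max_pages=2000, pause=2.0):
    """Run parse_page on every result page, clicking the next button in between.

    parse_page is any function that turns page HTML into a list of values.
    """
    results = []
    previous = None
    for _ in range(max_pages):
        current = parse_page(driver.page_source)
        # Assumption: an empty or unchanged page means there is nothing more to read.
        if not current or current == previous:
            break
        results.extend(current)
        previous = current

        # Scroll down first so the footer does not cover the button.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        buttons = driver.find_elements(*NEXT_BUTTON)  # empty list if not present
        if not buttons:
            break
        buttons[0].click()
        time.sleep(pause)  # crude wait for the next page; adjust as needed
    return results
```

Unlike the loop above, this reads the current page before clicking, so the first page of results is not skipped. A fixed `time.sleep` is the simplest wait; an explicit wait on the table would be more reliable if the page loads slowly.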

How to loop through pages using Beautiful Soup 4 with Python and Selenium?
