Accessing the next page with Selenium

Date: 2019-06-04 18:28:54

Tags: python-3.x selenium

First off, I had never used Selenium until yesterday. After many attempts, I was able to scrape my target table correctly.

I am now trying to scrape tables across a sequence of pages. Sometimes it works, and sometimes it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved the problem. I suspect the answer is simple, but after 8 hours I need to ask a Selenium expert.

My target URL is: RedHat Security Advisories

If there is an existing question on Stack Overflow that covers my problem, please point me to it and I will do some research and testing.

Here are some of the things I have tried:

Example 1:

page_number = 0
while True:
    try:
        page_number += 1

        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable(
                (By.XPATH, f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                           f'/dir-pagination-controls/ul/li[{page_number}]'))))

        browser.find_element_by_xpath(
            f'//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
            f'/dir-pagination-controls/ul/li[{page_number}]').click()

        print(f"Navigating to page {page_number}")

        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)

    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break

    except Exception as e:
        print(e)
        break

Example 2:

page_number = 0
while True:
    try:
        page_number += 1

        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable(
                (By.XPATH, '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
                           '/dir-pagination-controls/ul/li[12]'))))

        browser.find_element_by_xpath(
            '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]'
            '/dir-pagination-controls/ul/li[12]').click()

        print(f"Navigating to page {page_number}")

        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)

    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break

    except Exception as e:
        print(e)
        break

2 Answers:

Answer 0 (score: 1)

You can use the following logic.

lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable(
    (By.XPATH, "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
               "/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
# pages = '5'  # uncomment to cap the run while testing
for pNumber in range(1, int(pages)):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable(
        (By.XPATH, "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]"
                   "//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view
    currentPage.click()
    WebDriverWait(driver, 120).until_not(EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    rows = driver.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")  # <== getting rows here
    for row in rows:
        print(row.text)  # <== printing all row data; if you want cell data, update the logic accordingly
    time.sleep(randint(1, 5))  # <== this step is optional

Answer 1 (score: 0)

I believe you can read the data directly via the URL instead of clicking through the pagination. That avoids the synchronization problems that can make the script fail.

  1. Use this XPath to get the total page count of the security-updates table: //*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]
  2. Loop up to the page count obtained in step 1; inside the loop, put the page number into the URL below and send a GET request: https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=<page-number>&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct

  3. Wait for the page to load.

  4. Read the data from the table rendered on the page.

  5. Repeat this process until the pagination count is reached.

  6. If you hit a site-specific error that blocks the user, refresh the page with the same page_number.
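
The steps above can be sketched as follows. This is a minimal, hypothetical sketch: the `advisory_page_url` helper and `total_pages` name are my own, the base URL is taken from step 2, and the commented loop requires a configured Selenium WebDriver plus the imports from the question.

```python
# Sketch of the direct-URL pagination described in steps 1-6.
# Only the URL builder runs standalone; the driver loop is illustrative.
BASE_URL = (
    "https://access.redhat.com/security/security-updates/#/security-advisories"
    "?q=&p={page}&sort=portal_publication_date%20desc&rows=10"
    "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct"
)

def advisory_page_url(page: int) -> str:
    """Return the advisories URL for a given results page (step 2)."""
    return BASE_URL.format(page=page)

# Hypothetical driver loop (needs a configured WebDriver):
#
# for page in range(1, total_pages + 1):        # total_pages from step 1
#     browser.get(advisory_page_url(page))      # step 2: GET the page directly
#     WebDriverWait(browser, 30).until(         # step 3: wait for the table
#         EC.presence_of_element_located((By.XPATH, "//table")))
#     for row in browser.find_elements_by_xpath("//table/tbody/tr"):
#         print(row.text)                       # step 4: read the row data
```

Fetching each page by URL keeps one page load per iteration, so there is no stale pagination control to click and no scroll-into-view dance.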