Selenium and a rotating container

Time: 2017-01-04 18:32:33

Tags: python parsing selenium web-scraping bs4

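There is a page containing a table and a Next button that refreshes the table. I can already extract the table's contents, but I need to use the Next button to move on to the remaining rows. It is some kind of AJAX table, with no href that reloads the page, so I am stuck. The page is https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017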

1 answer:

Answer 0: (score: 1)

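I would do the following:

  • Start an endless loop
  • Click the Next button; if that fails, exit the loop (this is your "break" condition)
  • Wait for the table's loading wrapper to become invisible
  • Collect the player data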

Sample implementation (using selenium only, though you may want to bring BeautifulSoup in for parsing the player data, which should be somewhat faster):

from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotVisibleException, NoSuchElementException

root = "https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017"
# PhantomJS support was removed in Selenium 4; on a newer install,
# swap in another driver such as webdriver.Chrome() or webdriver.Firefox()
driver = webdriver.PhantomJS()
driver.get(root)

# wait (up to 10 seconds) for the first page of the table to render
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#statistics-table-summary .player-link")))

# get the first 10 players
players = [player.text for player in driver.find_elements(By.CSS_SELECTOR, "#statistics-table-summary .player-link")]

while True:
    try:
        # click Next
        driver.find_element(By.LINK_TEXT, "next").click()
    except (ElementNotVisibleException, NoSuchElementException):
        break  # next is not present/visible

    # wait for the AJAX refresh to finish (the loading wrapper disappears)
    wait.until(EC.invisibility_of_element_located((By.ID, "statistics-table-summary-loading")))

    # collect the next 10 players
    players += [player.text for player in driver.find_elements(By.CSS_SELECTOR, "#statistics-table-summary .player-link")]
    print(len(players))

pprint(players)
driver.quit()  # end the whole browser session, not just the current window

Note that, as far as parsing goes, you can improve performance by using SoupStrainer to parse only the relevant table.
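
The SoupStrainer suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: SAMPLE_HTML is a made-up stand-in for driver.page_source, extract_players is a hypothetical helper name, and the markup merely mirrors the #statistics-table-summary and .player-link selectors used above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical, trimmed-down stand-in for driver.page_source.
SAMPLE_HTML = """
<html><body>
  <div id="header"><a class="player-link" href="/x">Not in the table</a></div>
  <div id="statistics-table-summary">
    <table><tbody>
      <tr><td><a class="player-link" href="/Players/1">Player One</a></td></tr>
      <tr><td><a class="player-link" href="/Players/2">Player Two</a></td></tr>
    </tbody></table>
  </div>
</body></html>
"""


def extract_players(page_source):
    """Return player names, parsing only the statistics table container."""
    # Build a tree for the element with this id (and its children) only;
    # the rest of the page is skipped during parsing.
    only_table = SoupStrainer(id="statistics-table-summary")
    soup = BeautifulSoup(page_source, "html.parser", parse_only=only_table)
    return [link.get_text(strip=True)
            for link in soup.find_all("a", class_="player-link")]


print(extract_players(SAMPLE_HTML))
```

Inside the loop you would call something like extract_players(driver.page_source) after each wait instead of reading player.text through selenium. Note that parse_only is not supported by the html5lib parser, so this sketch sticks to the built-in html.parser.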