Clicking iteratively while scraping data with Selenium and Python

Date: 2018-02-14 12:22:36

Tags: python selenium-webdriver web-scraping beautifulsoup

I am trying to scrape data from this web page:

http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting

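I need to copy the contents of the table into a CSV file, then go to the next page and append that page's contents to the same file. I am able to scrape the table, but when I try to click the Next button in a loop using Selenium WebDriver's click, it goes to the next page and stops. Here is my code.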

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path = 'path')
url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting'

def data_from_cricinfo(url):
    driver.get(url)
    pgsource = str(driver.page_source)
    soup = BeautifulSoup(pgsource, 'html5lib')
    data = soup.find_all('div', class_ = 'engineTable')
    for tr in data:
        info = tr.find_all('tr')
        # grab data

    next_link = driver.find_element_by_class_name('PaginationLink')
    next_link.click()

data_from_cricinfo(url)

Is there a way to use a loop to click Next on every page and copy the contents of all pages into the same file? Thanks in advance.

1 Answer:

Answer 0: (score: 1)

You can do the following to iterate over all pages (via the Next button) and parse the data from the table:

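from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

URL = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=5;template=results;type=batting'

driver = webdriver.Chrome()
driver.get(URL)

while True:
    # Parse whatever page is currently loaded in the browser
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    # This answer takes the third 'engineTable' element as the results table
    table = soup.find_all(class_='engineTable')[2]
    for info in table.find_all('tr'):
        data = [item.text for item in info.find_all("td")]
        print(data)

    try:
        # Click the link whose text contains 'Next' to load the following page
        driver.find_element(By.PARTIAL_LINK_TEXT, 'Next').click()
    except NoSuchElementException:
        # No such link was found (presumably the last page), so stop
        break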

driver.quit()
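
The code above only prints each row, while the question asks for everything to end up in one CSV file. A minimal sketch of that step, using only the standard `csv` module, might look like the following. The helper name `append_rows` and the sample rows are hypothetical and not part of the original answer; the sketch assumes each row is a list of cell strings, like the `data` lists built in the loop.

```python
import csv
import io


def append_rows(rows, out_file):
    """Write non-empty rows to an open text file; return how many were written."""
    writer = csv.writer(out_file)
    written = 0
    for row in rows:
        # Trim surrounding whitespace from every cell
        cells = [cell.strip() for cell in row]
        # Rows without any <td> text (for example header rows) come back
        # empty from the list comprehension, so skip them
        if not any(cells):
            continue
        writer.writerow(cells)
        written += 1
    return written


if __name__ == "__main__":
    # Stand-in data for two pages (hypothetical values, not real statistics)
    page_1 = [[], ["Player A", "10", "500"]]
    page_2 = [["Player B", "8", "320"], ["", ""]]
    buffer = io.StringIO()
    for page in (page_1, page_2):
        append_rows(page, buffer)
    print(buffer.getvalue())
```

In the scraping loop, one option would be to open the output file once before `while True:` (for instance `open('batting.csv', 'w', newline='')`, where the file name is an assumption), collect the `data` lists for each page, and pass them to `append_rows` instead of printing them. Opening the file a single time keeps all pages in the same file without reopening it in append mode on every iteration.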