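Navigating pagination with Selenium in Python

Date: 2018-08-08 10:00:58

Tags: python selenium selenium-webdriver web-scraping

I'm scraping this website with Python and Selenium. My code works, but at the moment it only scrapes the first page. I want to iterate over all the pages and scrape each of them, but the site handles pagination in an odd way. How can I step through the pages and scrape them one by one?

Pagination HTML: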

<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>

Scraper:

import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, 
executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')

url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)

def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data


def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend(getData())
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()


if __name__ == "__main__":
    main()

3 Answers:

Answer 0 (score: 3)

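Before automating any scenario, always write down the manual steps needed to carry it out. The manual steps you want (as I understand the question) are:

1) Go to the site: https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList

2) Select the first week option

3) Click search

4) Get the data from every page

5) Load the URL again

6) Select the second week option

7) Click search

8) Get the data from every page

...and so on.

You have a loop that selects the different weeks, but inside each iteration of that loop you also need a second loop that iterates over all the pages. Because you don't have one, your code only returns the data from the first page.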
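The two nested loops described above can be sketched independently of Selenium. In this minimal sketch the three callables (`open_week`, `scrape_page`, `go_to_next_page`) are hypothetical placeholders that do not appear in the question's code; in practice they would wrap the dropdown selection plus search click, `getData()`, and the click on the "Next" link.

```python
def scrape_all_weeks(week_count, open_week, scrape_page, go_to_next_page):
    """Collect rows from every page of every week.

    All three callables are hypothetical hooks supplied by the caller:
      open_week(i)       -- load the URL, select week i, click search
      scrape_page()      -- return the rows of the page currently shown
      go_to_next_page()  -- move to the next page; return False on the last one
    """
    all_rows = []
    for week_index in range(week_count):      # outer loop: one pass per week
        open_week(week_index)
        while True:                           # inner loop: one pass per page
            all_rows.extend(scrape_page())
            if not go_to_next_page():
                break
    return all_rows
```

Keeping the page loop inside the week loop is the whole point: every week is paged through completely before the next week is selected.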

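Another problem is how you locate the "Next" button:

driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()

Selecting the fourth <a> element is not reliable, because the index of the "Next" button differs from page to page. Use a better locator instead: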

driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
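To see why a positional locator is fragile, the pagination HTML from the question can be inspected offline with the standard library. This is only an illustration on the page-2 markup; it uses `xml.etree.ElementTree` rather than Selenium, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Pagination markup copied from the question (page 2 is the current page).
PAGINATION_HTML = """
<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>
"""

root = ET.fromstring(PAGINATION_HTML.strip())
links = root.findall("a")

# Positional lookup: the fourth <a> on this page is the "3" link, not "Next".
fourth_link = links[3]

# Text-based lookup: finds "Next" wherever it sits among the links.
next_link = next(a for a in links if a.text and "Next" in a.text)

print(fourth_link.text)       # 3
print(next_link.get("href"))  # /PlanningGIS/LLPG/WeeklyList/41826123,3
```

Which link sits at position four depends on how many links precede the current page, whereas the text-based lookup does not depend on position at all.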

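The logic for looping over the pages:

First, you need the number of pages. To get it, I located the <a> immediately before the "Next" button. As the screenshot shows, the text of this element equals the number of pages:

[screenshot]

I did that with the following code: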

number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)

Now that the page count is stored in number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!

Final code for the main function:

import time  # needed for time.sleep below, in addition to the question's imports

def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        # Re-locate the dropdown, because the page is reloaded on every iteration
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages - 1):
            all_data.extend(getData())
            driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
            time.sleep(1)
        # Scrape the last page, which the loop above reaches but does not read
        all_data.extend(getData())
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
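A variation that avoids counting pages is to keep clicking "Next" until the link is no longer present. The sketch below is not this answer's method. It assumes, without having verified it against the site, that the "Next" link is absent on the last page, and it takes the driver and the row-scraping function as parameters. `find_elements("xpath", ...)` is the generic Selenium locator call (`By.XPATH` is the string `"xpath"`), which returns an empty list instead of raising when nothing matches.

```python
import time

NEXT_LINK_XPATH = "//a[contains(text(),'Next')]"

def scrape_until_last_page(driver, scrape_page, delay=1.0, max_pages=1000):
    """Scrape the current page, then follow 'Next' links until none is left.

    driver      -- a Selenium WebDriver already showing the first results page
    scrape_page -- callable returning the rows of the current page (e.g. getData)
    delay       -- seconds to sleep after each click
    max_pages   -- safety cap (an assumption of this sketch) against endless loops
    """
    rows = []
    for _ in range(max_pages):
        rows.extend(scrape_page())
        next_links = driver.find_elements("xpath", NEXT_LINK_XPATH)
        if not next_links:          # no 'Next' link: assume this is the last page
            break
        next_links[0].click()
        time.sleep(delay)           # crude wait for the next page to load
    return rows
```

Inside the week loop this could be called as `all_data.extend(scrape_until_last_page(driver, getData))`. Because it does not read a page count, it also covers result sets with a single page.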

Answer 1 (score: 0)

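First, get the total number of pages from the pagination:

from bs4 import BeautifulSoup

# ins is assumed to be an existing Selenium WebDriver instance
ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
ins.find_element_by_class_name("pagination")
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class':'pagination'})
all_as = div[0].find_all('a')
total = 0

# The link just before "Next" carries the last page number
for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        total = int(all_as[i-1].text)
        break

Now just iterate over the range:

for count in range(1, int(total) + 1):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(count))

For each value of count, grab the page source and extract its data. Note: don't forget to sleep when moving from one page to the next.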
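Since the page number is the value after the comma in the URL, the list of page URLs can also be built up front from the `href` of the "Last" link. This is a minimal sketch, assuming that URL pattern holds for every result set; `page_urls` and its parameters are hypothetical names, and the example `href` is the one from the question's pagination HTML.

```python
from urllib.parse import urljoin

BASE_URL = "https://services.wiltshire.gov.uk"

def page_urls(last_href, base_url=BASE_URL):
    """Return absolute URLs for pages 1..N, given the href of the 'Last' link.

    Assumes hrefs of the form '<path>/<result id>,<page number>'.
    """
    prefix, _, last_page = last_href.rpartition(",")
    return [urljoin(base_url, "{},{}".format(prefix, page))
            for page in range(1, int(last_page) + 1)]

urls = page_urls("/PlanningGIS/LLPG/WeeklyList/41826123,4")
print(urls[0])    # https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/41826123,1
print(len(urls))  # 4
```

Each URL can then be passed to the driver's `get` method in turn, with a short sleep between requests.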

Answer 2 (score: 0)

The approach below was simple and worked for me.

driver.find_element_by_link_text("3").click()
driver.find_element_by_link_text("4").click()
....
driver.find_element_by_link_text("Next").click()
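The repeated calls above can be folded into a loop over page numbers. This is a minimal sketch, assuming every page number is visible as a link once the previous page is shown (which may not hold for long result lists); `visit_pages` is a hypothetical helper, and `find_element("link text", ...)` is the generic form of `find_element_by_link_text` (`By.LINK_TEXT` is the string `"link text"`).

```python
import time

def visit_pages(driver, last_page, on_page, delay=1.0):
    """Call on_page() on page 1, then click the links '2', '3', ... in turn.

    driver    -- a Selenium WebDriver already showing page 1
    last_page -- number of the final page to visit
    on_page   -- callable run once per page (e.g. a scraping function)
    delay     -- seconds to sleep after each click
    """
    on_page()                                   # page 1 is already displayed
    for page in range(2, last_page + 1):
        driver.find_element("link text", str(page)).click()
        time.sleep(delay)                       # give the page time to load
        on_page()
```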