Using Selenium with Python

Date: 2018-02-11 18:17:38

Tags: python selenium web-scraping

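I am trying to scrape [this][1] website, whose URL does not change when the next page is clicked. So I used Selenium to click through to the next page, but that did not work: my driver still holds the old page even after the next-page button has been clicked. Is there another way to get to the next page and scrape it?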

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup

    driver = webdriver.Safari()

    store_pages = []

    # 10306 is the total number of pages.
    for i in range(10306):
        Starting_url = 'site'

        # Load the listing page and parse its current HTML.
        driver.get(Starting_url)

        html = driver.page_source
        soup = BeautifulSoup(html, "lxml")

        store_pages.append(i)
        print(i)

        timeout = 20

        # Wait for the disclaimer label, which signals that the page has loaded.
        try:
            WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_lblDisclaimerMsg']")))
        except TimeoutException:
            print("Timed out waiting for page to load")
            driver.quit()
            break

        # Click "next" and wait for the header to report page 2.
        driver.find_element(By.NAME, "ctl00$SPWebPartManager1$g_d6877ff2_42a8_4804_8802_6d49230dae8a$ctl00$imgbtnNext").click()
        timeout = 20
        wait = WebDriverWait(driver, 10).until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a > div.act_search_results > div.act_search_header'), "206113 Record(s) | Page [2 of 10306]"))

        # The container carries two classes, so it needs a CSS selector rather than a single class name.
        NGO_element = driver.find_element(By.CSS_SELECTOR, ".faq-sub-content.exempted-result")
        NGO_name = NGO_element.find_elements(By.TAG_NAME, "h1")
        NGO_name_pancard = driver.find_elements(By.CLASS_NAME, "pan-id")
        NGO_data = NGO_element.find_elements(By.TAG_NAME, "ul")
        NGO_sub_data = NGO_element.find_elements(By.TAG_NAME, "li")

        for name_el, pan_el, data_el in zip(NGO_name, NGO_name_pancard, NGO_data):
            n_name = name_el.text.replace(pan_el.text, '')
            n_data = data_el.text
            n_pan = pan_el.text
            print("Name of NGO:", n_name, "Fields of NGO:", n_data, "Pancard number:", n_pan)

        driver.find_element(By.NAME, "ctl00$SPWebPartManager1$g_d6877ff2_42a8_4804_8802_6d49230dae8a$ctl00$imgbtnNext").click()
        # timeout = 2

1 Answer:

Answer 0 (score: 0)

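You need to make sure that, by the time you reach the next page, the content of the previous page has gone stale. Otherwise you will either hit stale element errors or keep getting the same results over and over. Try an approach along these lines and it should get you there; the rest you can adapt yourself.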
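As a rough illustration of the staleness idea, here is a minimal sketch. It is not the answerer's original code. The helper names `iter_pages` and `make_selenium_callbacks` are hypothetical, the row selector is guessed from the class names in the question, and it assumes that clicking the next button replaces the result elements in the DOM. The page is loaded once, and after each click the code waits until an element from the old page is stale before reading again.

```python
from typing import Callable, Iterator, List


def iter_pages(read_rows: Callable[[], List[str]],
               go_next: Callable[[], bool],
               max_pages: int = 10306) -> Iterator[List[str]]:
    """Yield the rows of each page, advancing until go_next() returns False."""
    for _ in range(max_pages):
        yield read_rows()
        if not go_next():
            break


def make_selenium_callbacks(driver, timeout=20):
    """Build read_rows/go_next for a driver that has already loaded the site."""
    # Imported here so iter_pages can be exercised without a browser.
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Assumption: NGO names are <h1> elements inside the results container.
    row_locator = (By.CSS_SELECTOR, ".faq-sub-content.exempted-result h1")
    next_locator = (
        By.NAME,
        "ctl00$SPWebPartManager1$g_d6877ff2_42a8_4804_8802_6d49230dae8a$ctl00$imgbtnNext",
    )

    def read_rows():
        rows = wait.until(EC.presence_of_all_elements_located(row_locator))
        return [row.text for row in rows]

    def go_next():
        try:
            old_row = driver.find_element(*row_locator)
            driver.find_element(*next_locator).click()
            # True once old_row is detached from the DOM, i.e. the previous
            # page's results have been replaced.
            wait.until(EC.staleness_of(old_row))
            return True
        except (NoSuchElementException, TimeoutException):
            # No next button, or the page never changed: treat as the last page.
            return False

    return read_rows, go_next
```

With a real browser this would be used roughly as `read_rows, go_next = make_selenium_callbacks(driver)` after a single `driver.get(...)`, followed by `for rows in iter_pages(read_rows, go_next): print(rows)`. Keeping the loop separate from the Selenium calls is only a convenience so the paging logic can be checked with stand-in functions.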
