Interacting with a JavaScript scrollable container from Python/Selenium

Date: 2016-03-21 16:29:40

Tags: javascript python selenium web-scraping census

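I am trying to use Selenium/Python to automatically download datasets from http://factfinder.census.gov. I'm new to JavaScript, so apologies if this is an easily solved problem. I'm currently writing the first part of the code, which should: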

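  1. Go here
  2. Click the "Topics" button
  3. Once "Topics" has been clicked and the new page has loaded, click "Dataset"
  4. Select the dataset I need, preferably by indexing into the (child) table.

I'm stuck on step 3. Here's a screenshot; it looks like I need to access the div with id "scrollable_container_topics" and then iterate over or index into its child nodes (in this case, I want the last child node). I've tried using execute_script and then locating the element by id and by class name, but nothing has worked so far. I'd appreciate any pointers.

[Screenshot]

This is my code: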

    import os
    import re
    import time
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support.select import Select
    
    
    # A list of all the variables we want to extract; corresponds to "Topics" field on site
    topics = ["B03003", "B05001"]
    
    # A list of all the states we want to extract data for (currently, strings; is there a numeric code?)
    states = ["New Jersey", "Georgia"]
    
    # A vector of all the years we want to extract data for [lower, upper) *Note* this != range of years covered by data
    years = range(2009, 2010)
    
    # Define the class
    class CensusSearch:
    
        # Initialize and set attributes of the query
        def __init__(self, topic, state, year):
    
            """
            :type topic: str
            :type state: str
            :type year: int
            """
            self.topic = topic
            self.state = state
            self.year = year
    
    
        def setUp(self):
    
            # self.driver = webdriver.Chrome("C:/Python34/Scripts/chromedriver.exe")
            self.driver = webdriver.Firefox()
    
        def extractData(self):
            driver = self.driver
            driver.set_page_load_timeout(1000000000000)
            driver.implicitly_wait(100)
    
            # Navigate to site; this url = after you have already chosen "Advanced Search"
            driver.get("http://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t")
            driver.implicitly_wait(10)
    
            # Filter by dataset (want the ACS 1, 3, and 5-year estimates)
    
            driver.execute_script("document.getElementsByClassName('leftnav_btn')[0].click()") # click the "Topics" button
            driver.implicitly_wait(20) 
    
            # This is where I am stuck; I've tried the following: 
            getData = driver.find_element_by_id("ygtvlabelel172")
            getData.click()
            driver.implicitly_wait(10)
    
    
            # Filter geographically: select all counties in the United States and Puerto Rico
            # Click "Geographies" button
            driver.execute_script("document.getElementsByClassName('leftnav_btn')[1].click()")
            driver.implicitly_wait(10)
    
            drop_down = driver.find_element_by_class_name("popular_summarylevel")
            select_box = Select(drop_down)
            select_box.select_by_value("050")
    
            # Once "Geography" is clicked, select "County - 050" from the drop-down menu; then select "All US + Puerto Rico"
            drop_down_counties = driver.find_element_by_id("geoAssistList")
            select_box_counties = Select(drop_down_counties)
            select_box_counties.select_by_index(1)
    
            # Click the "ADD TO YOUR SELECTIONS" button
            driver.execute_script("document.getElementsByClassName('button-g')[0].click()")
            driver.implicitly_wait(10)
    
        def tearDown(self):
            self.driver.quit()
    
        def main(self):
            #print(getattr(self))
            print(self.state)
            print(self.topic)
            print(self.year)
            self.setUp()
            self.extractData()
            self.tearDown()
    
    
    for a in topics:
        for b in states:
            for c in years:
                query = CensusSearch(a, b, c)
                query.main()
    
    print("done")
    

1 Answer:

Answer 0 (score: 1)

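A few things to fix:

  • You don't have to use the document.getElement.. methods: Selenium has its own means of locating elements on a page
  • There is no need to manipulate implicit waits (also, make sure you understand that calling implicitly_wait() does not behave like time.sleep(): you don't get an immediate delay) or the page load timeout in this case; just use Explicit Waits before performing actions on the page

Here is working code that clicks "Topics" and then "Dataset":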
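
To make the difference concrete: implicitly_wait() only sets how long element lookups keep retrying, whereas an explicit wait polls a specific condition until it holds or a timeout expires. The following is a plain-Python sketch of that polling idea, not Selenium's actual implementation; the helper name wait_until and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: it illustrates the idea behind an explicit wait,
    it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds; no fixed delay is imposed.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)
```

Unlike time.sleep(), this returns as soon as the condition is satisfied, so a generous timeout does not slow down the common case.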