Selenium / Beautiful Soup scraper fails after looping through one page (JavaScript)

Asked: 2018-08-19 19:38:56

Tags: javascript python selenium web-scraping beautifulsoup

I'm trying to scrape data on food seasonality from the Seasonal Food Guide, but I've run into trouble. The site has a fairly simple URL structure:

https://www.seasonalfoodguide.org/produce_name/state_name

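I've been able to scrape the seasonality information from one page successfully using Selenium and Beautiful Soup, but on subsequent loop iterations the piece of text I'm looking for doesn't actually load, so I get AttributeError: 'NoneType' object has no attribute 'text'. I know this happens because months_list_raw comes back as None: the 'wheel-months-list' part of the page has not loaded on the second iteration. The code is below. Any ideas?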

for ingredient in produce_list:
    for state in state_list:

        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient,state)
        driver.get(search_url)
        page_soup = soup(driver.page_source, 'lxml')

        # grab list of months
        months_list_raw = page_soup.find('p',{'id':'wheel-months-list'})
        months_list = months_list_raw.text

2 Answers:

Answer 0 (score: 1)

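The page is rendered client-side, which means that when you open it, another request is made to the backend server to fetch the data for the filters you selected. So the problem is that when you open the page and read the HTML, the content has not fully loaded yet. The simplest thing you can do is sleep for a while after opening the page with Selenium, to give the page time to load fully. I tested your code with time.sleep(3) inserted after driver.get(search_url), and it worked fine.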
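A fixed sleep either wastes time or is too short on a slow load. As a rough alternative, the sketch below polls until the value appears or a timeout expires. The helper name `wait_for_value` and its parameters are hypothetical and not part of the question's code. Selenium also ships its own `WebDriverWait` for this purpose, which may be a better fit in practice.

```python
import time


def wait_for_value(read_value, timeout=10.0, interval=0.5):
    """Call read_value() repeatedly until it returns something other than None.

    Returns the value, or None if the timeout (in seconds) expires first.
    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Possible usage with the names from the question (not executed here):
#
#     driver.get(search_url)
#     months_list_raw = wait_for_value(
#         lambda: soup(driver.page_source, 'lxml').find('p', {'id': 'wheel-months-list'})
#     )
```

Because the page source is re-read on every attempt, the loop returns as soon as the 'wheel-months-list' element shows up, and it still yields None for pages where it never does.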

Answer 1 (score: 0)

To prevent the error and keep the loop going, you need to handle the case where months_list_raw is None before reading its text. It seems some produce pages have no data for certain states, so you will need to handle that in your program as you see fit.

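import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(chrome_path)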
driver.get(commentsPage)
assert "****" in driver.title

user = driver.find_element_by_name("userName")
user.clear()
user.send_keys("USERNAME")

pas = driver.find_element_by_name("password")
pas.clear()
pas.send_keys("PASSWORD")
user.send_keys(Keys.RETURN)

driver.get(commentsPage)

for i in toplist:

    # build the XPath as a single line, with no stray quotes or embedded newline
    icnFeedback = "//a[@title='Enter comments for " + i[0] + " in a new window']"
    myElement = driver.find_element_by_xpath(icnFeedback)                       
    # find user by orgid
    driver.execute_script("arguments[0].click();", myElement)                 
    #clicks the feedback button

    time.sleep(2)
    iframes2 = driver.find_elements_by_tag_name("iframe")                     
    #looks for the iframes on main page
    driver.switch_to.frame(iframes2[1])                                       
    #this switches from main page to the iframe#2
    time.sleep(1)
    iframes3 = driver.find_elements_by_tag_name("iframe")                     
    #looks for the iframes inside iframe#2
    driver.switch_to.frame(iframes3[0])                                       
    #this switches from iframes#2 to iframe#3
    time.sleep(1)
    textBox = driver.find_element_by_id('tinymce')                            
    #finds textbox 

    comments = i[1] 
    textBox.clear()                                                           
    #clears previous text
    textBox.send_keys(comments)                                               
    #send comments
    time.sleep(2)
    driver.switch_to.default_content()                                        
    #switches out of all iframes
    iframes2 = driver.find_elements_by_tag_name("iframe")                     
    #looks for the iframes on main page
    driver.switch_to.frame(iframes2[1])                                       
    #this switches from main page to the iframe#2
    driver.find_element(By.XPATH, '//button[text()="Save"]').click()
    #looks for save button
    time.sleep(1)