Scraping an updating JavaScript page in Python

Time: 2018-04-17 00:06:17

Tags: python html selenium

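I have been working on a research project that aims to collect a list of reference articles from the Brazilian Hemeroteca (the desired page reference is http://memoria.bn.br/DocReader/720887x/839, which has to be assembled from two hidden elements on the following page: http://memoria.bn.br/DocReader/docreader.aspx?bib=720887x&pasta=ano%20189&pesq=Milho). I asked a question a few weeks ago, it was answered, and with that I was able to get things moving, but now I have run into a new problem that I am not sure how to get around.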

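The problem is that after the first form is filled in, the page redirects to a second page, which is a JavaScript/AJAX-enabled page, and I need to step through all of the matches, which is done by clicking a button at the top of the page. The issue I am running into is that when the next-page button is clicked, the elements I am working with on the page are updated, which leads to stale elements. I tried to implement some code that detects when this "stale" effect occurs, as a sign that the page has changed, but I have not had much luck with it. Here is the code I have implemented: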

import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time

saveDir = "C:/tmp"

print("Opening Page...")

browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)

print("Searching for elements")

fLink = ""
fails = 0

frame_ref = browser.find_elements_by_tag_name("iframe")[0]
iframe = browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")

search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"

xpath_form = "//input[@name=\'PesquisarBtn1\']"
xpath_journal = "//li[text()=\'"+search_journal+"\']"
xpath_timeRange = "//input[@name=\'PeriodoCmb1\' and not(@disabled)]"
xpath_timeSelect = "//li[text()=\'"+search_timeRange+"\']"
xpath_searchTerm = "//input[@name=\'PesquisaTxt1\']"

print("Locating Journal/Periodical")
journal.click()
dropDownJournal = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.XPATH, xpath_journal)))
dropDownJournal.click()
print("Waiting for Time Selection")
try:
    timeRange = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeRange)))
    timeRange.click()
    time.sleep(1)
    print("Locating Time Range")    
    dropDownTime = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeSelect)))
    dropDownTime.click()
    time.sleep(1)
except:
    print("Failed...")
print("Adding Search Term")

searchTerm = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_searchTerm)))
searchTerm.clear()
searchTerm.send_keys(search_text)
time.sleep(5)

print("Perform search")

submitButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_form)))
submitButton.click()

# Wait for the second page to load, pull what we need from it.
download_list = []

browser.switch_to_window(browser.window_handles[-1])
print("Waiting for next page to load...")

matches = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//span[@id=\'OcorNroLbl\']")))
print("Next page ready, found match element... counting")
countText = matches.text
countTotal = int(countText[countText.find("/")+1:])
print("A total of " + str(countTotal) + " matches have been found, standing by for page load.")
for i in range(1, countTotal+2):               
    print("Waiting for page " + str(i-1) + " to load...")
    while(fLink in download_list):
        try:
            jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
            jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
            fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text         
        except:
            fails += 1
            time.sleep(1)
            if(fails == 10):
                print("Locked on a page, attempting to push to next.")
                nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
                nextPageButton.click()                    
            #raise
        while(fLink == ""):
            jIDElement = browser.find_element_by_xpath("//input[@name=\'HiddenBibAlias\']")
            jPageElement = browser.find_element_by_xpath("//input[@name=\'hPagFis\']")
            fLink = "http://memoria.bn.br/DocReader/" + jIDElement.get_attribute('value') + "/" + jPageElement.get_attribute('value') + "&pesq=" + search_text                     
    fails = 0
    print("Link obtained: " + fLink)
    download_list.append(fLink)

    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()

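I am trying to solve two "bugs" with this block. First, the first page is always skipped in the loop (i.e. fLink = ""), even though there is a test for it, and I am not sure why this happens. The other bug is that the code hangs on particular pages, completely at random, and the only way out is to interrupt execution.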

This block has been modified several times, so I know it is not the most "elegant" solution, but I am starting to run out of time.

1 answer:

Answer 0: (score: 0)

After taking a day off from this to think about it (and get some more sleep), I was able to figure out what was going on. The above code has three "big faults". The first is that it does not distinguish between StaleElementReferenceException and NoSuchElementException, both of which can occur while the page is shifting. Second, the loop condition checked whether the current link was already in the list; on the first run the list is empty, so the loop body never executed and the blank link was appended directly (I should have used a do-while there, but I made more modifications). Finally, I made the silly error of only checking whether the first hidden element was changing, when in fact that is the journal ID, which is pretty much constant throughout.
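To make the second fault concrete, here is a small sketch that reproduces it without a browser. The `read_link` callable and `fake_read_link` are hypothetical stand-ins for the Selenium lookups, not part of the original script; the point is only the order of "test" and "fetch".

```python
def collect_with_while(read_link):
    """Mimic the original loop: test first, then fetch."""
    download_list = []
    f_link = ""
    while f_link in download_list:  # "" is not in [], so the body never runs
        f_link = read_link()
    download_list.append(f_link)    # the blank link is stored as "page one"
    return download_list


def collect_with_do_while(read_link):
    """Fetch first, then test, which is how a do-while would behave."""
    download_list = []
    while True:
        f_link = read_link()
        if f_link not in download_list:
            break
    download_list.append(f_link)
    return download_list


def fake_read_link():
    # Hypothetical stand-in for reading the two hidden inputs.
    return "http://memoria.bn.br/DocReader/720887x/839"


print(collect_with_while(fake_read_link))     # ['']
print(collect_with_do_while(fake_read_link))  # ['http://memoria.bn.br/DocReader/720887x/839']
```

Python has no do-while statement, so the `while True` / `break` form is the usual way to get the "run the body at least once" behavior.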

The revisions began with an adaptation of code from another SO answer to implement a "hold" condition until either one of the hidden elements changed:

import time

from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException

def hold_until_element_changed(driver, element1_xpath, element2_xpath, old_element1_text, old_element2_text):
    # Block until either hidden input holds a new value, i.e. the page has flipped.
    while True:
        try:
            element1 = driver.find_element_by_xpath(element1_xpath)
            element2 = driver.find_element_by_xpath(element2_xpath)
            if (element1.get_attribute('value') != old_element1_text) or (element2.get_attribute('value') != old_element2_text):
                break
        except StaleElementReferenceException:
            # The element was replaced between lookup and read: the page is updating.
            break
        except NoSuchElementException:
            # The element is gone entirely; report failure to the caller.
            return False
        time.sleep(1)
    return True
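One caveat with a bare `while True` poll is that it never gives up if the value never changes. As a hedged sketch (not what I actually ran), the same idea can be written with a deadline. The helper name `wait_for_change` and the `read_value` callable are assumptions made for illustration; with Selenium, `read_value` would be something like a lambda that looks up the hidden input and returns its `value` attribute, and `retry_on` would be narrowed to the two Selenium exceptions above.

```python
import time


def wait_for_change(read_value, old_value, timeout=20.0, poll=0.5,
                    retry_on=(Exception,)):
    """Poll read_value() until it returns something other than old_value.

    Returns the new value, or None if the timeout expires first.
    Exceptions listed in retry_on are treated as "page still updating".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            current = read_value()
            if current != old_value:
                return current
        except retry_on:
            pass  # element missing or stale mid-update; try again
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Usage with a fake reader that changes on the third read.
values = iter(["839", "839", "840"])
print(wait_for_change(lambda: next(values), "839", timeout=2, poll=0))  # 840
```

Returning `None` on timeout lets the caller decide whether to retry the click or stop, instead of hanging on a page.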

I then modified the original loop, going back to the "for loop" counter I had originally created, without an internal loop, and instead calling the above function to create the "hold" until the page had flipped, and voila, it worked like a charm. (NOTE: I also upped the timeout on the next-page button, as this is what caused the locking condition.)

for i in range(1, countTotal+1):               
    print("Waiting for page " + str(i) + " to load...")
    bibxpath = "//input[@name=\'HiddenBibAlias\']"
    pagexpath = "//input[@name=\'hPagFis\']"
    jIDElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, bibxpath)))
    jPageElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, pagexpath)))
    jidtext = jIDElement.get_attribute('value')
    jpagetext = jPageElement.get_attribute('value')
    fLink = "http://memoria.bn.br/DocReader/" + jidtext + "/" + jpagetext + "&pesq=" + search_text         
    print("Link obtained: " + fLink)
    download_list.append(fLink)

    if(i != countTotal):
        print("Moving to next page...")
        nextPageButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
        nextPageButton.click()
        # Wait for the next page to be ready before reading the hidden inputs again
        change = hold_until_element_changed(browser, bibxpath, pagexpath, jidtext, jpagetext)
        if not change:
            print("Something went wrong.")

All in all, a good exercise in thought and some helpful links for me to consider when posting future questions. Thanks!