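I am trying to scrape several pages of a website using Selenium and Python, but my code keeps breaking. I want to type the page number into the value box at the bottom of each page. As of now, my code does enter the page number, but it breaks once the new page loads. I have only been able to scrape the first page; as soon as the second page loads, the code breaks.
Here is my code: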
from selenium import webdriver
from selenium.common.exceptions import TimeoutException  # needed by the except clause below
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

driver = webdriver.Safari()
wait = WebDriverWait(driver, 1)
driver.get("http://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx")

# Maps the labels shown on the page to the keys used in the output dict.
call_names = {
    "Address": "Address",
    "State": "State",
    "City": "City",
    "Chief Commissioner of Income Tax Cadre Controlling Authority (CCIT- CCA) / DGIT (Exemptions)": "CCIT_DGIT_Exemptions",
    "Chief Commissioner of Income Tax (CCIT)": "CCIT",
    "Commissioner of Income Tax (CIT)": "CIT",
    "Approved under Section": "Approved_under_Section",
    "Date of Order (DD/MM/YYYY)": "Date_of_order",
    "Date of Withdrawal/Cancellation (DD/MM/YYYY)": "Date_of_withdrawal",
    "Date of Expiry (DD/MM/YYYY)": "Date_of_Expiry",
    "Remarks": "Remarks",
}

while True:
    for elem in wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,"faq-sub-content exempted-result"))):
        listofIDstoScrape = []
        name = elem.find_elements_by_class_name("fc-blue fquph")
        pancard = elem.find_elements_by_class_name("pan-id")
        details = driver.find_elements_by_class_name("exempted-detail")
        for i in details:
            pan = i.text
        wait.until(EC.presence_of_element_located((By.TAG_NAME, 'li')))
        for n, p, key in zip(name, pancard, details):
            main_list = {"Name": (n.text.replace(p.text,'')), "Pancard": p.text}
            for elem_li in key.find_elements_by_tag_name("li"):
                main_list[call_names [elem_li.find_element_by_tag_name('strong').text]] = elem_li.find_element_by_tag_name('span').text
            print (main_list)
    try:
        for k in range(2,10):
            myElem = WebDriverWait(driver, 1).until(EC.presence_of_element_located((By.ID, "ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_txtPageNumber")))
            myElem.send_keys(str(k))
            myElem.send_keys(Keys.RETURN)
            print ("Page is ready!")
            break
    except TimeoutException:
        print ("Loading took too much time!")
This is the error:
--------------------------------------------------------------------------
StaleElementReferenceException Traceback (most recent call last)
<ipython-input-66-aa6debbcbeae> in <module>()
32
33 for elem_li in key.find_elements_by_tag_name("li"):
---> 34 main_list[call_names [elem_li.find_element_by_tag_name('strong').text]] = elem_li.find_element_by_tag_name('span').text
35
36 print (main_list)
/anaconda/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py in find_element_by_tag_name(self, name)
230 - name - name of html tag (eg: h1, a, span)
231 """
--> 232 return self.find_element(by=By.TAG_NAME, value=name)
233
234 def find_elements_by_tag_name(self, name):
/anaconda/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py in find_element(self, by, value)
516
517 return self._execute(Command.FIND_CHILD_ELEMENT,
--> 518 {"using": by, "value": value})['value']
519
520 def find_elements(self, by=By.ID, value=None):
/anaconda/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py in _execute(self, command, params)
499 params = {}
500 params['id'] = self._id
--> 501 return self._parent.execute(command, params)
502
503 def find_element(self, by=By.ID, value=None):
/anaconda/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
309 response = self.command_executor.execute(driver_command, params)
310 if response:
--> 311 self.error_handler.check_response(response)
312 response['value'] = self._unwrap_value(
313 response.get('value', None))
/anaconda/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
235 elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
236 raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 237 raise exception_class(message, screen, stacktrace)
238
239 def _value_or_default(self, obj, key, default):
StaleElementReferenceException: Message: An element command failed because the referenced element is no longer available.
This is what the output looks like:
{'Name': 'INDIA INCLUSION FOUNDATION', 'Pancard': 'AABTI3598J', 'Address': 'No.250/1, 16th and 17th Cross, \nSampige Road, Malleshwaram,\nBangalore-560003.', 'State': 'KARNATAKA', 'City': 'BANGALORE', 'CCIT_DGIT_Exemptions': 'PR.CCIT BENGALURU', 'CCIT': 'CCIT(E) NEW DELHI', 'CIT': 'CIT(E) BENGALURU', 'Approved_under_Section': '12A', 'Date_of_order': '30/03/3017', 'Date_of_withdrawal': ' - ', 'Date_of_Expiry': ' - ', 'Remarks': ' - '}
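Answer 0 (score: 0)
StaleElementReferenceException:
In your case, this exception is thrown by one of the find_element_by_tag_name() calls on line 34 of your code, the line the traceback arrow points to.
Make sure the element exists. If it does, try waiting for it for a while before locating it.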
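One way to act on this advice is to retry the lookup a few times with a short pause in between. The sketch below is a minimal, library-independent version: `retry_on`, its parameters, and the default values are hypothetical names chosen for illustration and are not part of Selenium. With Selenium, `exc_type` would be `StaleElementReferenceException` from `selenium.common.exceptions`, and `action` would have to look the elements up again through `driver` on every call rather than reuse old element objects.

```python
import time


def retry_on(exc_type, action, attempts=3, delay=0.5):
    """Call action() and return its result, retrying when exc_type is raised.

    Hypothetical helper, not a Selenium API. After the last failed
    attempt the exception is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)  # give the page a moment before trying again
```

Retrying alone may only hide the symptom while the page is still being replaced, so it probably works best together with an explicit wait for the new page.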
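Answer 1 (score: 0)
I ran into a similar problem with StaleElementReferenceException.
The problem here is that after you enter the next page number and send Keys.RETURN, Selenium does find the elements you are waiting for, but they are the elements of the old page. Once the next page has loaded, those elements are no longer attached to the DOM because the new page's elements have replaced them. Selenium still tries to interact with the previous page's elements, and that raises the StaleElementReferenceException.
After pressing Keys.RETURN, you have to wait until the next page is fully loaded before restarting the loop. The wait condition has to be something other than presence_of_all_elements_located((By.CLASS_NAME, "faq-sub-content exempted-result")), because the old page's elements already satisfy that condition.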
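The control flow this implies can be sketched independently of Selenium. `scrape_pages` and its three callback parameters are hypothetical names, not something from your code or from the Selenium API; the assumption is that `read_page` locates the elements afresh and returns plain data (strings or dicts), so no element object survives a page change.

```python
def scrape_pages(read_page, go_to_page, wait_for_page, last_page):
    """Collect rows from page 1 up to last_page (inclusive).

    read_page()         -> iterable of plain values for the current page
    go_to_page(n)       -> types n into the page box and submits it
    wait_for_page(n)    -> blocks until page n has replaced the old content
    All three are hypothetical callbacks supplied by the caller.
    """
    rows = list(read_page())  # page 1 is already loaded
    for page in range(2, last_page + 1):
        go_to_page(page)
        wait_for_page(page)       # do not read before the new page is in place
        rows.extend(read_page())  # elements are looked up again for each page
    return rows
```

Compared with the loop in the question, each page number is submitted exactly once and nothing is read between submitting the number and the wait finishing.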
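A good strategy for you might be to wait until the pager entry for the page you are navigating to has the class "NumericalPagerSelected". How to wait for an element's attribute to take a specific value is described here: Using selenium webdriver to wait the attribute of element to change value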
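A wait condition along those lines could look roughly like this. It is a sketch under assumptions: `pager_shows` is a hypothetical helper, and I am assuming that the highlighted pager entry carries the CSS class `NumericalPagerSelected` and that its text is the page number, so check the actual markup first. The locator uses the string "css selector", which is the value of Selenium's `By.CSS_SELECTOR`, so the snippet itself imports nothing.

```python
def pager_shows(page_number, locator=("css selector", ".NumericalPagerSelected")):
    """Build a condition for WebDriverWait(...).until().

    Hypothetical helper. The returned callable gives back the highlighted
    pager element once its text equals page_number, and False until then.
    The default locator is an assumption about the page's markup.
    """
    expected = str(page_number)

    def condition(driver):
        # Look the pager up again on every poll; never cache the element.
        for element in driver.find_elements(*locator):
            if element.text.strip() == expected:
                return element
        return False

    return condition
```

It would be passed to an explicit wait right after sending Keys.RETURN, for example `WebDriverWait(driver, 10).until(pager_shows(k))`. A timeout longer than the 1 second used in the question is probably needed, and passing `ignored_exceptions=[StaleElementReferenceException]` to `WebDriverWait` may help if the pager itself is replaced while it is being polled.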
To see how I solved it, refer to: StaleElementException when Clicking on a TableRow in an Angular WebPage