How to scrape a web page with Selenium - find_element_by_link_text

Date: 2019-05-16 07:09:27

Tags: python selenium

I am trying to use Selenium and BeautifulSoup to "click" a javascript:void(0) link. The return value of find_element_by_link_text is not NULL, but when I look at browser.page_source nothing has been updated, so I am not sure whether the scrape succeeded.

Here is the result of running:
PageTable = soup.find('table',{'id':'rzrqjyzlTable'})
print(PageTable)
 <table class="tab1" id="rzrqjyzlTable">
 <div id="PageNav" class="PageNav" style="">
 <div class="Page" id="PageCont">
  <a href="javascript:void(0);" target="_self" class="nolink">Previous</a>3<span class="at">1</span>
  <a href="javascript:void(0);" target="_self" title="Page 2">2</a>
  <a href="javascript:void(0);" target="_self" title="Page 3">3</a>
  <a href="javascript:void(0);" target="_self" title="Page 4">4</a>
  <a href="javascript:void(0);" target="_self" title="Page 5">5</a>
  <a href="javascript:void(0);" target="_self" title="Next group" class="next">...</a>
  <a href="javascript:void(0);" target="_self" title="Last Page">45</a>
  <a href="javascript:void(0);" target="_self" title="Page 2">Next Page</a>
  <span class="txt">&nbsp;&nbsp;Jump</span><input class="txt" id="PageContgopage">
  <a class="btn_link">Go</a></div>
                        </div>

The code that clicks "Next Page" looks like this:

from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

try:
    # locate the "Next Page" link by its visible text and click it
    page = browser.find_element_by_link_text(u'Next Page')
    page.click()
    browser.implicitly_wait(3)
except NoSuchElementException:
    print("NoSuchElementException")

soup = BeautifulSoup(browser.page_source, 'html.parser')
PageTable = soup.find('table', {'id': 'rzrqjyzlTable'})
print(PageTable)

I expected browser.page_source to be updated.
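One quick way to check whether the source actually changed is to compare the highlighted page number before and after the click. This is only an illustrative check, and it assumes (based on the pagination HTML printed above) that the span with class "at" holds the current page number:

before = soup.find('span', {'class': 'at'}).text   # current page number before the click
page = browser.find_element_by_link_text(u'Next Page')
page.click()
soup = BeautifulSoup(browser.page_source, 'html.parser')
after = soup.find('span', {'class': 'at'}).text    # current page number after the click
print(before, after)   # if these are equal, page_source was not updated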

2 Answers:

Answer 0: (score: 0)

You can reload the web page after clicking Next Page.

Code:

driver.refresh()

Or use the JavaScript executor:

driver.execute_script("location.reload()")  

After that, try to get the page source the same way you did before.
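Put together, the suggestion looks roughly like the sketch below. It assumes browser is the webdriver instance from your script and that BeautifulSoup is already imported:

page = browser.find_element_by_link_text(u'Next Page')
page.click()
browser.refresh()   # or: browser.execute_script("location.reload()")
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.find('table', {'id': 'rzrqjyzlTable'}))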

Hope this helps.

Answer 1: (score: -1)

My guess is that you are pulling the source before the page (or sub-page) has reloaded. I would grab the "Next Page" button, click it, wait for it to go stale (which indicates the page is reloading), and then try to pull the source.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
page = browser.find_element_by_link_text(u'Next Page')
page.click()
wait.until(EC.staleness_of(page))
# the page should be loading/loaded at this point
# you may need to wait for a specific element to appear to make sure it has
# loaded properly, since this doesn't seem to be a full page load
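If the table is updated in place (AJAX) rather than through a full reload, the staleness check above may never fire. In that case one possible alternative, sketched below, is to wait for the pagination state itself to change. This assumes the highlighted page number lives in the span with class "at" inside the PageCont div shown in the question, and the literal '2' is only there to illustrate the first click:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

page = browser.find_element_by_link_text(u'Next Page')
page.click()
# wait until the highlighted page number in the pagination bar reads "2"
WebDriverWait(browser, 10).until(
    EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#PageCont span.at'), '2')
)
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.find('table', {'id': 'rzrqjyzlTable'}))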