BS cannot find a section by ID in a page retrieved with Selenium

Asked: 2016-03-16 14:37:28

Tags: python selenium web-scraping beautifulsoup python-requests

I am having a problem with the following code:

import re
from lxml import html
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests
import sys
import datetime

print ('start!')
print(datetime.datetime.now())

list_file = 'list2.csv'
#This should be the regular input list

url_list=["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]
#This is an example input instead

binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')
# Read somewhere that supplying the binary explicitly could help, but the
# program still fails randomly with [WinError 6] Invalid Descriptor, with
# nothing different from the runs where it at least gets the webpage.

for page in url_list:
    print(page)
    browser = webdriver.Firefox(firefox_binary=binary)
    # I tried this too, to solve the [WinError 6], but it is not working.
    browser.get(page)
    print ("TEST BEGINS")
    soup = BS(browser.page_source, "lxml")
    soup = soup.find("summaries")
    # This fails here: it finds nothing, although there is a section with
    # id "summaries". soup.find_all("p") works, but I don't want all the
    # p's outside of summaries.
    print(soup)  # It prints "None" indeed.
    print("TEST ENDS")

I am positive the page source includes "summaries". First there is

 <li> <a href="#summaries" ng-click="scrollTo('summaries')">Summaries</a></li>

and then there is

 <section id="summaries" data-ga-label="Summaries" data-section="Summaries">

As suggested by @alexce (Webscraping in python: BS, selenium, and None error), I tried

 summary = soup.find('section', attrs={'id':'summaries'})

(EDIT: the suggestion was _summaries, but I tested summaries as well.)

But it does not work either. So my questions are: why does BS not find "summaries", and why does Selenium keep breaking when I run the script consecutively (restarting the console works, but that is tedious) or with a list of more than four entries? Thanks.

2 answers:

Answer 0 (score: 1)

This:

summary = soup.find('section', attrs={'id':'_summaries'})

searches for a section element whose id attribute is set to _summaries:

<section id="_summaries" />

There is no element with those attributes in the page. The one you want is probably <section id="summaries" data-ga-label="Summaries" data-section="Summaries">, and it can be matched with:

 results = soup.find('section', id='summaries')

(In BeautifulSoup only class needs a trailing underscore, as class_, because class is a Python keyword; id can be passed as-is.)
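The same section can also be matched with a CSS selector (a sketch; soup.select_one is available in bs4 4.4+):

 summary = soup.select_one('section#summaries')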

Also, note the reason you are using Selenium: the page returns an error if you do not forward cookies. So, in order to use requests, you need to send the cookies along.

My full code:

from __future__ import unicode_literals

import re
import requests
from bs4 import BeautifulSoup as BS


data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

soup = BS(data, 'lxml')  # explicit parser avoids bs4's "no parser specified" warning
results = soup.find_all(string=re.compile('summary', re.I))
print(results)
results = soup.find('section', id='summaries')
print(results)
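Note that the visid_incap/incap_ses cookie values above come from one live browser session; they expire, so they have to be copied fresh from your browser (or harvested with Selenium) for the requests approach to keep working.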

Answer 1 (score: 1)

The element might simply not be on the page yet. I would wait for it to appear before parsing the page source with BS:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
soup = BS(driver.page_source, "lxml")

I noticed you never call driver.quit(), which may be the cause of your problems. So make sure to call it, or try to reuse the same browser session across URLs.
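A minimal sketch of that second option, reusing one driver for the whole list and always quitting it (names taken from the question's code):

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

url_list = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]
binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')

# Create one driver before the loop instead of one per URL; leaked
# Firefox processes are a plausible cause of the random [WinError 6].
browser = webdriver.Firefox(firefox_binary=binary)
try:
    for page in url_list:
        browser.get(page)
        soup = BS(browser.page_source, "lxml")
        print(soup.find('section', id='summaries') is not None)
finally:
    browser.quit()  # always release the browser, even on errors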

To make it more stable and performant, I would also try to stay within the Selenium API as much as possible, since pulling and parsing the page source with BS is quite expensive.
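For instance, the section's text can be read without BeautifulSoup at all (a sketch building on the wait shown above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
    # wait.until() returns the located WebElement once it is visible
    summaries = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "summaries")))
    print(summaries.text)  # rendered text of the section and its children
finally:
    driver.quit()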