I am having a problem with the following code:
import re
from lxml import html
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests
import sys
import datetime

print('start!')
print(datetime.datetime.now())

list_file = 'list2.csv'
# This should be the regular input list;
# the following is an example input instead:
url_list = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]

binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')
# I read somewhere that supplying the binary explicitly could help, but the program
# still fails randomly at times with [WinError 6] Invalid Descriptor, with nothing
# different from the runs where it at least manages to get the web page (even when
# it cannot perform any further operation).

for page in url_list:
    print(page)
    browser = webdriver.Firefox(firefox_binary=binary)
    # I tried this too to solve the [WinError 6], but it is not working.
    browser.get(page)
    print("TEST BEGINS")
    soup = BS(browser.page_source, "lxml")
    soup = soup.find("summaries")
    # It fails here: find() returns nothing, although there is a section whose id
    # is "summaries". soup.find_all("p") works, but I don't want all the <p> tags
    # outside of summaries.
    print(soup)  # It prints "None" indeed.
    print("TEST ENDS")
I am positive the page source includes "summaries". First there is:
<li> <a href="#summaries" ng-click="scrollTo('summaries')">Summaries</a></li>
and then there is:
<section id="summaries" data-ga-label="Summaries" data-section="Summaries">
As suggested by @alexce (Webscraping in python: BS, selenium, and None error), I tried:
summary = soup.find('section', attrs={'id':'summaries'})
(EDIT: the suggestion was _summaries, but I tested summaries too.)
But it does not work either. So my questions are: why doesn't BS find "summaries", and why does Selenium keep breaking when I run the script back to back (restarting the console works, but that is tedious) or with a list of more than four entries? Thanks
Answer 0 (score: 1)
This:
summary = soup.find('section', attrs={'id':'_summaries'})
searches for a section element whose id attribute is set to _summaries:
<section id="_summaries" />
There is no element with those attributes in the page.
The one you want is probably:
<section id="summaries" data-ga-label="Summaries" data-section="Summaries">
and it can be found with:
results = soup.find('section', id='summaries')
(Note the keyword is id, not id_: in BeautifulSoup only class_ takes a trailing underscore, because class is a reserved word in Python.)
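As a quick illustration, here is a minimal, self-contained sketch using an inline HTML fragment as a stand-in for the real page source (the live section may only appear after JavaScript runs, see the other answer):

from bs4 import BeautifulSoup as BS

# Hypothetical fragment standing in for the real GeneCards page source
fragment = '<section id="summaries" data-ga-label="Summaries">example summary text</section>'
soup = BS(fragment, 'lxml')

# The keyword filter and the attrs dict are equivalent lookups
print(soup.find('section', id='summaries').get_text())
print(soup.find('section', attrs={'id': 'summaries'}).get_text())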
Also, note the reason you are using Selenium here: the page returns an error if you do not forward cookies. So, to use requests instead, you need to send the cookies.
My full code:
from __future__ import unicode_literals

import re

import requests
from bs4 import BeautifulSoup as BS


data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

soup = BS(data, 'lxml')  # pass an explicit parser to avoid the bs4 warning
# Sanity check: list every string on the page containing "summary" (case-insensitive)
results = soup.find_all(string=re.compile('summary', re.I))
print(results)
summary_re = re.compile('summary', re.I)
results = soup.find('section', id='summaries')
print(results)
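Note that the nlbi_146342, visid_incap_146342 and incap_ses_189_146342 values appear to be Incapsula session cookies copied from a live browser session, so they will expire; to keep the requests approach working you would have to grab fresh values, for example from your browser's developer tools.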
Answer 1 (score: 1)
The element is probably not present on the page yet when you parse it. I would wait for the element before parsing the page source with BS:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
soup = BS(driver.page_source,"lxml")
I noticed that you never call driver.quit(), and this might be the reason for the problems you are having. So make sure to call it, or try to reuse the same session.
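A minimal sketch of that pattern, reusing the url_list from the question: create the browser once outside the loop and quit it in a finally block, so the Firefox process is released even if a page fails:

from bs4 import BeautifulSoup as BS
from selenium import webdriver

url_list = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]

browser = webdriver.Firefox()  # one session reused for every URL
try:
    for page in url_list:
        browser.get(page)
        soup = BS(browser.page_source, "lxml")
        print(soup.find('section', id='summaries'))
finally:
    browser.quit()  # always shut the browser down, even after an exception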
To make it more stable and performant, I would also try to use the Selenium API as much as possible, since pulling and parsing the whole page source is expensive.
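For example, instead of handing the whole page source to BeautifulSoup, Selenium can locate the section and read its rendered text directly; a sketch combining that with the explicit wait above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
    wait = WebDriverWait(driver, 10)
    # until() returns the located element, so it can be used right away
    section = wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
    print(section.text)  # rendered text of the summaries section only
finally:
    driver.quit()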