Selenium: element.text is very slow and I don't know why

Date: 2019-03-07 23:28:16

Tags: python-3.x selenium

from time import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://nameberry.com/popular_names/US')
boys_names = driver.find_elements_by_css_selector("""tr.even>.boys""")
girls_names = driver.find_elements_by_css_selector("""tr.even>.girls""")
# so this goes quickly

def list_gen(ls): 
    hugo = []
    for i in ls:
        hugo.append(i.text)
    return hugo


i = time()
boys_names = list_gen(boys_names) # takes each <a> tag found before contained in boys_names and creates a list
# of names by taking everything CONTAINED (NOT attributes) between the opening and closing tag <a>
e = time()
print(e-i) # gives ~ 50 sec

i = time()
girls_names = list_gen(girls_names) # same thing but with girl names
e = time()
print(e-i) # gives ~ 80 sec 
# those timings are consistent even though the number of boys and girls is the same,
# which is also weird
# the number is 1000, by the way, so that's quite a lot

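So basically I'm confused about why this takes so long. I've concluded that, for some reason, element.text is what takes most of the time. Is there a way to speed this up without importing other modules?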

2 Answers:

Answer 0: (score: 1)

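I think your code takes so long because the loop in list_gen sends a lot of requests to the web page as it runs. If you set a breakpoint inside the loop and watch the Network tab of the browser's developer tools while it runs, you will see a large number of requests begin when the loop starts. I believe this happens because the page loads new elements as Selenium scrolls down. As far as I know, if you want it to be faster you should use a different approach. My suggestion is to use Beautiful Soup.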

from selenium import webdriver
from time import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

i = time()
driver.get('https://nameberry.com/popular_names/US')
# parse the whole page source once instead of querying each element through the driver
soup = BeautifulSoup(driver.page_source, 'html5lib')

# attrs must be a dict ({"class": "boys"}), not a set ({"class", "boys"})
boys_names = [x.getText() for x in soup.find_all("td", {"class": "boys"})]
girls_names = [x.getText() for x in soup.find_all("td", {"class": "girls"})]

e = time()
print(e - i) # gives ~ 14 sec for me

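This fetches the entire page source in one go and parses it, instead of working through the list of WebDriver element objects returned by the CSS selector.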
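Since the question asks for a way that avoids extra third-party modules, the same "parse the source once" idea can be sketched with the standard library's html.parser. This is only a minimal sketch: the class name CellTextCollector, the helper extract_names, and the SAMPLE markup (including the names in it) are made up for illustration, and it assumes each name sits inside a non-nested <td> whose class list contains "boys" or "girls". With Selenium you would pass driver.page_source instead of SAMPLE.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text of every <td> whose class list contains wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.names = []         # one entry per matching cell
        self._in_cell = False   # True while inside a matching <td>
        self._buffer = []       # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self._in_cell = True
                self._buffer = []

    def handle_data(self, data):
        # Text of nested tags such as <a> also arrives here.
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.names.append("".join(self._buffer).strip())
            self._in_cell = False


def extract_names(page_source, wanted_class):
    """Parse page_source once and return the text of all matching cells."""
    parser = CellTextCollector(wanted_class)
    parser.feed(page_source)
    parser.close()
    return parser.names


# Made-up markup that only imitates the assumed table structure.
SAMPLE = """
<table>
  <tr class="even"><td class="boys"><a href="#">Liam</a></td>
                   <td class="girls"><a href="#">Emma</a></td></tr>
  <tr class="odd"><td class="boys"><a href="#">Noah</a></td>
                  <td class="girls"><a href="#">Olivia</a></td></tr>
</table>
"""

print(extract_names(SAMPLE, "boys"))
print(extract_names(SAMPLE, "girls"))
```

Like the Beautiful Soup version, this matches cells in every row, not just tr.even, so filter further if you only want the even rows.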

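If you are not using the Selenium browser for anything else and only want the names, you can use requests to fetch the page source even faster, because you do not need to start a Selenium browser.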

import requests
from time import time
from bs4 import BeautifulSoup

i = time()

response = requests.get('https://nameberry.com/popular_names/US')
soup = BeautifulSoup(response.content, 'html5lib')
# attrs must be a dict ({"class": "boys"}), not a set ({"class", "boys"})
boys_names = [x.getText() for x in soup.find_all("td", {"class": "boys"})]
girls_names = [x.getText() for x in soup.find_all("td", {"class": "girls"})]

e = time()
print(e - i) # gives ~ 3.2 sec

Answer 1: (score: 0)

You can use JavaScript to return the values in under 2 seconds.

Array.from(document.querySelectorAll('tr.even>.girls')).map(function(element) {return element.textContent;})

Just run it in your browser console and you will see the result.

Now you can call this JavaScript from your Python Selenium script, for example:

driver.execute_script("return Array.from(document.querySelectorAll('tr.even>.girls')).map(function(element) {return element.textContent;})")
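If you need both columns, the call above can be wrapped in a small helper that takes the selector as a script argument. This is a rough sketch: the names JS_COLLECT_TEXT and texts_for_selector are hypothetical, and driver is assumed to be an already created Selenium WebDriver with the page loaded. Note that textContent returns the raw text without the whitespace clean-up that element.text does, so the helper strips each value.

```python
# JavaScript body run in the page; arguments[0] is the CSS selector passed from Python.
JS_COLLECT_TEXT = """
var selector = arguments[0];
return Array.from(document.querySelectorAll(selector)).map(function (element) {
    return element.textContent;
});
"""


def texts_for_selector(driver, css_selector):
    """Return the stripped textContent of every match, using one execute_script call.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    execute_script(script, *args) works).
    """
    raw_texts = driver.execute_script(JS_COLLECT_TEXT, css_selector)
    return [text.strip() for text in raw_texts]
```

Usage would look like boys_names = texts_for_selector(driver, 'tr.even>.boys') followed by girls_names = texts_for_selector(driver, 'tr.even>.girls'), so each list costs a single round trip to the browser instead of one per element.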

Give it a try and let us know.