driver.get('https://nameberry.com/popular_names/US')
boys_names = driver.find_elements_by_css_selector("""tr.even>.boys""")
girls_names = driver.find_elements_by_css_selector("""tr.even>.girls""")
# so this goes quickly
def list_gen(ls):
    hugo = []
    for i in ls:
        hugo.append(i.text)
    return hugo
i = time()
boys_names = list_gen(boys_names) # takes each <a> tag found before contained in boys_names and creates a list
# of names by taking everything CONTAINED (NOT attributes) between the opening and closing tag <a>
e = time()
print(e-i) # gives ~ 50 sec
i = time()
girls_names = list_gen(girls_names) # same thing but with girl names
e = time()
print(e-i) # gives ~ 80 sec
# these timings are consistent even though the number of boys' and girls' names is the same,
# which is also weird
# (there are 1000 of each, so that's quite a lot)
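So basically I'm confused about why this takes so long. I've concluded that, for some reason, element.text is what takes the most time. Is there a way to speed this up without importing other modules?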
Answer 0: (score: 1)
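I think your code takes so long because the loop in list_gen sends a bunch of requests to the web page as it iterates. If you set a breakpoint inside the loop and watch the browser's Network tab in the dev tools while it runs, you'll see a large number of requests starting once the loop begins. I think this is because the page loads new elements as Selenium scrolls down.
As far as I can tell, if you want it to be faster you should use a different approach. My suggestion is to use Beautiful Soup.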
from selenium import webdriver
from time import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
i = time()
driver.get('https://nameberry.com/popular_names/US')
soup = BeautifulSoup(driver.page_source, 'html5lib')
boys_names = [x.getText() for x in soup.find_all("td", {"class": "boys"})]  # attrs must be a dict, not a set
girls_names = [x.getText() for x in soup.find_all("td", {"class": "girls"})]
e = time()
print(e - i) # gives ~ 14 sec for me
This fetches the whole page source at once and parses it, instead of working through the list of WebDriver objects returned by the CSS selector.
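Since the question asks about avoiding extra modules, here is a minimal sketch of the same parse-once idea using only the standard library's html.parser. The class name CellTextParser, the helper extract_names, and the sample markup are all hypothetical; the real page's markup may differ, so treat this as an illustration rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class list contains wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.names = []
        self._inside = False   # True while we are inside a matching <td>
        self._chunks = []      # text fragments seen inside the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.wanted_class in classes:
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._inside:
            self._inside = False
            text = "".join(self._chunks).strip()
            if text:
                self.names.append(text)


def extract_names(page_source, wanted_class):
    """Parse the page source once and return the matching cell texts."""
    parser = CellTextParser(wanted_class)
    parser.feed(page_source)
    parser.close()
    return parser.names


# Hypothetical sample markup standing in for driver.page_source
sample = """
<table>
  <tr class="even"><td class="boys"><a href="/b/liam">Liam</a></td><td class="girls"><a href="/g/olivia">Olivia</a></td></tr>
  <tr class="odd"><td class="boys"><a href="/b/noah">Noah</a></td><td class="girls"><a href="/g/emma">Emma</a></td></tr>
</table>
"""

print(extract_names(sample, "boys"))
print(extract_names(sample, "girls"))
```

With a live browser you would pass driver.page_source instead of sample. The point is the same as above: one call fetches the HTML, and all the text extraction then happens locally in Python.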
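If you aren't using the Selenium browser for anything else and just want the names, you can use requests to fetch the page source even faster, since you don't need to start the Selenium browser.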
import requests
from time import time
from bs4 import BeautifulSoup
i = time()
response = requests.get('https://nameberry.com/popular_names/US')
soup = BeautifulSoup(response.content, 'html5lib')
boys_names = [x.getText() for x in soup.find_all("td", {"class": "boys"})]  # attrs must be a dict, not a set
girls_names = [x.getText() for x in soup.find_all("td", {"class": "girls"})]
e = time()
print(e - i) # gives ~ 3.2 sec
Answer 1: (score: 0)
You can use JavaScript to return the values in under 2 seconds.
Array.from(document.querySelectorAll('tr.even>.girls')).map(function(element) {return element.textContent;})
Just run it in your browser console and you'll see the result.
Now you can call this JavaScript from your Python Selenium script, for example:
driver.execute_script("return Array.from(document.querySelectorAll('tr.even>.girls')).map(function(element) {return element.textContent;})")
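To avoid repeating the script for boys and girls, the call can be wrapped in a small helper. This is only a sketch: the function name texts_by_selector is hypothetical, and it assumes driver is an already-created Selenium WebDriver. The selector is passed as a script argument (available as arguments[0] in the JavaScript) instead of being pasted into the script string.

```python
def texts_by_selector(driver, css_selector):
    """Return the textContent of every element matching css_selector.

    All matching and text extraction happen inside the browser in a single
    execute_script call, so Python does not make one call per element.
    """
    script = (
        "return Array.from(document.querySelectorAll(arguments[0]))"
        ".map(function (element) { return element.textContent; });"
    )
    return driver.execute_script(script, css_selector)


# Usage, assuming `driver` has already loaded the page:
# boys_names = texts_by_selector(driver, "tr.even>.boys")
# girls_names = texts_by_selector(driver, "tr.even>.girls")
```

Passing the selector as an argument keeps quoting simple when the selector itself contains quotes.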
Try it and let us know.