Question

I'm new to web scraping and Python. I wrote a script before that worked fine. This one does basically the same thing, but it runs much slower. Here is my code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import selenium
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
import time

start = time.time()
opp = Options()
opp.add_argument('-headless')
browser = webdriver.Firefox(executable_path = "/Users/0581279/Desktop/L&S/Watchlist/geckodriver", options=opp)
browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")

c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end-start)

Sometimes a single page can take up to 2 minutes to load. Bloomberg is the only site I'm scraping. Any help would be appreciated :)

Answer 1

I made a few changes to your code and the page now loads almost instantly. I used the Chrome driver I already had installed and ran the following code.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import selenium
import time

start = time.time()
browser = webdriver.Chrome("/Users/XXXXXXXX/Desktop/Programming/FacebookControl/package/chromedriver")
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")

c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end-start)

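Bloomberg did block me during testing, so you may need to change the request headers from time to time. The script still printed the price.

chromedriver link: http://chromedriver.chromium.org/

Hope this helps.

The output was: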

34.54
7.527994871139526
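
The header change mentioned in this answer could be sketched roughly as follows. This is only an illustration: `USER_AGENTS` and `build_headers` are hypothetical names, the second User-Agent string is just an example value, and nothing here is known to be what Bloomberg actually checks.

```python
import random

# Example pool of User-Agent strings (illustrative values, extend as needed).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) "
    "Gecko/20100101 Firefox/65.0",
]


def build_headers(user_agents, rng=random):
    """Return request headers with a User-Agent picked at random from the pool."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    return {
        "User-Agent": rng.choice(user_agents),
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
    }
```

With requests, the result can be passed as `requests.get(url, headers=build_headers(USER_AGENTS))`. With Selenium and Chrome, only the User-Agent can be set this simply, typically through the `--user-agent=...` browser argument. Whether rotating headers is enough to avoid being blocked is an open question.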

Answer 2

Selenium's speed depends on several factors, for example:

If the site is slow, the Selenium script is slow.

If the performance of the internet connection is not good, the Selenium script is slow.

If the computer running the script is not performing well, the Selenium script is slow.

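Those factors are usually out of our hands, but the code is not. One way to speed things up is to stop images from loading when you don't need them, which shortens the page-load time. With Chrome, it can be done like this: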

opp.add_argument('--blink-settings=imagesEnabled=false')

Once the driver is open, you don't need BeautifulSoup to extract the data; Selenium's own functions can do that. Try the code below, which should run faster:

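from selenium import webdriver
# The image-blocking flag below is a Chrome (Blink) switch, so use Chrome's Options
from selenium.webdriver.chrome.options import Options
import time

start = time.time()
opp = Options()
# Skip image downloads to shorten page-load time
opp.add_argument('--blink-settings=imagesEnabled=false')

driver_path = r'Your driver path'
# Selenium 3-style call; newer Selenium releases take the driver path via a Service object
browser = webdriver.Chrome(executable_path=driver_path, options=opp)

browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")

# Query the live DOM directly instead of re-parsing page_source with BeautifulSoup
# (Selenium 3-style call; newer releases use find_elements(By.CSS_SELECTOR, ...))
get_element = browser.find_elements_by_css_selector("span[class='fieldValue__2d582aa7']")

print(get_element[6].text)
browser.quit()

end = time.time()
print(end-start)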
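
The script above reports only one total time, which does not show whether the browser start-up, the page load, or the extraction is the slow part. A minimal sketch for timing each stage separately might look like this; `timed` and `report` are hypothetical helper names, not part of Selenium.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock duration of the enclosed block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def report(timings):
    """Return 'label: seconds' lines, slowest stage first."""
    ordered = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return ["%s: %.2fs" % (label, seconds) for label, seconds in ordered]


# Intended use with the Selenium script (not executed here):
#
#   timings = {}
#   with timed("start browser", timings):
#       browser = webdriver.Chrome(executable_path=driver_path, options=opp)
#   with timed("load page", timings):
#       browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
#   print("\n".join(report(timings)))
```

If "load page" dominates, the site or the connection is the likely bottleneck. If "start browser" dominates, reusing one browser instance across several pages should help more than any page-level tweak.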

Answer 3

Using requests and BeautifulSoup is a quick and convenient way to scrape the information. The code below fetches the key statistics for MSGFINA:LX from Bloomberg:

import requests
from bs4 import BeautifulSoup

headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.119 Safari/537.36',
    'DNT': '1'
}

response = requests.get('https://www.bloomberg.com/quote/MSGFINA:LX', headers=headers)
page = BeautifulSoup(response.text, "html.parser")

key_statistics = page.select("div[class^='module keyStatistics'] div[class^='rowListItemWrap']")
for key_statistic in key_statistics:
    fieldLabel = key_statistic.select_one("span[class^='fieldLabel']")
    fieldValue = key_statistic.select_one("span[class^='fieldValue']")
    print("%s: %s" % (fieldLabel.text, fieldValue.text))
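
The `[class^='fieldValue']` selectors above match on the start of the class attribute, so they keep working if the hashed suffix (such as `__2d582aa7`) changes. Pairing each label with its value also avoids relying on a fixed position like index 6. The sketch below shows the same idea on made-up markup. To stay runnable without extra packages it uses the standard library's `html.parser` in place of BeautifulSoup, and `SAMPLE_HTML`, `KeyStatsParser`, and `parse_key_statistics` are hypothetical names.

```python
from html.parser import HTMLParser

# Invented sample markup; this is NOT Bloomberg's actual page structure.
SAMPLE_HTML = """
<div class="rowListItemWrap__abc123">
  <span class="fieldLabel__abc123">Price</span>
  <span class="fieldValue__abc123">34.54</span>
</div>
<div class="rowListItemWrap__abc123">
  <span class="fieldLabel__abc123">Currency</span>
  <span class="fieldValue__abc123">USD</span>
</div>
"""


class KeyStatsParser(HTMLParser):
    """Pair the text of fieldLabel* spans with the following fieldValue* span."""

    def __init__(self):
        super().__init__()
        self.stats = {}     # label text -> value text
        self._kind = None   # "label" or "value" while inside a matching span
        self._label = None  # most recent label still waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class") or ""
        # Prefix match, equivalent to the CSS selector span[class^='...']
        if css_class.startswith("fieldLabel"):
            self._kind = "label"
        elif css_class.startswith("fieldValue"):
            self._kind = "value"

    def handle_endtag(self, tag):
        if tag == "span":
            self._kind = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._kind == "label":
            self._label = text
        elif self._kind == "value" and self._label is not None:
            self.stats[self._label] = text
            self._label = None


def parse_key_statistics(html):
    """Return a dict mapping each field label to its value."""
    parser = KeyStatsParser()
    parser.feed(html)
    parser.close()
    return parser.stats
```

Calling `parse_key_statistics(SAMPLE_HTML)` returns a dictionary keyed by label, so a value can be looked up by name instead of by position. The sketch assumes the spans are not nested, which may not hold on the real page.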

Selenium is really slow for me. Is there something wrong with my code?