Web scraping with a Selenium Grid Docker cluster

Date: 2018-07-03 14:14:06

Tags: selenium selenium-webdriver scrapy selenium-chromedriver selenium-grid

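I am scraping websites using Selenium Grid on Docker. With only one Chrome node the grid works fine, but when I scale the grid to multiple Chrome nodes it stops working. The cursor blinks for a while and then a long error message is shown.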

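from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    start_urls = ['https://google.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Note: these options are never passed to Remote below,
        # so --headless has no effect as written.
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')

        # Ask the grid hub for a Chrome session.
        self.driver = webdriver.Remote(command_executor='http://localhost:5000/wd/hub',
            desired_capabilities=DesiredCapabilities.CHROME)

    def parse(self, response):
        # driver.get() returns None, so read the result from the driver afterwards.
        self.driver.get(response.url)
        print(self.driver.title,'/////////////')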

Then I opened a Python shell and typed the same code line by line:

Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
>>> from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
>>> options = webdriver.ChromeOptions()
>>> options.add_argument('--headless')
>>> driver = webdriver.Remote(command_executor='http://localhost:5000/wd/hub',
...             desired_capabilities=DesiredCapabilities.CHROME)

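As you can see, it hangs at webdriver.Remote. The cursor just blinks for a long time and then a long error message is shown. I think the problem is in the line webdriver.Remote(command_executor='http://localhost:5000/wd/hub', ... desired_capabilities=DesiredCapabilities.CHROME).

Can anyone suggest a solution to this problem? Please note that the Selenium Grid works when there is only one Chrome node; it stops working when I scale to multiple Chrome nodes.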

Here is the error message that appears after a long wait:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vicky/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/vicky/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 251, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/vicky/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
    self.error_handler.check_response(response)
  File "/home/vicky/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Error forwarding the new session Error forwarding the request Connect to 172.18.0.8:5555 [/172.18.0.8] failed: Connection timed out (Connection timed out)
Stacktrace:
    at org.openqa.grid.web.servlet.handler.RequestHandler.process (RequestHandler.java:117)
    at org.openqa.grid.web.servlet.DriverServlet.process (DriverServlet.java:84)
    at org.openqa.grid.web.servlet.DriverServlet.doPost (DriverServlet.java:68)
    at javax.servlet.http.HttpServlet.service (HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service (HttpServlet.java:790)
    at org.seleniumhq.jetty9.servlet.ServletHolder.handle (ServletHolder.java:860)
    at org.seleniumhq.jetty9.servlet.ServletHandler.doHandle (ServletHandler.java:535)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:188)
    at org.seleniumhq.jetty9.server.session.SessionHandler.doHandle (SessionHandler.java:1595)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextHandle (ScopedHandler.java:188)
    at org.seleniumhq.jetty9.server.handler.ContextHandler.doHandle (ContextHandler.java:1253)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:168)
    at org.seleniumhq.jetty9.servlet.ServletHandler.doScope (ServletHandler.java:473)
    at org.seleniumhq.jetty9.server.session.SessionHandler.doScope (SessionHandler.java:1564)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.nextScope (ScopedHandler.java:166)
    at org.seleniumhq.jetty9.server.handler.ContextHandler.doScope (ContextHandler.java:1155)
    at org.seleniumhq.jetty9.server.handler.ScopedHandler.handle (ScopedHandler.java:141)
    at org.seleniumhq.jetty9.server.handler.HandlerWrapper.handle (HandlerWrapper.java:132)
    at org.seleniumhq.jetty9.server.Server.handle (Server.java:530)
    at org.seleniumhq.jetty9.server.HttpChannel.handle (HttpChannel.java:347)
    at org.seleniumhq.jetty9.server.HttpConnection.onFillable (HttpConnection.java:256)
    at org.seleniumhq.jetty9.io.AbstractConnection$ReadCallback.succeeded (AbstractConnection.java:279)
    at org.seleniumhq.jetty9.io.FillInterest.fillable (FillInterest.java:102)
    at org.seleniumhq.jetty9.io.ChannelEndPoint$2.run (ChannelEndPoint.java:124)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.doProduce (EatWhatYouKill.java:247)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.produce (EatWhatYouKill.java:140)
    at org.seleniumhq.jetty9.util.thread.strategy.EatWhatYouKill.run (EatWhatYouKill.java:131)
    at org.seleniumhq.jetty9.util.thread.ReservedThreadExecutor$ReservedThread.run (ReservedThreadExecutor.java:382)
    at org.seleniumhq.jetty9.util.thread.QueuedThreadPool.runJob (QueuedThreadPool.java:708)
    at org.seleniumhq.jetty9.util.thread.QueuedThreadPool$2.run (QueuedThreadPool.java:626)

I have also attached a screenshot of the Selenium Grid console taken while multiple nodes were in use (link to the picture).

1 answer:

Answer 0: (score: 0)

It looks like you are starting the new Selenium nodes with Firefox, but your test specifically targets Chrome.
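To illustrate why such a mismatch would leave a Chrome request unserved, here is a simplified model of capability matching. It is only a sketch, not the Grid's actual matching logic (which also considers things like version and platform). The functions `matches` and `find_slot`, the node names, and the slot dictionaries are all hypothetical.

```python
def matches(requested, node_slot):
    """Return True if every requested capability equals the slot's value."""
    return all(node_slot.get(key) == value for key, value in requested.items())


def find_slot(requested, slots):
    """Return the first slot that satisfies the request, or None if none does."""
    for slot in slots:
        if matches(requested, slot):
            return slot
    return None


# Hypothetical grid in which every registered node offers Firefox only.
firefox_only = [
    {"browserName": "firefox", "node": "node-1"},
    {"browserName": "firefox", "node": "node-2"},
]

# The request from the question asks for Chrome.
chrome_request = {"browserName": "chrome"}

# No slot offers Chrome, so the request cannot be assigned to any node.
print(find_slot(chrome_request, firefox_only))
```

In this model the fix is either to register nodes whose `browserName` is `chrome` or to request the browser the nodes actually provide.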

I suggest using Zalenium to set up your Selenium Grid: https://github.com/zalando/zalenium
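Whichever grid setup is used, the traceback in the question shows the hub timing out while connecting to 172.18.0.8:5555. The following minimal sketch, using only the standard library, checks whether a TCP connection to a node address can be opened at all. The helper name `can_connect` and the timeout values are assumptions for illustration. The check is only meaningful when run from a place that shares the hub's network, for example inside the hub container.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the node address from the error message:
# can_connect("172.18.0.8", 5555, timeout=3.0)
```

If this returns False for a node that the grid console lists as registered, the hub cannot reach the address the node advertised, which would be consistent with the connection timeout in the error message.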