Question

Hello!

This is a question for people who use Scrapinghub, shub-image, Selenium + PhantomJS, and Crawlera. Sorry, my English is not good.

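I need to scrape a site that relies heavily on JavaScript, so I use Scrapy + Selenium. It has to run on Scrapy Cloud. I wrote a spider that uses Scrapy + Selenium + PhantomJS and ran it on my local machine. Everything was fine. Then I deployed the project to Scrapy Cloud using shub-image. The deployment went fine too, but the result of webdriver.page_source is different. Locally it is fine; in the cloud it is not (the request returns HTTP 200, but the HTML contains a 403 message). Then I decided to use my Crawlera account. I added it like this: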

service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]

For Windows (local):

self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe',service_args=service_args)

For the Docker instance:

self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)

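Again, everything is fine locally, but not in the cloud. I checked the Crawlera stats and they look fine: requests are being sent from both environments (local and cloud).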

Note again: the proxy (Crawlera) is the same in both cases. Response on Windows: HTTP 200, HTML with the correct content.

Response on Scrapy Cloud (Docker instance): HTTP 200, HTML containing a 403 (Forbidden) message.

I don't understand what is wrong. I think it may be a difference between the PhantomJS versions (Windows vs. Linux).
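A minimal sketch of how that hypothesis could be checked: log what PhantomJS reports about itself in each environment and compare the two outputs. `describe_driver` is a hypothetical helper name, not part of Selenium; it only assumes the driver object exposes `capabilities` and `execute_script`, as Selenium WebDriver instances do.

```python
def describe_driver(driver):
    """Collect details that may differ between the Windows and Linux runs."""
    caps = driver.capabilities  # dict reported by the browser session
    return {
        "browser": caps.get("browserName"),
        # Key names vary between Selenium versions, so try both spellings.
        "version": caps.get("version") or caps.get("browserVersion"),
        "platform": caps.get("platform") or caps.get("platformName"),
        # A differing User-Agent may be one reason a site answers differently.
        "user_agent": driver.execute_script("return navigator.userAgent"),
    }

# Usage inside the spider (hypothetical placement), after self.driver is created:
#     self.logger.info("driver info: %s", describe_driver(self.driver))
```

If the two logs differ in version or user agent, that would support the version-difference guess; if they match, the cause is probably elsewhere.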

Any ideas?

Spider returns different results from the local machine and Scrapy Cloud (PhantomJS + Selenium + Crawlera)

0 answers: