I have a LinkedIn spider. It runs fine on my local machine, but when I deploy it to Scrapinghub I get this error:
Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.
The full Scrapinghub log is:
0: 2018-08-30 12:58:34 INFO Log opened.
1: 2018-08-30 12:58:34 INFO [scrapy.log] Scrapy 1.0.5 started
2: 2018-08-30 12:58:34 INFO [scrapy.utils.log] Scrapy 1.0.5 started (bot: facebook_stats)
3: 2018-08-30 12:58:34 INFO [scrapy.utils.log] Optional features available: ssl, http11, boto
4: 2018-08-30 12:58:34 INFO [scrapy.utils.log] Overridden settings: {'NEWSPIDER_MODULE': 'facebook_stats.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['facebook_stats.spiders'], 'RETRY_TIMES': 10, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'BOT_NAME': 'facebook_stats', 'MEMUSAGE_LIMIT_MB': 950, 'DOWNLOAD_DELAY': 1, 'TELNETCONSOLE_HOST': '0.0.0.0', 'LOG_FILE': 'scrapy.log', 'MEMUSAGE_ENABLED': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
5: 2018-08-30 12:58:34 INFO [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/341545/3/9
6: 2018-08-30 12:58:34 INFO [scrapy.middleware] Enabled extensions: CoreStats, TelnetConsole, MemoryUsage, LogStats, StackTraceDump, CloseSpider, SpiderState, AutoThrottle, HubstorageExtension
7: 2018-08-30 12:58:35 INFO [scrapy.middleware] Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
8: 2018-08-30 12:58:35 INFO [scrapy.middleware] Enabled spider middlewares: HubstorageMiddleware, HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
9: 2018-08-30 12:58:35 INFO [scrapy.middleware] Enabled item pipelines: CreditCardsPipeline
10: 2018-08-30 12:58:35 INFO [scrapy.core.engine] Spider opened
11: 2018-08-30 12:58:36 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
12: 2018-08-30 12:58:36 INFO TelnetConsole starting on 6023
13: 2018-08-30 12:59:32 ERROR [scrapy.core.scraper] Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.
14: 2018-08-30 12:59:32 INFO [scrapy.core.engine] Closing spider (finished)
15: 2018-08-30 12:59:33 INFO [scrapy.statscollectors] Dumping Scrapy stats: More
16: 2018-08-30 12:59:34 INFO [scrapy.core.engine] Spider closed (finished)
17: 2018-08-30 12:59:34 INFO Main loop terminated.
How can I fix this?
Answer 0 (score: 2)
LinkedIn prohibits scraping:
Prohibited software and extensions
LinkedIn is committed to keeping its members' data secure and its site free of fraud and abuse. To protect member data and our site, we don't permit the use of any third-party software, including "crawlers", bots, browser plug-ins, or browser extensions (also called "add-ons"), that scrapes, modifies the appearance of, or automates activity on LinkedIn's site. These tools violate the User Agreement, including but not limited to many of the "Don'ts" listed in Section 8.2…
It stands to reason that they may actively refuse connections from Scrapinghub and similar services.