Running 2 RabbitMQ workers and 2 Scrapyd daemons on 2 local Ubuntu instances, where one of the RabbitMQ workers is not working

Time: 2017-09-11 01:15:57

Tags: django scrapy rabbitmq scrapyd

I am currently working on building a "Scrapy spider control panel" and am testing the existing solution available at Distributed Multi-User Scrapy System with a Web UI (https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI).

I am trying to run it on my local Ubuntu dev machines, but I am having problems with the scrapyd daemon. One of the workers, linkgenerator, is working, but the scraper worker (worker1) is not. I cannot figure out why scrapyd will not run on the other local instance.

Background information about the configuration:

The application comes bundled with Django, Scrapy, a MongoDB pipeline (for saving the scraped items) and a RabbitMQ-based Scrapy scheduler (for distributing links among the workers). I have 2 local Ubuntu instances: Django, MongoDB, a Scrapyd daemon and the RabbitMQ server run on Instance1, and another Scrapyd daemon runs on Instance2. RabbitMQ workers:

• linkgenerator
• worker1
IP configuration of the instances:

• IP of local Ubuntu Instance1: 192.168.0.101
• IP of local Ubuntu Instance2: 192.168.0.106

List of tools used:

• MongoDB server
• RabbitMQ server
• Scrapy Scrapyd API
• One RabbitMQ linkgenerator worker (worker name: linkgenerator) with Scrapy installed and a scrapyd daemon running on local Ubuntu Instance1 - server: 192.168.0.101
• Another RabbitMQ scraper worker (worker name: worker1) with Scrapy installed and a scrapyd daemon running on local Ubuntu Instance2 - server: 192.168.0.106

Instance1: 192.168.0.101

Instance1 runs Django, RabbitMQ and a scrapyd daemon - IP: 192.168.0.101

Instance2: 192.168.0.106

Scrapy is installed and a scrapyd daemon runs on Instance2
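
As a quick sanity check that both scrapyd daemons are reachable from Instance1, they can be queried over HTTP. This is only a minimal sketch using python-requests and the daemonstatus.json endpoint that is enabled in both scrapyd.conf files shown further down; the IPs are the ones listed above:

import requests

# Sketch only: ask each scrapyd daemon for its status.
for host in ('192.168.0.101', '192.168.0.106'):
    url = 'http://{}:6800/daemonstatus.json'.format(host)
    try:
        response = requests.get(url, timeout=5)
        print(host, response.json())  # expected something like {"status": "ok", "running": 0, ...}
    except requests.RequestException as exc:
        print(host, 'unreachable:', exc)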

Snapshot of the Scrapy control panel UI:

The snapshot shows the control panel view: there are two workers; linkgenerator worked successfully, but worker1 did not. Its logs are given at the end of the post.

RabbitMQ status information

The linkgenerator worker can successfully push messages to the RabbitMQ queue. The linkgenerator spider generates the start_urls for the scraper spider, which are consumed by the scraper worker (worker1), and this is the part that is not working. Please see the worker1 logs at the end of the post.
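
Since the links are handed from linkgenerator to worker1 through RabbitMQ, a small script run on Instance1 (where the broker lives) can confirm that messages are actually piling up in the queue. This is only a sketch using pika (which the project already uses); the queue name is a guess based on the project name and may differ in the real scheduler configuration:

import pika

# Sketch: passively declare the queue and print how many messages are waiting.
credentials = pika.PlainCredentials('guest', 'guest')  # RABBITMQ_USERNAME / RABBITMQ_PASSWORD
parameters = pika.ConnectionParameters(host='localhost', port=5672, credentials=credentials)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
# 'tester2_fda_trial20' is an assumed queue name -- replace it with the real one.
result = channel.queue_declare(queue='tester2_fda_trial20', passive=True)
print('messages waiting:', result.method.message_count)
connection.close()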

RabbitMQ settings

The following file contains the settings for MongoDB and RabbitMQ:

SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'

MONGODB_PUBLIC_ADDRESS = 'OneScience:27017'  # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017'  # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100

# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False  # Set this to True if you want a linux user account
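
RABBITMQ_HOST above is a hostname ('ScrapyDevU79') rather than an IP, so whichever machine runs a spider has to be able to resolve that name to Instance1 (192.168.0.101) in order to reach the broker. A minimal check for this (a sketch, run on each instance) could be:

import socket

# Sketch: verify that the RABBITMQ_HOST value from the settings resolves on this machine.
print(socket.gethostbyname('ScrapyDevU79'))  # ideally prints 192.168.0.101 on both instances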

linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20

scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20

scrapyd.conf settings for Instance1 (192.168.0.101):

cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

scrapyd.conf settings for Instance2 (192.168.0.106):

cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

RabbitMQ status

sudo service rabbitmq-server status

[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
   [{rabbitmq_shovel_management,
        "Management extension for the Shovel plugin","3.6.11"},
    {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
    {rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
    {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
    {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
    {rabbit,"RabbitMQ","3.6.11"},
    {os_mon,"CPO  CXC 138 46","2.2.14"},
    {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
    {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
    {ssl,"Erlang/OTP SSL application","5.3.2"},
    {public_key,"Public key infrastructure","0.21"},
    {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
    {crypto,"CRYPTO version 2","3.2"},
    {amqp_client,"RabbitMQ AMQP Client","3.6.11"},
    {rabbit_common,
        "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
        "3.6.11"},
    {inets,"INETS  CXC 138 49","5.9.7"},
    {mnesia,"MNESIA  CXC 138 12","4.11"},
    {compiler,"ERTS  CXC 138 10","4.9.4"},
    {xmerl,"XML parser","1.3.5"},
    {syntax_tools,"Syntax tools","1.6.12"},
    {asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
    {sasl,"SASL  CXC 138 11","2.3.4"},
    {stdlib,"ERTS  CXC 138 10","1.19.4"},
    {kernel,"ERTS  CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
   "Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
   [{connection_readers,0},
    {connection_writers,0},
    {connection_channels,0},
    {connection_other,6856},
    {queue_procs,145160},
    {queue_slave_procs,0},
    {plugins,1959248},
    {other_proc,22328920},
    {metrics,160112},
    {mgmt_db,655320},
    {mnesia,83952},
    {other_ets,2355800},
    {binary,96920},
    {msg_index,47352},
    {code,27101161},
    {atom,992409},
    {other_system,31074022},
    {total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
   [{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]

Status of the running scrapyd daemon on Instance1 (192.168.0.101):

scrapyd

2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"

Status of the running scrapyd daemon on Instance2 (192.168.0.106):

scrapyd

2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"

worker1 logs

I updated the code for the RabbitMQ server settings following the suggestion from @Tarun Lalwani.

The suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672. After updating as per Tarun Lalwani's suggestion (the change is sketched below), I ran into the following new problem:
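
A minimal sketch of the change, assuming it was made in the project settings file shown earlier:

# Point the scheduler at the RabbitMQ server on Instance1 by IP,
# per @Tarun Lalwani's suggestion, instead of a hostname / 127.0.0.1.
RABBITMQ_HOST = '192.168.0.101'
RABBITMQ_PORT = 5672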

2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
 'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
 'log_count/ERROR': 2,
 'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
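
The traceback above ends in a MongoDB authentication failure rather than a RabbitMQ error. A standalone check such as the following sketch, run from Instance2, shows whether the credentials and host are accepted at all; it assumes MongoDB is listening on Instance1 and reuses the MONGODB_USER / MONGODB_PASSWORD values from the settings file:

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, OperationFailure

# Sketch: authenticate against the admin database the same way the pipeline does.
# The host below is an assumption -- use whatever MONGODB_URI should point at from worker1.
uri = 'mongodb://tariq:toor@192.168.0.101:27017/admin'
try:
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    print(client.admin.command('ping'))  # {u'ok': 1.0} when connectivity and auth are fine
except (ConnectionFailure, OperationFailure) as exc:
    print('MongoDB check failed:', exc)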

MongoDBPipeline

# coding:utf-8

import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference
from scrapy.exporters import BaseItemExporter
try:
    from urllib.parse import quote
except:
    from urllib import quote

def not_set(string):
    """ Check if a string is None or ''

    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()

        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """

        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed

        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB

        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):

            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
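
For reference, plugging the settings shown earlier (MONGODB_URI = 'localhost:27017', MONGODB_USER = 'tariq', MONGODB_PASSWORD = 'toor') into the string concatenation in open_spider() above gives the connection string below; the resulting 'localhost' refers to whichever machine the spider process runs on, i.e. Instance2 in worker1's case:

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2, as in the pipeline above

# Worked example of the URI built by MongoDBPipeline.open_spider()
uri = 'mongodb://' + 'tariq' + ':' + quote('toor') + '@' + 'localhost:27017' + '/admin'
print(uri)  # -> mongodb://tariq:toor@localhost:27017/admin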

To sum it all up, I think the problem is that the scrapyd daemons are running on both instances, but somehow the scraper worker (worker1) cannot access them. I cannot figure out why, and I could not find any similar use case on Stack Overflow.

Any help in this regard would be highly appreciated. Thanks in advance!

0 Answers:

No answers yet