I am currently working on building a "Scrapy spiders control panel", and I am testing the existing solution available at https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI (Distributed Multi-User Scrapy System with a Web UI).

I am trying to run it on my local Ubuntu development machines, but I am having problems with the scrapyd daemon. One of the workers, linkgenerator, is working, but the scraper (worker1) is not, and I cannot figure out why scrapyd will not work for the other local instance.

Background information about the configuration:

The application bundles Django, Scrapy, a MongoDB pipeline (for saving the scraped items) and a RabbitMQ-based Scrapy scheduler (for distributing links among the workers). I have two local Ubuntu instances: Django, MongoDB, a scrapyd daemon and the RabbitMQ server run on Instance1, and another scrapyd daemon runs on Instance2. RabbitMQ workers:

linkgenerator
worker1

IP configuration of the instances:

IP of local Ubuntu Instance1: 192.168.0.101
IP of local Ubuntu Instance2: 192.168.0.106

List of tools used:

MongoDB server
RabbitMQ server
Scrapy
Scrapyd API
One RabbitMQ linkgenerator worker (worker name: linkgenerator) server, with Scrapy installed and the scrapyd daemon running, on local Ubuntu Instance1: 192.168.0.101
Another RabbitMQ scraper worker (worker name: worker1) server, with Scrapy installed and the scrapyd daemon running, on local Ubuntu Instance2: 192.168.0.106
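
Both scrapyd daemons can be checked over HTTP via the daemonstatus.json endpoint that is enabled in the scrapyd.conf files shown later; a minimal sketch (not part of the project code):

import requests

# Scrapyd daemons on the two Ubuntu instances listed above
SCRAPYD_NODES = [
    'http://192.168.0.101:6800',  # Instance1: linkgenerator
    'http://192.168.0.106:6800',  # Instance2: worker1 (scraper)
]

for node in SCRAPYD_NODES:
    try:
        # daemonstatus.json is enabled in the [services] section of scrapyd.conf
        resp = requests.get(node + '/daemonstatus.json', timeout=5)
        print('%s -> %s' % (node, resp.json()))
    except requests.RequestException as exc:
        print('%s unreachable: %s' % (node, exc))

The scrapyd logs further down already show 200 responses to listprojects.json on both daemons, so plain HTTP reachability does not seem to be the problem.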

Instance1:192.168.0.101

"Instance1" runs Django, RabbitMQ and a scrapyd daemon server - IP: 192.168.0.101

Instance2:192.168.0.106

Scrapy is installed on Instance2 and the scrapyd daemon is running there.

Snapshot of the Scrapy control panel UI:

From the snapshot, the control panel overview can be seen: there are two workers; linkgenerator worked successfully but worker1 did not. The logs are given at the end of the post.

RabbitMQ status information

The linkgenerator worker can successfully push messages to the RabbitMQ queue; the linkgenerator spider generates the start_urls for the scraper spider, which are then consumed by the scraper (worker1). That part does not work; please see worker1's logs at the end of the post.
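
To double-check that the links pushed by linkgenerator really end up in RabbitMQ, the queue depth can be inspected from Instance2 with pika; a minimal sketch (the queue name tester2_fda_trial20 is only my assumption based on the project name, the scheduler may use a different one):

import pika

# Credentials and host taken from the RabbitMQ settings shown below
credentials = pika.PlainCredentials('guest', 'guest')
params = pika.ConnectionParameters(host='192.168.0.101', port=5672,
                                   credentials=credentials)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# passive=True only inspects the queue; it raises an error if the queue does not exist.
# NOTE: the queue name is an assumption - adjust it to whatever the scheduler uses.
result = channel.queue_declare(queue='tester2_fda_trial20', passive=True)
print('messages waiting: %d' % result.method.message_count)

connection.close()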

RabbitMQ settings

The following file contains the settings for MongoDB and RabbitMQ:

SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'

MONGODB_PUBLIC_ADDRESS = 'OneScience:27017'  # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017'  # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100

# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False  # Set this to True if you want a linux user account
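
As far as I understand, the control panel ultimately starts the scraper through scrapyd's schedule.json endpoint, so worker1 can also be triggered by hand to rule out the UI layer; a minimal sketch (the spider name matches the one appearing in the worker1 log further down):

import requests

SCRAPER = 'http://192.168.0.106:6800'  # worker1, from SCRAPERS above

# schedule.json is a standard scrapyd endpoint (see [services] in scrapyd.conf)
resp = requests.post(SCRAPER + '/schedule.json', data={
    'project': 'tester2_fda_trial20',
    'spider': 'tester2_fda_trial20',
})
print(resp.json())  # expected: {"status": "ok", "jobid": "..."} when scheduling works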

linkgenerator scrapy.cfg settings:

[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20

scraper scrapy.cfg settings:

[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
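
To see where the project actually got deployed, both daemons can be queried with the listprojects.json and listversions.json endpoints from the scrapyd.conf files below; a minimal sketch:

import requests

for node in ('http://192.168.0.101:6800', 'http://192.168.0.106:6800'):
    projects = requests.get(node + '/listprojects.json', timeout=5).json()
    print('%s projects: %s' % (node, projects))
    versions = requests.get(node + '/listversions.json',
                            params={'project': 'tester2_fda_trial20'},
                            timeout=5).json()
    print('%s versions: %s' % (node, versions))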

scrapyd.conf file settings for Instance1 (192.168.0.101):
cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

scrapyd.conf file settings for Instance2 (192.168.0.106):
cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
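
Since bind_address is 0.0.0.0 and http_port is 6800 in both files, port 6800 should be reachable across the two machines; a plain socket test is enough to confirm that (a minimal sketch):

import socket

for host in ('192.168.0.101', '192.168.0.106'):
    try:
        # bind_address = 0.0.0.0 and http_port = 6800 in both scrapyd.conf files
        sock = socket.create_connection((host, 6800), timeout=3)
        sock.close()
        print('%s: port 6800 reachable' % host)
    except socket.error as exc:
        print('%s: port 6800 NOT reachable (%s)' % (host, exc))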

RabbitMQ status
sudo service rabbitmq-server status

[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
   [{rabbitmq_shovel_management,
        "Management extension for the Shovel plugin","3.6.11"},
    {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
    {rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
    {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
    {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
    {rabbit,"RabbitMQ","3.6.11"},
    {os_mon,"CPO  CXC 138 46","2.2.14"},
    {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
    {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
    {ssl,"Erlang/OTP SSL application","5.3.2"},
    {public_key,"Public key infrastructure","0.21"},
    {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
    {crypto,"CRYPTO version 2","3.2"},
    {amqp_client,"RabbitMQ AMQP Client","3.6.11"},
    {rabbit_common,
        "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
        "3.6.11"},
    {inets,"INETS  CXC 138 49","5.9.7"},
    {mnesia,"MNESIA  CXC 138 12","4.11"},
    {compiler,"ERTS  CXC 138 10","4.9.4"},
    {xmerl,"XML parser","1.3.5"},
    {syntax_tools,"Syntax tools","1.6.12"},
    {asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
    {sasl,"SASL  CXC 138 11","2.3.4"},
    {stdlib,"ERTS  CXC 138 10","1.19.4"},
    {kernel,"ERTS  CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
   "Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
   [{connection_readers,0},
    {connection_writers,0},
    {connection_channels,0},
    {connection_other,6856},
    {queue_procs,145160},
    {queue_slave_procs,0},
    {plugins,1959248},
    {other_proc,22328920},
    {metrics,160112},
    {mgmt_db,655320},
    {mnesia,83952},
    {other_ets,2355800},
    {binary,96920},
    {msg_index,47352},
    {code,27101161},
    {atom,992409},
    {other_system,31074022},
    {total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
   [{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]
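
The broker listens on port 5672 ("::"), so it should accept remote connections; what can still get in the way, as far as I know, is that RabbitMQ restricts the default guest user to localhost connections unless loopback_users has been changed. A quick pika connection test run on Instance2 shows whether guest/guest is accepted remotely (a minimal sketch):

import pika

# Run on Instance2 (192.168.0.106) against the broker on Instance1
credentials = pika.PlainCredentials('guest', 'guest')
params = pika.ConnectionParameters(host='192.168.0.101', port=5672,
                                   credentials=credentials)
try:
    connection = pika.BlockingConnection(params)
    print('connected to RabbitMQ on 192.168.0.101:5672')
    connection.close()
except pika.exceptions.AMQPError as exc:
    print('connection failed: %s' % exc)

In the worker1 log below pika does report "Created channel=1", so the AMQP connection itself seems fine after the IP change.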

Running status of the scrapyd daemon on Instance1 (192.168.0.101):
scrapyd

2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"

Running status of the scrapyd daemon on Instance2 (192.168.0.106):
scrapyd

2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"

worker1 logs

The code was updated with the RabbitMQ server settings, following the suggestion made by @Tarun Lalwani.

The suggestion was to use the RabbitMQ server IP, 192.168.0.101:5672, instead of 127.0.0.1:5672. After I made that update as Tarun Lalwani suggested, I ran into the following new problem:

2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
 'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
 'log_count/ERROR': 2,
 'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
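
The traceback ends in a MongoDB authentication failure rather than a RabbitMQ error. The same login can be reproduced outside of Scrapy, on the machine where worker1 runs, by building the URI the same way the pipeline's open_spider does with the values from settings.py (a minimal sketch):

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, OperationFailure

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

# Same values as MONGODB_USER / MONGODB_PASSWORD / MONGODB_URI in settings.py;
# note that on Instance2 'localhost' refers to Instance2 itself.
user, password, host = 'tariq', 'toor', 'localhost:27017'
uri = 'mongodb://%s:%s@%s/admin' % (user, quote(password), host)

try:
    client = MongoClient(uri)
    client.admin.command('ping')  # forces the connection and the authentication
    print('MongoDB authentication OK')
except (ConnectionFailure, OperationFailure) as exc:
    print('MongoDB auth/connection failed: %s' % exc)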

MongoDBPipeline

# coding:utf-8

import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference
from scrapy.exporters import BaseItemExporter
try:
    from urllib.parse import quote
except:
    from urllib import quote

def not_set(string):
    """ Check if a string is None or ''

    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()

        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """

        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed

        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB

        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):

            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)

To sum up, I think the problem lies with the scrapyd daemons running on the two instances, but somehow the scraper (worker1) cannot access them. I cannot figure it out, and I have not found any similar use case on Stack Overflow.

Any help in this regard is highly appreciated. Thank you in advance!