I have 20 spiders in one project. Each spider has a different task and different URLs to crawl, but the data is similar, so they all share the same items.py and pipelines.py. In my pipeline class, I want the given spider to stop crawling once a certain condition is met.

I have tested

raise DropItem("terminated by me")

and

raise CloseSpider('terminate by me')

but both of them only stop the item currently being processed, and the next_page URL is still being crawled!
Part of my pipelines.py:

import pymongo
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import CloseSpider, DropItem


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # my two attempts at stopping the spider:
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
and my spider:

import scrapy
import json
from Maio.items import MaioItem


class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)
    def start_requests(self):
        request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            # NOTE: next_pdate is not defined yet at this point; it is only set later in parse
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }
        # for body in request_body:
        #     request_body = body
        request_body = json.dumps(request_body)
        print(request_body)
        yield scrapy.Request(url='https://search.Maio.io/json/',
                             method="POST",
                             body=request_body,
                             callback=self.parse,
                             headers={'Content-type': 'application/json; charset=UTF-8'}
                             )
    def parse(self, response):
        print(response)
        # print(response.body.decode('utf-8'))
        input = response.body.decode('utf-8')
        result = json.loads(input)
        # for key, item in result["result"]:
        #     print(key)
        next_pdate = result["result"]["last_post_date"]
        print(result["result"]["last_post_date"])
        for item in result["result"]["post_list"]:
            print("title : {0}".format(item["title"]))
            ads = MaioItem()
            ads['title'] = item["title"]
            ads['desc'] = item["desc"]
            yield ads
        if next_pdate:
            request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
            }
            request_body = json.dumps(request_body)
            yield scrapy.Request(url='https://search.Maio.io/json/',
                                 method="POST",
                                 body=request_body,
                                 callback=self.parse,
                                 headers={'Content-type': 'application/json; charset=UTF-8'}
                                 )
**UPDATE**

Even when I put sys.exit("SHUT DOWN EVERYTHING!") in the pipeline, the next page keeps running. I see the following log for every page that runs:

sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!
Answer 0 (score: 2)

OK, then you can use the CloseSpider exception:

from scrapy.exceptions import CloseSpider

# condition
raise CloseSpider("message")
Answer 1 (score: 1)

If you want to stop a spider from a pipeline, you can call the engine's close_spider() function.
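A minimal sketch of how that call might look inside process_item, assuming a Scrapy version where the spider carries a reference to its crawler via spider.crawler (the stop condition is a hypothetical placeholder):

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # hypothetical condition; substitute whatever should stop the crawl
        if not item.get('title'):
            # ask the engine to close this specific spider gracefully;
            # requests already in flight may still be processed
            spider.crawler.engine.close_spider(spider, reason='terminated by pipeline')
        return item

Because this pipeline is shared by all 20 spiders, the spider argument lets the condition target only the spider that should stop.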
Answer 2 (score: 0)

Why not just use this?

import sys

# with some condition
sys.exit("Closing the spider")
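As the update in the question shows, this does not actually stop the crawl: Scrapy runs callbacks inside the Twisted reactor, which catches the SystemExit raised by sys.exit() and logs it as a failure, and the engine then carries on with the remaining scheduled requests.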