Force a spider to stop in Scrapy

Time: 2017-10-14 21:47:23

Tags: python scrapy scrapy-spider

I have 20 spiders in one project, each with a different task and different URLs to crawl (the data is similar, so I use a shared items.py and pipelines.py for all of them). In my pipeline class I want a given spider to stop crawling once a specified condition is met. I have been testing

  raise DropItem("terminated by me")

  raise CloseSpider('terminate by me')

But both of them only stop the currently running spider, and the next_page URL is still being crawled!!!

Part of my pipelines.py:

import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import CloseSpider, DropItem


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")

        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item

and my spider:

import scrapy
import json
from Maio.items import MaioItem

class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)

def start_requests(self):

    request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
    }
    # for body in request_body:
    #     request_body = body

    request_body = json.dumps(request_body)
    print(request_body)
    yield scrapy.Request(url='https://search.Maio.io/json/',
                         method="POST",
                         body=request_body,
                         callback = self.parse,
                         headers={'Content-type': 'application/json; charset=UTF-8'}
                         )

def parse(self, response):

    print(response)
    # print(response.body.decode('utf-8'))
    input = response.body.decode('utf-8')
    result = json.loads(input)
    # for key,item in result["result"]:
    #     print(key)
    next_pdate = result["result"]["last_post_date"]
    print(result["result"]["last_post_date"])
    for item in result["result"]["post_list"]:
        print("title : {0}".format(item["title"]))
        ads = MaioItem()
        ads['title'] = item["title"]
        ads['desc'] = item["desc"]
        yield ads
    if next_pdate:
        request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }

        request_body = json.dumps(request_body)
        yield scrapy.Request(url='https://search.Maio.io/json/',
                             method="POST",
                             body=request_body,
                             callback=self.parse,
                             headers={'Content-type': 'application/json; charset=UTF-8'}
                             )

**UPDATE**

Even when I put sys.exit("SHUT DOWN EVERYTHING!") into the pipeline, the next page is still crawled.

I see the following log on every page that runs:

 sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!

3 Answers:

Answer 0 (score: 2)

OK, in that case you can use the CloseSpider exception. Note that it takes effect when raised from a spider callback such as parse, not from a pipeline.

from scrapy.exceptions import CloseSpider
# condition
raise CloseSpider("message")

Answer 1 (score: 1)

If you want to stop a spider from a pipeline, you can call the engine's close_spider() function:

close_spider()
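
A minimal sketch of how that could look inside the pipeline, assuming the pipeline receives the crawler through from_crawler; should_stop() is a hypothetical placeholder for your own condition:

from scrapy.exceptions import DropItem


class MongoDBPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # keep a reference to the crawler so the engine is reachable from process_item
        pipeline = cls()
        pipeline.crawler = crawler
        return pipeline

    def process_item(self, item, spider):
        # hypothetical placeholder: replace with your own stop condition
        if should_stop(item):
            self.crawler.engine.close_spider(spider, 'terminated by pipeline')
            raise DropItem('spider is shutting down')
        return item

Note that the shutdown is graceful, so a request already in flight may still be processed before the spider actually closes.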

Answer 2 (score: 0)

Why not just use this:

import sys

# with some condition
sys.exit("Closing the spider")