I have 20 spiders in one project. Each spider has a different task and different URLs to crawl, but the data is similar, so they all share the same items.py and pipelines.py. In my pipeline class, I want the given spider to stop crawling once a certain condition is met.

I have tested

raise DropItem("terminated by me")

and

raise CloseSpider('terminate by me')

but both of them only stop the item currently being processed, and the next_page URL is still being crawled!
Part of my pipelines.py:

import pymongo
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import CloseSpider, DropItem


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # my two attempts at stopping the spider:
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
and my spider:

import scrapy
import json
from Maio.items import MaioItem


class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)
    def start_requests(self):
        request_body = {
            "id": 2,
            "jsonrpc": "2.0",
            "method": "getlist",
            # NOTE: next_pdate is not defined yet at this point; it is only set later in parse
            "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }
        # for body in request_body:
        #     request_body = body
        request_body = json.dumps(request_body)
        print(request_body)
        yield scrapy.Request(url='https://search.Maio.io/json/',
                             method="POST",
                             body=request_body,
                             callback=self.parse,
                             headers={'Content-type': 'application/json; charset=UTF-8'}
                             )
    def parse(self, response):
        print(response)
        # print(response.body.decode('utf-8'))
        input = response.body.decode('utf-8')
        result = json.loads(input)
        # for key, item in result["result"]:
        #     print(key)
        next_pdate = result["result"]["last_post_date"]
        print(result["result"]["last_post_date"])
        for item in result["result"]["post_list"]:
            print("title : {0}".format(item["title"]))
            ads = MaioItem()
            ads['title'] = item["title"]
            ads['desc'] = item["desc"]
            yield ads
        if next_pdate:
            request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
            }
            request_body = json.dumps(request_body)
            yield scrapy.Request(url='https://search.Maio.io/json/',
                                 method="POST",
                                 body=request_body,
                                 callback=self.parse,
                                 headers={'Content-type': 'application/json; charset=UTF-8'}
                                 )
**UPDATE**

Even when I put sys.exit("SHUT DOWN EVERYTHING!") in the pipeline, the next page keeps running. I see the following log for every page that runs:

sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!
Answer 0 (score: 2)

OK, then you can use the CloseSpider exception:

from scrapy.exceptions import CloseSpider

# condition
raise CloseSpider("message")
Answer 1 (score: 1)

If you want to stop a spider from a pipeline, you can call the engine's close_spider() function.
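A minimal sketch of how that call might look inside process_item, assuming a Scrapy version where the spider carries a reference to its crawler via spider.crawler (the stop condition is a hypothetical placeholder):

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # hypothetical condition; substitute whatever should stop the crawl
        if not item.get('title'):
            # ask the engine to close this specific spider gracefully;
            # requests already in flight may still be processed
            spider.crawler.engine.close_spider(spider, reason='terminated by pipeline')
        return item

Because this pipeline is shared by all 20 spiders, the spider argument lets the condition target only the spider that should stop.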
Answer 2 (score: 0)

Why not just use this?

import sys

# with some condition
sys.exit("Closing the spider")
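As the update in the question shows, this does not actually stop the crawl: Scrapy runs callbacks inside the Twisted reactor, which catches the SystemExit raised by sys.exit() and logs it as a failure, and the engine then carries on with the remaining scheduled requests.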