I am using Scrapy to collect some data. My Scrapy program collects 100 items in one session. I need to limit it to 50, or some arbitrary number. How can I do that? Any solution is welcome. Thanks in advance.
# -*- coding: utf-8 -*-
import re
import scrapy

class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    start_urls = [
        "http://raleigh.craigslist.org/search/bab"
    ]
    BASE_URL = 'http://raleigh.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/ral/bab/" + item_id
            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            item["tag"] = "".join(response.xpath("//p[@class='attrgroup']/span/b/text()").extract()[0])
            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item
Answer 0 (score: 2)
This is exactly what the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are for:
An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed through the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by the number of passed items.
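A minimal sketch of how this could look for the spider in the question (the setting could equally be put in the project's settings.py; the cutoff of 50 is just the number from the question):

# -*- coding: utf-8 -*-
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    # Ask the CloseSpider extension to stop the spider once 50 items
    # have been passed through the item pipeline.
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 50,
    }
    # ... the rest of the spider stays as above ...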
Answer 1 (score: 0)
I tried alecxe's answer, but I had to combine all three limits to make it work, so I'm leaving it here in case anyone else runs into the same problem:
class GenericWebsiteSpider(scrapy.Spider):
    """This generic website spider extracts text from websites"""
    name = "generic_website"
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 15,
        'CONCURRENT_REQUESTS': 15,
        'CLOSESPIDER_ITEMCOUNT': 15
    }
    ...
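Note that the CloseSpider extension shuts the spider down gracefully: requests already in flight are still processed, so the final item count can overshoot CLOSESPIDER_ITEMCOUNT. That is likely why lowering CONCURRENT_REQUESTS alongside it helps here. Another option is to count items in the spider itself and raise the CloseSpider exception once the limit is reached; a sketch of that idea is below (the LimitedSpider name, the item_limit attribute, and the counter are hypothetical, not part of the original code, and the shutdown is still graceful):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    item_limit = 50  # hypothetical cap, not from the original code

    def __init__(self, *args, **kwargs):
        super(LimitedSpider, self).__init__(*args, **kwargs)
        self.items_seen = 0

    def parse(self, response):
        # Stop producing new items and requests once the cap is reached.
        if self.items_seen >= self.item_limit:
            raise CloseSpider('item limit reached')
        self.items_seen += 1
        yield {'url': response.url}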