I am building a simple scraper to get 9gag posts and their images, but due to some technical difficulties I can't stop the scraper and it keeps scraping content I don't want. I want to increment a counter value and stop after 100 posts. But the 9gag page is designed so that each response serves only 10 posts, and after each iteration my counter value resets to 10; as a result my loop runs for a very long time and never stops.
# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count
Here is the items.py code:
from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increment a global count value, and when I tried doing that by passing three arguments to the parse function it gave this error:

TypeError: parse() takes exactly 3 arguments (2 given)

So is there a way to pass a global count value, return it after each iteration, and stop after 100 posts (say)?
The whole project is available here on Github. Even if I set POST_LIMIT = 100, the infinite loop happens; see the command executed here:

scrapy crawl first -s POST_LIMIT=10 --output=output.json
Answer 0 (score: 5)
There is a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in the settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat is that if you have enabled caching, it will count cache hits as page counts as well.
Answer 1 (score: 4)
First: use self.count and initialize it outside of parse. Then, instead of blocking the parsing of items, generate new requests. See the following code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
Answer 2 (score: 0)
count is local to the parse() method, so it is not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.
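The difference is easy to see even outside Scrapy. Here is a toy stand-in for a spider (PageParser and its inputs are made up for illustration): the instance attribute keeps its value across calls to parse(), whereas a local count would start from 0 on every call.

```python
class PageParser(object):
    """Toy stand-in for a spider: parse() is called once per response."""
    COUNT_MAX = 100

    def __init__(self):
        self.count = 0  # instance variable: survives across parse() calls

    def parse(self, articles):
        for article in articles:
            self.count += 1  # keeps incrementing across pages
            yield article


parser = PageParser()
list(parser.parse(range(10)))  # first "page" of 10 articles
list(parser.parse(range(10)))  # second "page" of 10 articles
print(parser.count)  # 20: the counter persisted between calls
```

With a local variable instead of self.count, the second call would report 10 again, which is exactly the reset the question describes.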
Answer 3 (score: 0)
Pass spider arguments through the crawl command using the -a option (check the link).
Answer 4 (score: 0)

One can use custom_settings with CLOSESPIDER_PAGECOUNT as shown below:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': COUNT_MAX
    }

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            yield ninegag_item

        next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)