Hello, and thanks in advance!
When I run Scrapy and export the items to a .json file, I get not only the fields I want but also some junk.
The unwanted fields are download_latency, download_timeout, depth and download_slot.
import scrapy

class LibresSpider(scrapy.Spider):
    name = 'libres'
    allowed_domains = ['www.todostuslibros.com']
    start_urls = ['https://www.todostuslibros.com/mas_vendidos/']

    def parse(self, response):
        for tfg in response.css('li.row-fluid'):
            doc = {}
            data = tfg.css('book-basics')
            doc['titulo'] = tfg.css('h2 a::text').extract_first()
            doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

            yield scrapy.Request(doc['url'], callback=self.parse_detail, meta=doc)

        next = response.css('a.next::attr(href)').extract_first()
        if next is not None:
            next_page = response.urljoin(next)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_detail(self, response):

        detail = response.meta
        detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
        detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())

        yield detail

I know the unwanted data arrives attached to the response (through response.meta in parse_detail), but I would like to know how to keep it out of the JSON output.
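Answer 0 (score: 0)
Please use a more descriptive title to help others who may have the same problem; "junk" is a very vague word.
You can find more information about the meta attribute in the Scrapy documentation for Request.meta:
"A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled."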
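To make the effect concrete, here is a minimal sketch in plain Python that does not use Scrapy at all. The function simulate_scrapy_meta is a hypothetical stand-in for the components that write bookkeeping keys into meta, and the values it adds are illustrative only. It shows why an item that is used directly as meta comes back mixed with Scrapy's own keys, while an item stored under a single key stays clean.

```python
# Hypothetical stand-in for Scrapy's downloader and middlewares: it copies
# the meta dict it receives and adds the bookkeeping keys seen in the question.
def simulate_scrapy_meta(meta):
    populated = dict(meta)
    populated.update({
        'depth': 1,  # illustrative values only
        'download_timeout': 180.0,
        'download_slot': 'www.todostuslibros.com',
        'download_latency': 0.25,
    })
    return populated


item = {'titulo': 'Example title', 'url': 'https://www.todostuslibros.com/'}

# Pattern from the question: the item itself is used as meta.
polluted = simulate_scrapy_meta(item)

# Pattern that keeps them apart: the item sits under a single key.
nested = simulate_scrapy_meta({'detail': item})

print(sorted(polluted))          # item fields mixed with Scrapy's keys
print(sorted(nested['detail']))  # only the item fields
```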
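If you want to keep all of this Scrapy-populated information out of your JSON, store the item under its own key in meta instead of using the item as the meta dict itself: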
# Both functions are methods of LibresSpider.
def parse(self, response):
    for tfg in response.css('li.row-fluid'):
        doc = {}
        doc['titulo'] = tfg.css('h2 a::text').extract_first()
        doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

        # Keep the item under a single key so that Scrapy's own meta keys
        # (depth, download_slot, ...) never end up inside it.
        request = scrapy.Request(doc['url'], callback=self.parse_detail)
        request.meta['detail'] = doc
        yield request

    # Renamed from "next" to avoid shadowing the built-in function.
    next_href = response.css('a.next::attr(href)').extract_first()
    if next_href is not None:
        next_page = response.urljoin(next_href)
        yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
    # Read back only the item, not the whole meta dict.
    detail = response.meta['detail']
    detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
    detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())

    yield detail