How to keep unwanted fields out of the output (Scrapy)

时间:2018-11-14 22:48:34

标签: python scrapy web-crawler

Hello, and thank you!

When I run Scrapy I dump the items into a .json file, but along with the items I want I also get some junk:

download_latency, download_timeout, depth and download_slot are the unwanted ones

 1 import scrapy
 2
 3 class LibresSpider(scrapy.Spider):
 4     name = 'libres'
 5     allowed_domains = ['www.todostuslibros.com']
 6     start_urls = ['https://www.todostuslibros.com/mas_vendidos/']
 7
 8     def parse(self, response):
 9         for tfg in response.css('li.row-fluid'):
10             doc={}
11             data = tfg.css('book-basics')
12             doc['titulo'] = tfg.css('h2 a::text').extract_first()
13             doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())
14
15             yield scrapy.Request(doc['url'], callback=self.parse_detail, meta=doc)
16
17         next = response.css('a.next::attr(href)').extract_first()
18         if next is not None:
19            next_page = response.urljoin(next)
20            yield scrapy.Request(next_page, callback=self.parse)
21
22     def parse_detail(self, response):
23
24         detail = response.meta
25         detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
26         detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
27
28         yield detail

I know that this unwanted data comes along with the response (line 26), but I would like to know how to keep it from ending up in the JSON.
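For context, those extra keys are injected into request.meta by Scrapy's own middlewares, so `detail = response.meta` copies them along with your own fields. A minimal Scrapy-free sketch of the problem and one fix (the values below are made up; only the key names match what Scrapy typically injects):

```python
# Hypothetical snapshot of response.meta after Scrapy's middlewares have
# run; only 'titulo' and 'url' were put there by the spider itself.
meta = {
    'titulo': 'Cien años de soledad',
    'url': 'https://www.todostuslibros.com/libros/cien-anos-de-soledad',
    'download_latency': 0.31,                   # added by a download middleware
    'download_timeout': 180.0,                  # added by a download middleware
    'depth': 1,                                 # added by DepthMiddleware
    'download_slot': 'www.todostuslibros.com',  # added by the downloader
}

# Keep only the fields the spider created, dropping Scrapy's bookkeeping.
WANTED = ('titulo', 'url')
detail = {key: meta[key] for key in WANTED if key in meta}
```

Here `detail` ends up containing only `titulo` and `url`, which is what should reach the JSON feed.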

1 answer:

Answer 0 (score: 0)

Please use a more explicit title to help others who may have the same concern; "junk" is a very vague word.

You can find more information about the meta attribute in the Scrapy documentation here:

  A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

If you want to keep all of this information that Scrapy populates out of your JSON, you can do the following:

def parse(self, response):
  for tfg in response.css('li.row-fluid'):
    doc={}
    data = tfg.css('book-basics')
    doc['titulo'] = tfg.css('h2 a::text').extract_first()
    doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

    request = scrapy.Request(doc['url'], callback=self.parse_detail)
    request.meta['detail'] = doc
    yield request

  next = response.css('a.next::attr(href)').extract_first()
  if next is not None:
    next_page = response.urljoin(next)
    yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
  detail = response.meta['detail']
  detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
  detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
  yield detail