How to keep unwanted fields out of the output (Scrapy)

时间:2018-11-14 22:48:34

标签: python scrapy web-crawler

Hello, and thank you!

When I run Scrapy I dump the items into a .json file, but along with the items I want I also get some junk:

download_latency, download_timeout, depth and download_slot are the unwanted ones

 1 import scrapy
 2
 3 class LibresSpider(scrapy.Spider):
 4     name = 'libres'
 5     allowed_domains = ['www.todostuslibros.com']
 6     start_urls = ['https://www.todostuslibros.com/mas_vendidos/']
 7
 8     def parse(self, response):
 9         for tfg in response.css('li.row-fluid'):
10             doc={}
11             data = tfg.css('book-basics')
12             doc['titulo'] = tfg.css('h2 a::text').extract_first()
13             doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())
14
15             yield scrapy.Request(doc['url'], callback=self.parse_detail, meta=doc)
16
17         next = response.css('a.next::attr(href)').extract_first()
18         if next is not None:
19            next_page = response.urljoin(next)
20            yield scrapy.Request(next_page, callback=self.parse)
21
22     def parse_detail(self, response):
23
24         detail = response.meta
25         detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
26         detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
27
28         yield detail

I know that this unwanted data comes along with the response (line 26), but I would like to know how to keep it from ending up in the JSON.
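For context, those extra keys are injected into request.meta by Scrapy's own middlewares, so `detail = response.meta` copies them along with your own fields. A minimal Scrapy-free sketch of the problem and one fix (the values below are made up; only the key names match what Scrapy typically injects):

```python
# Hypothetical snapshot of response.meta after Scrapy's middlewares have
# run; only 'titulo' and 'url' were put there by the spider itself.
meta = {
    'titulo': 'Cien años de soledad',
    'url': 'https://www.todostuslibros.com/libros/cien-anos-de-soledad',
    'download_latency': 0.31,                   # added by a download middleware
    'download_timeout': 180.0,                  # added by a download middleware
    'depth': 1,                                 # added by DepthMiddleware
    'download_slot': 'www.todostuslibros.com',  # added by the downloader
}

# Keep only the fields the spider created, dropping Scrapy's bookkeeping.
WANTED = ('titulo', 'url')
detail = {key: meta[key] for key in WANTED if key in meta}
```

Here `detail` ends up containing only `titulo` and `url`, which is what should reach the JSON feed.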

1 answer:

Answer 0 (score: 0)

Please use a more explicit title to help others who may have the same concern; "junk" is a very vague word.

You can find more information about the meta attribute in the Scrapy documentation here:

  A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

If you want to keep all of this information that Scrapy populates out of your JSON, you can do the following:

def parse(self, response):
  for tfg in response.css('li.row-fluid'):
    doc={}
    data = tfg.css('book-basics')
    doc['titulo'] = tfg.css('h2 a::text').extract_first()
    doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

    request = scrapy.Request(doc['url'], callback=self.parse_detail)
    request.meta['detail'] = doc
    yield request

  next = response.css('a.next::attr(href)').extract_first()
  if next is not None:
    next_page = response.urljoin(next)
    yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
  detail = response.meta['detail']
  detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
  detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
  yield detail