Question

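I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the title, due date, and details for each row. The title and due date are immediately available on the page...

But the details themselves aren't in the table. Instead, there is a link to the page containing the details (if that doesn't make sense, here's a table):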

|--------------------------------|----------------|
|             Title              |    Due Date    |
|--------------------------------|----------------|
| Job Title (Clickable Link)     |    1/1/2012    |
| Other Job (Link)               |    3/2/2012    |
|--------------------------------|----------------|

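Even after reading the CrawlSpider section of the Scrapy documentation, I'm still not sure how to pass the item around logistically with callbacks and requests.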

Answer 1

Please read the docs first to understand what I'm saying.

The answer:

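To scrape additional fields that are on other pages: in a parse method, extract the URL of the page with the additional info, create and return from that parse method a Request object with that URL, and pass the already extracted data via its meta parameter.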

See also: How do I merge results from target page to current page in Scrapy?

Answer 2

An example from the Scrapy documentation:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                     callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

Answer 3

You can also use Python's functools.partial to pass an item or any other serializable data to the next Scrapy callback through additional arguments.

Something like:

import functools

from scrapy import Request

# Inside your Spider class:

def parse(self, response):
  # ...
  # Process the first response here, populate item and next_url.
  # ...
  callback = functools.partial(self.parse_next, item, someotherarg)
  return Request(next_url, callback=callback)

def parse_next(self, item, someotherarg, response):
  # ...
  # Process the second response here.
  # ...
  return item
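To see why the argument order in parse_next matters, here is a small Scrapy-free sketch of what functools.partial does. `FakeSpider`, the `"note"` value, and the string standing in for a response are made up for the demonstration; the only assumption about Scrapy is that it calls the callback with the response as the single positional argument.

```python
import functools


class FakeSpider:
    """Stand-in for a spider class; no Scrapy import is needed for this demo."""

    def parse_next(self, item, someotherarg, response):
        # The first two arguments were bound ahead of time by partial();
        # only `response` is supplied at call time.
        item["details"] = f"{someotherarg}: {response}"
        return item


spider = FakeSpider()
item = {"title": "Job Title"}

# Bind the leading positional arguments now.
callback = functools.partial(spider.parse_next, item, "note")

# Simulate the framework calling callback(response).
result = callback("<response body>")
print(result)
```

Because partial fills positional parameters from the left, the pre-bound arguments have to come before `response` in the callback's signature.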

Scrapy: Follow link to get additional Item data?
