Question

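I have a Spider that scrapes data which cannot be saved in a single item class.

To illustrate, I have one Profile Item, and each Profile Item might have an unknown number of Comments. That is why I want to implement a Profile Item and a Comment Item. I know I can pass both of them to my pipeline simply by using yield.

However, I do not know how a pipeline with one parse_item function can handle two different item classes.
Or is it possible to use different parse_item functions?
Or do I have to use several pipelines?
Or is it possible to write an Iterator to a Scrapy Item field?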

comments_list = []
comments = response.xpath(somexpath)
for x in comments.extract():
    comments_list.append(x)
ScrapyItem['comments'] = comments_list

Answer 1

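By default, every item goes through every pipeline.

For instance, if you yield a ProfileItem and a CommentItem, they will both go through all pipelines. If you have a pipeline set up to track item types, then your process_item method could look like this: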

def process_item(self, item, spider):
    self.stats.inc_value('typecount/%s' % type(item).__name__)
    return item

When a ProfileItem comes through, 'typecount/ProfileItem' is incremented. When a CommentItem comes through, 'typecount/CommentItem' is incremented.
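The counting behaviour can be tried without running a crawl. The following is only a minimal sketch: a `collections.Counter` stands in for Scrapy's stats collector, and the two dataclasses are hypothetical stand-ins for the item classes from the question.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProfileItem:  # hypothetical stand-in for the question's Profile Item
    name: str


@dataclass
class CommentItem:  # hypothetical stand-in for the question's Comment Item
    text: str


class TypeCountPipeline:
    def __init__(self):
        # Assumption: a Counter replaces Scrapy's stats collector in this sketch.
        self.counts = Counter()

    def process_item(self, item, spider):
        # Same key format as the snippet above: 'typecount/<ClassName>'.
        self.counts['typecount/%s' % type(item).__name__] += 1
        return item


pipeline = TypeCountPipeline()
for item in [ProfileItem('alice'), CommentItem('hi'), CommentItem('bye')]:
    pipeline.process_item(item, spider=None)
print(dict(pipeline.counts))
```

Each item class ends up with its own counter key, regardless of how many item types the spider yields.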

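However, if the handling is specific to one item type, you can have a pipeline handle only that type of item by checking the item's type before proceeding: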

def process_item(self, item, spider):
    if not isinstance(item, ProfileItem):
        return item
    # Handle your Profile Item here.
    return item

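If you set up the two process_item methods above in different pipelines, each item will go through both of them, being tracked by the first and processed (or ignored) by the second.

Additionally, you could have one pipeline set up to handle all 'related' items: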

def process_item(self, item, spider):
    if isinstance(item, ProfileItem):
        return self.handleProfile(item, spider)
    if isinstance(item, CommentItem):
        return self.handleComment(item, spider)
    return item  # pass any other item type through unchanged

def handleComment(self, item, spider):
    # Handle Comment here, return item
    return item

def handleProfile(self, item, spider):
    # Handle profile here, return item
    return item

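Or, you could make it even more complex and develop a type delegation system that loads classes and calls default handler methods, similar to how Scrapy handles middleware/pipelines. It really depends on how complex you need it to be and what you want to do.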
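As a rough illustration of the delegation idea, a pipeline can keep a mapping from item class to handler method instead of a chain of isinstance checks. This is a minimal sketch, not Scrapy's own mechanism: the dataclasses, the `handlers` dict, and the `seen` list are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class ProfileItem:  # hypothetical item class
    name: str


@dataclass
class CommentItem:  # hypothetical item class
    text: str


class DelegatingPipeline:
    def __init__(self):
        # Map each item class to the method that handles it.
        self.handlers = {
            ProfileItem: self.handle_profile,
            CommentItem: self.handle_comment,
        }
        self.seen = []  # records what was handled, for demonstration only

    def process_item(self, item, spider):
        handler = self.handlers.get(type(item))
        if handler is None:
            return item  # unknown types pass through untouched
        return handler(item, spider)

    def handle_profile(self, item, spider):
        self.seen.append(('profile', item.name))
        return item

    def handle_comment(self, item, spider):
        self.seen.append(('comment', item.text))
        return item


pipeline = DelegatingPipeline()
for item in [ProfileItem('alice'), CommentItem('hi')]:
    pipeline.process_item(item, spider=None)
print(pipeline.seen)
```

Looking up `type(item)` matches exact classes only; if subclasses of an item should share a handler, an isinstance loop over the mapping would be needed instead.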

Answer 2

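Defining multiple items is tricky when you export your data if the items are related (for example, 1 Profile to N Comments) and you have to export them together, because the pipelines process each item at a different time. An alternative approach for this scenario is to define a custom Scrapy Field, for example: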

import scrapy

class ProfileField(scrapy.item.Field):
    # your business here
    pass

# ProfileField must be defined before the item class that uses it.
class CommentItem(scrapy.Item):
    profile = ProfileField()

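But given a scenario where you MUST have 2 items, it is highly recommended to use a different pipeline for each of these item types, as well as different exporter instances, so that you get this information in different files (if you are using files):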

settings.py

ITEM_PIPELINES = {
    'pipelines.CommentsPipeline': 1,
    'pipelines.ProfilePipeline': 1,
}

pipelines.py

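class CommentsPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CommentItem):
            # Your business here
            pass
        return item  # always hand the item on to the next pipeline

class ProfilePipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProfileItem):
            # Your business here
            pass
        return item  # always hand the item on to the next pipeline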

Answer 3

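The straightforward way is to have the parser include two sub-parsers, one for each data type. The main parser determines the type from the input and passes the string to the appropriate subroutine.

A second approach is to run the parsers in sequence: one parses Profiles and ignores everything else; the second parses Comments and ignores everything else (same principle as above).

Does this move you forward?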

Answer 4

I came up with this solution.

I created an ITEMS setting in the settings.py file:

ITEMS = {
    'project.items.Item1': {
        'filename': 'item1',
    },
    'project.items.Item2': {
        'filename': 'item2',
    },
}

In the pipeline.py file, I imported the settings:

from scrapy.utils.project import get_project_settings

In the open_spider method, create a file for each item setting and attach an exporter to it:

for settings_key in self.settings.keys():
    filename = os.path.join(f"output/{self.settings[settings_key]['filename']}_{self.dt}.csv")
    self.settings[settings_key]['file'] = open(filename, 'wb')
    self.settings[settings_key]['exporter'] = CsvItemExporter(
        self.settings[settings_key]['file'], 
        encoding='utf-8', 
        delimiter=';', 
        quoting=csv.QUOTE_NONNUMERIC
    )
    self.settings[settings_key]['exporter'].start_exporting()

In the close_spider method, stop all exporters and close all files:

for settings_key in self.settings.keys():
    self.settings[settings_key]['exporter'].finish_exporting()
    self.settings[settings_key]['file'].close()

In the process_item method, just pick the appropriate exporter for the item and export it:

item_class = f"{type(item).__module__}.{type(item).__name__}"
settings_item = self.settings.get(item_class)
if settings_item:
    settings_item['exporter'].export_item(item)
return item
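To see how the three fragments fit together in one class, here is a self-contained sketch of the same per-type routing. It is not the answer's exact code: the standard-library `csv.DictWriter` is swapped in for Scrapy's CsvItemExporter so the example runs without Scrapy, the mapping is keyed by class rather than by dotted path, and `Item1`, `Item2`, and `PerTypeCsvPipeline` are hypothetical names.

```python
import csv
import os
import tempfile
from dataclasses import asdict, dataclass, fields


@dataclass
class Item1:  # hypothetical item class
    name: str


@dataclass
class Item2:  # hypothetical item class
    text: str


class PerTypeCsvPipeline:
    def __init__(self, output_dir, filenames):
        self.output_dir = output_dir
        self.filenames = filenames  # maps item class -> base filename
        self.files = {}
        self.writers = {}

    def open_spider(self, spider):
        # One file and one writer per configured item class.
        for item_class, name in self.filenames.items():
            path = os.path.join(self.output_dir, f"{name}.csv")
            handle = open(path, 'w', newline='', encoding='utf-8')
            writer = csv.DictWriter(
                handle,
                fieldnames=[f.name for f in fields(item_class)],
                delimiter=';',
            )
            writer.writeheader()
            self.files[item_class] = handle
            self.writers[item_class] = writer

    def process_item(self, item, spider):
        # Route the item to the writer registered for its class, if any.
        writer = self.writers.get(type(item))
        if writer is not None:
            writer.writerow(asdict(item))
        return item

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()


output_dir = tempfile.mkdtemp()
pipeline = PerTypeCsvPipeline(output_dir, {Item1: 'item1', Item2: 'item2'})
pipeline.open_spider(spider=None)
for item in [Item1('alice'), Item2('hello'), Item2('bye')]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(os.path.join(output_dir, 'item2.csv'), encoding='utf-8') as f:
    item2_lines = f.read().splitlines()
print(item2_lines)
```

Items whose class is not in the mapping are returned unchanged, so the pipeline can sit alongside others without swallowing anything.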

Answer 5

@Rejected's answer was the solution, but it needed some tweaks before it worked for me, so I am sharing them here. This is my pipeline.py:


Answer 6

From python>=3.10 https://www.python.org/dev/peps/pep-0622/

It may be handy to implement the router (@mdkb's answer) based on structural pattern matching.

Note: Scrapy items are also a legacy way of creating classes, since dataclasses are available in python>=3.7.

Answer 7

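I suggest adding the comments to the ProfileItem. That way, you can attach multiple comments to one person's profile. Secondly, handling this kind of data will be easier.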
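A minimal sketch of this nesting, assuming a dataclass-based item with a list-valued `comments` field (the field names and the `build_profile` helper are hypothetical). A declared field on a regular scrapy.Item can hold a list in the same way, as in the snippet from the question.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ProfileItem:  # hypothetical item holding its own comments
    name: str
    comments: list = field(default_factory=list)


def build_profile(name, comment_texts):
    # In a spider, comment_texts would come from something like
    # response.xpath(...).extract(); plain strings are used here.
    profile = ProfileItem(name=name)
    for text in comment_texts:
        profile.comments.append(text)
    return profile


profile = build_profile('alice', ['first', 'second'])
print(asdict(profile))
```

With this layout only one item type reaches the pipeline, so no type checks are needed there.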

Scrapy, Python: Multiple Item Classes in one pipeline?
