Question

To stay organized, I decided that the spider will populate three item classes.

Each item class has a number of fields that get populated.

from scrapy.item import Item, Field

class item_01(Item):
    item1 = Field()
    item2 = Field()
    item3 = Field()

class item_02(Item):
    item4 = Field()
    item5 = Field()

class item_03(Item):
    item6 = Field()
    item7 = Field()
    item8 = Field()

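There are multiple pages to scrape, all yielding the same items. In the spider I use XPathItemLoader to populate the 'containers'.

The goal is to pass the items to a MySQL pipeline that populates a single table. But here is the problem.

When I yield the three containers (per page), they are passed into the pipeline as three separate containers. Each one goes through the pipeline as its own BaseItem and populates only its own section of the MySQL table, leaving the other columns NULL.

What I would like to do is repackage these three containers into a single BaseItem so that they are passed into the pipeline as a single item.

Does anyone have suggestions for repackaging the items, either in the spider or in the pipeline?

Thanks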

Answer 1

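I used the following hack to get things moving, but if anyone can improve on it or point to a better solution, please share.

I load my items in the spider like this: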

items = [item1.load_item(), item2.load_item(), item3.load_item()]

Then I defined a function outside the spider:

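def rePackIt(items):
    # Merge the fields of every loaded item into one dict; on duplicate
    # keys, the later item wins.
    rePackage = rePackageItems()
    rePack = {}
    for item in items:
        rePack.update(dict(item))

    # Hack: the values are stored in the `fields` mapping (normally field
    # metadata) because rePackageItems declares no fields of its own.
    # Note: `fields` appears to be shared at class level, so stale keys from
    # one page could carry over to the next.
    for key, value in rePack.items():
        rePackage.fields[key] = value
    return rePackage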
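As a possible alternative to writing into `fields`, here is a minimal sketch of the same merge using plain dicts. It assumes a Scrapy version that accepts dicts as items (1.0 or later, as far as I know). The name `merge_items` and the sample values are hypothetical and only there for illustration.

```python
def merge_items(items):
    """Merge several dict-like items into one plain dict.

    Raises ValueError if two items carry different values for the same key,
    instead of silently letting the later one win.
    """
    merged = {}
    for item in items:
        for key, value in dict(item).items():
            if key in merged and merged[key] != value:
                raise ValueError("conflicting values for field %r" % key)
            merged[key] = value
    return merged


# Hypothetical stand-ins for the three load_item() results.
part_a = {"item1": "a", "item2": "b", "item3": "c"}
part_b = {"item4": "d", "item5": "e"}
part_c = {"item6": "f", "item7": "g", "item8": "h"}

merged = merge_items([part_a, part_b, part_c])
```

In the spider this would be used as `yield merge_items(items)`, and the pipeline would then receive one dict holding all eight fields.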

In items.py, I added the class it uses:

class rePackageItems(Item):
    """Repackage the items"""
    pass

After the spider has finished crawling the page and loading the items, I yield:

yield rePackIt(items)

This takes me to pipelines.py.

In process_item, to unpack the item, I did the following:

def process_item(self, item, spider):
    items = item.fields

`items` is now a dictionary containing all the fields extracted by the spider, which I then insert into a single database table.
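The insert step itself is not shown above, so here is a rough sketch of what writing one merged dictionary as a single row could look like. It uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs anywhere. The table name `scraped`, the `COLUMNS` tuple, and the `insert_row` helper are all assumptions made for illustration, not part of the original setup.

```python
import sqlite3

# Assumed column list, mirroring the eight fields from the item classes.
COLUMNS = ("item1", "item2", "item3", "item4",
           "item5", "item6", "item7", "item8")


def insert_row(conn, table, row):
    """Insert one merged dict as a single row.

    Column names are checked against COLUMNS because they cannot be passed
    as query parameters; the values themselves are parameterized. The table
    name is interpolated as-is, so it must come from trusted code.
    """
    unknown = set(row) - set(COLUMNS)
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    cols = [c for c in COLUMNS if c in row]
    placeholders = ", ".join("?" for _ in cols)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), placeholders)
    conn.execute(sql, [row[c] for c in cols])
    conn.commit()


# In-memory database with a hypothetical single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (%s)"
             % ", ".join("%s TEXT" % c for c in COLUMNS))

row = {"item1": "a", "item2": "b", "item3": "c", "item4": "d",
       "item5": "e", "item6": "f", "item7": "g", "item8": "h"}
insert_row(conn, "scraped", row)
rows = conn.execute("SELECT item1, item8 FROM scraped").fetchall()
```

With a MySQL driver such as MySQLdb or PyMySQL the placeholder would be `%s` rather than `?`, but the overall shape should be the same.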

Re-packing Scrapy spider items
