Question

In my items.py:

class NewAdsItem(Item):
    AdId        = Field()
    DateR       = Field()
    AdURL       = Field()

In my pipelines.py:

import sqlite3
from scrapy.conf import settings

con = None
class DbPipeline(object):

    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()

    ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using: 
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())) )
        ...

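How can I get the item field names (e.g. "AdId") from items.py before process_item() (in pipelines.py) is executed?

I run the spider with scrapy runspider myspider.py.

I have already tried adding "item" and/or "spider" as a parameter, like this: def setupDBCon(self, item), but that did not work and resulted in: TypeError: setupDBCon() missing 1 required positional argument: 'item'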

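Update: 2018-10-08

Result (A):

Partially following @granitosaurus's solution, I found that I can get the list of keys by:

(a) Adding to my main spider code: from adbot.items import NewAdsItem.
(b) Adding to the spider class: ikeys = NewAdsItem.fields.keys().
I can then access the keys from my pipelines.py via: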

    def open_spider(self, spider):
        self.ikeys = list(spider.ikeys)
        print("Keys in pipelines: \t%s" % ",".join(self.ikeys) )
        #self.createDbTable(ikeys)

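However, there are two problems with this approach:

I could not pass the ikeys list into createDbTable(). (I kept getting errors about missing arguments.)
The retrieved ikeys list is rearranged and does not keep the order in which the fields appear in items.py, which partly defeats the purpose. I still don't understand why they come out of order when all the docs say Python 3 should preserve the order of dicts, lists and so on. By contrast, when I use process_item() and get the keys via item.keys(), the order is preserved.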

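Result (B):

In the end, fixing (A) was too laborious and complicated, so I simply imported the relevant item class into my pipelines.py and used the item list as a global variable, like this: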

pipelines.py

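In this case I just decided to accept that the list seems to be sorted alphabetically, and worked around the problem by changing the key names. (Cheating!)

This is disappointing, because the code is ugly and contorted. Any better suggestions would be much appreciated.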

Answer 1

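Scrapy pipelines have three hook methods:

process_item(self, item, spider)
  This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

open_spider(self, spider)
  This method is called when the spider is opened.

close_spider(self, spider)
  This method is called when the spider is closed.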

https://doc.scrapy.org/en/latest/topics/item-pipeline.html

So you can only access items in the process_item method.
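
To make the call order concrete, here is a minimal sketch that drives a pipeline by hand in the order described above. CountingPipeline and the fake spider object are made up for illustration, and the loop at the bottom only imitates the order in which Scrapy calls the hooks; it is not Scrapy's engine code.

```python
from types import SimpleNamespace


class CountingPipeline:
    """Hypothetical pipeline that records which hooks were called."""

    def open_spider(self, spider):
        # Called once, before any item exists.
        self.calls = ["open_spider"]
        self.seen = 0

    def process_item(self, item, spider):
        # The only hook that receives an item.
        self.calls.append("process_item")
        self.seen += 1
        return item

    def close_spider(self, spider):
        # Called once, after the last item.
        self.calls.append("close_spider")


# Hand-written stand-ins for a spider and for Scrapy's call order.
spider = SimpleNamespace(name="demo")
pipeline = CountingPipeline()
pipeline.open_spider(spider)
for item in ({"AdId": 1}, {"AdId": 2}):
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)
print(pipeline.calls)
```

Since open_spider runs before any item has been yielded, anything it needs to know about the item has to come from somewhere else, such as the spider.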

If you want to get the item class, however, you can attach it to the spider class:

class MySpider(Spider):
    item_cls = MyItem

class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dict mapping each declared field name to its Field object
        self.setup_table(fields)
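
One way setup_table could look, as a sketch under assumptions: the table name new_ads, the TEXT column type and the FIELD_ORDER list are all made up here, and a plain dict stands in for spider.item_cls.fields so that the snippet runs without Scrapy. Because the question reports that the keys of fields do not come back in declaration order, the sketch takes the column order from an explicit list instead of relying on the dict.

```python
import sqlite3

# Assumed column order; keep it in sync with the Item declaration by hand.
FIELD_ORDER = ["AdId", "DateR", "AdURL"]


def setup_table(con, table, fields, order=FIELD_ORDER):
    """Create `table` with one TEXT column per declared item field."""
    # Only keep names that are really declared on the item.
    columns = [name for name in order if name in fields]
    # Identifiers cannot be bound with "?", so they are quoted instead.
    column_sql = ", ".join('"{}" TEXT'.format(name) for name in columns)
    con.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, column_sql))
    return columns


# Stand-in for spider.item_cls.fields (alphabetical order, as in the question).
fields = {"AdId": {}, "AdURL": {}, "DateR": {}}
con = sqlite3.connect(":memory:")
created = setup_table(con, "new_ads", fields)
print(created)
```

Inside a pipeline this would be a method called from open_spider, with the connection stored on self.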

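Alternatively, you can lazy-load during the process_item method itself:

class MyPipeline:
    item = None

    def process_item(self, item, spider):
        if not self.item:
            # First item seen: remember it and set up the table from it
            self.item = item
            self.setup_table(item)
        return item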
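
Filled out a little, the lazy approach might look like the sketch below. It assumes an in-memory SQLite database, a made-up table name new_ads and TEXT columns. Items are plain dicts here so that the snippet runs without Scrapy.

```python
import sqlite3


class LazyDbPipeline:
    """Sketch: create the table when the first item arrives."""

    table = "new_ads"  # assumed table name

    def open_spider(self, spider):
        self.con = sqlite3.connect(":memory:")  # assumed database location
        self.columns = None

    def process_item(self, item, spider):
        if self.columns is None:
            # First item: its keys define the columns.
            self.columns = list(item.keys())
            cols = ", ".join('"{}" TEXT'.format(c) for c in self.columns)
            self.con.execute('CREATE TABLE "{}" ({})'.format(self.table, cols))
        names = ", ".join('"{}"'.format(c) for c in self.columns)
        marks = ", ".join("?" for _ in self.columns)
        sql = 'INSERT INTO "{}" ({}) VALUES ({})'.format(self.table, names, marks)
        # Values are bound as parameters; missing keys become NULL.
        self.con.execute(sql, [item.get(c) for c in self.columns])
        return item

    def close_spider(self, spider):
        self.con.commit()
        self.con.close()


pipeline = LazyDbPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"AdId": "1", "DateR": "2018-10-08", "AdURL": "http://example.com/1"}, None)
pipeline.process_item(
    {"AdId": "2", "DateR": "2018-10-09", "AdURL": "http://example.com/2"}, None)
rows = pipeline.con.execute('SELECT * FROM "new_ads" ORDER BY rowid').fetchall()
print(rows)
```

The columns come from the first item that arrives, so a field that is unset on that item would be missing from the table. If that matters, the class-attribute approach with open_spider is the safer of the two.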

How to get/import the Scrapy item list from items.py to pipelines.py?