Question

首先，这是我正在尝试做的事情：

我有一个XMLFeedSpider，它遍历XML文件中的产品（节点）列表，并创建在管道中保存到我的数据库的项目。我第一次看到一个产品，我需要创建请求在产品的url字段上进行一些抓取以获取图像等。如果我看到相同的产品后来读取，我不想浪费时间/资源这样做，只是想跳过这些额外的请求。要查看要跳过的产品，我需要访问我的数据库以查看产品是否存在。

以下是我可以想到的各种方法：

只需为蜘蛛内的每个产品创建一个db请求。这个好像是个坏主意。
在我的商品商店管道中，我已经创建了一个数据库池，如下所示： dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)使用它似乎更有效，所以我不会不断创建新的数据库连接。我不知道如何在我的蜘蛛中访问实例化的管道类（这可能更像是一般的python问题）。
注意：这家伙基本上都是在问同样的问题，但并没有真正得到他想要的答案。 How to get the pipeline object in Scrapy spider
也许在开始抓取之前，将所有产品网址加载到内存中，以便在处理产品时对它们进行比较？在哪里这样做的好地方？
其他建议？

更新：这是我的db pool管道

class PostgresStorePipeline(object):
    """A pipeline to store the item in a MySQL database.
    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        print "Opening connection pool..."
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            database=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWD'],
            #charset='utf8',
            #use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)
        return cls(dbpool)

Answer 1

我认为你的意思是URL item，请记住，对于scrapy，item是数据输出，而pipeline是一种交易机制与那些输出项目。

当然，您不需要打开许多连接来执行数据库查询，但您必须执行必要的查询。这取决于您在数据库中有多少记录只能执行一个查询或每URL一个查询，您应该测试哪一个更适合您的情况。

我建议您设置自己的DUPEFILTER_CLASS，例如：

from scrapy.dupefilters import RFPDupeFilter

class DBDupeFilter(RFPDupeFilter):

    def __init__(self, *args, **kwargs):
        # self.cursor = .....                       # instantiate your cursor
        super(DBDupeFilter, self).__init__(*args, **kwargs)

    def request_seen(self, request):
        if self.cursor.execute("myquery"):          # if exists
            return True
        else:
            return super(DBDupeFilter, self).request_seen(request)

    def close(self, reason):
        self.cursor.close()                         # close  your cursor
        super(DBDupeFilter, self).close(reason)

<强>更新

这里的问题是DUPEFILTER_CLASS并没有在其request_seen对象上提供蜘蛛，甚至没有提供构造函数，所以我认为你的最佳镜头是IgnoreRequest 3}}，您可以在其中引发from scrapy.exceptions import IgnoreRequest class DBMiddleware(object): def __init__(self): pass @classmethod def from_crawler(cls, crawler): o = cls() crawler.signals.connect(o.spider_opened, signal=signals.spider_opened) return o def spider_opened(self, spider): spider.dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs) def process_request(self, request, spider): if spider.dbpool... # check if request.url inside the database raise IgnoreRequest()例外。

在蜘蛛上实例化数据库连接，您可以在蜘蛛本身（构造函数）上执行此操作，或者您也可以通过中间件或管道上的信号添加它，我们将其添加到中间件：
```
dbpool
```
现在在您的管道上，删除spider的实例化并在必要时从process_item参数中获取它，记住比spider.dbpool收到项目和蜘蛛作为参数，所以你应该可以使用call(command)检查数据库连接。
请记住Downloader Middleware。

这样你应该只在蜘蛛对象中做一个db连接的实例。

如何访问scrapy spider

1 个答案: