Question

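I am using Scrapy to crawl several websites that may share redundant information.

For each page I scrape, I store the page's URL, its title, and its HTML code in MongoDB. I want to avoid duplicates in the database, so I implemented a pipeline that checks whether a similar item is already stored. If it is, I raise a DropItem exception.

My problem is that whenever I drop an item by raising a DropItem exception, Scrapy writes the entire content of the item to the log (stdout or file). Since I extract the whole HTML code of each scraped page, a drop puts the whole HTML code into the log.

How can I silently drop an item without its content being shown?

Thank you for your time!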

from scrapy import log
from scrapy.exceptions import DropItem

class DatabaseStorage(object):
    """ Pipeline in charge of database storage.

    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item valdation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert item in mongo if not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
        else:
            # Fall back to the title when the URL is missing
            raise DropItem('Missing information on item scraped from %s' % (
                item.get('url') or item.get('title')))
        return item

Answer 1

The proper way to do this is to implement a custom LogFormatter for your project and change the logging level of dropped items.

Example:

from scrapy import log
from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }

Then, in your settings file, add something like:

LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'

I had bad luck with returning None; it caused exceptions in later pipelines.

Answer 2

This has changed a bit in recent Scrapy versions. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy:

import logging
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
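
Whether a dropped-item message actually appears depends on how the level returned by dropped() compares with the configured log threshold (the LOG_LEVEL setting in Scrapy). The sketch below uses only the standard logging module, not Scrapy itself, to illustrate that mechanism. ListHandler, log_drop and the logger name are made up for this illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that collects messages so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def log_drop(logger, level, item_repr):
    # Stand-in for what the 'level', 'msg' and 'args' keys end up controlling
    logger.log(level, "Dropped: %(item)s", {"item": item_repr})


logger = logging.getLogger("drop_demo")
logger.propagate = False
logger.setLevel(logging.INFO)  # plays the role of a log threshold of INFO
handler = ListHandler()
logger.addHandler(handler)

log_drop(logger, logging.WARNING, "<html>...</html>")  # at or above threshold: recorded
log_drop(logger, logging.DEBUG, "<html>...</html>")    # below threshold: suppressed
```

So with a formatter that returns logging.INFO, the item would presumably still be printed unless the threshold is set above INFO; returning logging.DEBUG and running with a threshold of INFO is one way to hide it.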

Answer 3

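OK, I found the answer before even posting the question. I still think it might be valuable for anyone who runs into the same problem.

Instead of dropping the object with a DropItem exception, you just return None: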

def process_item(self, item, spider):
    """ Method in charge of item valdation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert item in mongo if not already present
        if self.mongo.find_one({'url': item['url']}):
            return
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                level=log.INFO, spider=spider)
    else:
        # Fall back to the title when the URL is missing
        raise DropItem('Missing information on item scraped from %s' % (
            item.get('url') or item.get('title')))
    return item
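
One caveat, echoing the comment under Answer 1: if another pipeline runs after this one, it may be handed the None instead of an item. The following is a plain-Python sketch of that failure mode, not Scrapy code. It assumes each stage simply receives whatever the previous stage returned; run_stages, dedupe_stage and title_stage are hypothetical names.

```python
seen_urls = {"http://example.com/a"}


def dedupe_stage(item):
    # Silently "drops" duplicates by returning None
    if item["url"] in seen_urls:
        return None
    seen_urls.add(item["url"])
    return item


def title_stage(item):
    # A later stage that assumes it always gets a real item
    item["title"] = item["title"].strip()
    return item


def run_stages(item, stages):
    # Assumption: each stage is passed the previous stage's return value
    for stage in stages:
        item = stage(item)
    return item


stages = [dedupe_stage, title_stage]

# A new URL flows through both stages normally
fresh = run_stages({"url": "http://example.com/b", "title": " B "}, stages)

# A duplicate makes the later stage fail on None
try:
    run_stages({"url": "http://example.com/a", "title": " A "}, stages)
    failed = False
except TypeError:
    failed = True
```

If the deduplicating pipeline is the last one in the chain, this problem should not arise.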

Answer 4

Another way to solve this is to override the __repr__ method in your scrapy.Item subclass:

class SomeItem(scrapy.Item):
    scrape_date = scrapy.Field()
    spider_name = scrapy.Field()
    ...

    def __repr__(self):
        return ""

This way the item's content will not show up in the logs at all.

Answer 5

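As Levon pointed out in an earlier comment, it is also possible to override the __repr__ method of the Item you are processing.

This way, the message is still displayed in the Scrapy log, but you control how much of the content is shown, for example the first 150 characters of the web page. Assuming you have an Item that represents an HTML page like this, the __repr__ override could look like the following: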

class MyHTMLItem(scrapy.Item):
    url = scrapy.Field()
    htmlcode = scrapy.Field()
    [...]
    def __repr__(self):
        # Show the URL and only the first 150 characters of the HTML
        s = ""
        s += "URL: %s\n" % self.get('url')
        s += "Code (chunk): %s\n" % ((self.get('htmlcode'))[0:150])
        return s
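
The truncation idea can be tried without Scrapy installed. In this minimal sketch a plain dict subclass stands in for the scrapy.Item subclass; PageItem and MAX_CHARS are hypothetical names. It also guards against a missing htmlcode value, since slicing None would raise a TypeError.

```python
class PageItem(dict):
    """Hypothetical stand-in for a scrapy.Item subclass (a plain dict here)."""

    MAX_CHARS = 150  # how much of the HTML to show in logs

    def __repr__(self):
        html = self.get('htmlcode') or ''  # tolerate a missing or None field
        chunk = html[:self.MAX_CHARS]
        suffix = '...' if len(html) > self.MAX_CHARS else ''
        return "URL: %s\nCode (chunk): %s%s" % (self.get('url'), chunk, suffix)


item = PageItem(url='http://example.com', htmlcode='<p>' + 'x' * 1000 + '</p>')
short = repr(item)               # what a log line would contain
empty = repr(PageItem(url='u'))  # no htmlcode at all
```

The stored data is untouched; only the string used when the item is printed or logged gets shorter.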

Scrapy - Silently drop an item
