Question

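I am scraping a website with Scrapy and generating a very large document. It has 3 properties, one of which is an array of more than 5,000 objects, and each of those objects contains a few properties and small arrays. Written to a file, the whole thing should come to a little over 2 MB, which is not actually that big.

After scraping the object, I insert it into the database with the scrapy-mongodb pipeline. Every time, I get this error: https://gist.github.com/ranisalt/ac572185e11e5918082b

(There are 6 errors in total, one per object, but the crawler output was too large and has been trimmed.)

The objects that cannot be encoded are inside the large array I mentioned at the start.

What can make pymongo unable to encode an object, and which of those causes might apply to my document?

If you need anything else, please ask in the comments.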

Answer 1

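I believe the problem you are running into is that escaped characters are not fully converted to UTF-8 before the document is inserted into MongoDB from Python.

I have not checked the MongoDB changelog, but if I remember correctly, full Unicode should be supported since v2.2+.

Either way, you have two options: upgrade to a newer version of MongoDB (2.6), or modify/override your scrapy-mongodb script. To change scrapy_mongodb.py, look at these lines, where k is not converted to UTF-8 before the insert into MongoDB: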

# ... previous code ...
        key = {}
        if isinstance(self.config['unique_key'], list):
            for k in dict(self.config['unique_key']).keys():
                key[k] = item[k]
        else:
            key[self.config['unique_key']] = item[self.config['unique_key']]

        self.collection.update(key, item, upsert=True)
# ... and the rest ...

To fix this, you can add the following lines to the process_item function:

# ... previous code ...
def process_item(self, item, spider):
    """ Process the item and add it to MongoDB
    :type item: Item object
    :param item: The item to put into MongoDB
    :type spider: BaseSpider object
    :param spider: The spider running the queries
    :returns: Item object
    """
    item = dict(self._get_serialized_fields(item))
    # Recursively convert every unicode string to a UTF-8 byte string (Python 2 only).
    # Snippet taken from this SO answer:
    # http://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-ones-from-json-in-python
    def byteify(input):
        if isinstance(input, dict):
            return {byteify(key): byteify(value) for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
            # If the UTF-8 conversion above still does not work, drop the
            # non-ASCII characters entirely instead:
            # return input.encode('ASCII', 'ignore')
        else:
            return input
    # Finally, replace the item with its converted copy
    item = byteify(item)
    # ... rest of the code ... #
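
The byteify snippet relies on unicode and iteritems, which exist only in Python 2. As a rough Python 3 counterpart of the same recursive walk, the sketch below rebuilds a nested item from plain dicts, lists and text strings. The names to_bson_friendly and _as_text, and the choice to decode byte strings as UTF-8 with replacement characters, are assumptions made for illustration; none of this is part of scrapy-mongodb or pymongo.

```python
from collections.abc import Mapping


def _as_text(key):
    # Assumption: byte-string keys hold UTF-8 text.
    if isinstance(key, bytes):
        return key.decode("utf-8", errors="replace")
    return str(key)


def to_bson_friendly(value):
    """Recursively rebuild `value` from plain dicts, lists and scalars."""
    if isinstance(value, Mapping):
        # Any mapping-like object becomes a plain dict with string keys.
        return {_as_text(k): to_bson_friendly(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        # Every sequence-like container becomes a plain list.
        return [to_bson_friendly(v) for v in value]
    if isinstance(value, bytes):
        # Assumption: byte strings hold UTF-8 text; bad bytes are replaced.
        return value.decode("utf-8", errors="replace")
    # Everything else (numbers, None, dates, ...) is passed through unchanged.
    return value
```

In a pipeline this would be called in the same place as byteify, for example item = to_bson_friendly(item), just before the document is handed to the collection.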

If that still does not work, I recommend upgrading MongoDB to a newer version.
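
Before upgrading, it may also help to find out exactly which nested values are involved, since the question mentions that only a handful of objects inside the large array fail. The sketch below walks a document and lists the path and type of every value that falls outside a short list of plain types. SAFE_TYPES is a deliberately conservative assumption, not the actual list of types pymongo can encode, and find_suspect_values is a hypothetical helper name.

```python
import datetime

# Assumption: a conservative set of plain Python types; the types that can
# really be encoded depend on the pymongo version in use.
SAFE_TYPES = (str, bytes, bool, int, float, type(None), datetime.datetime)


def find_suspect_values(value, path="item"):
    """Return (path, type name) pairs for values outside SAFE_TYPES."""
    found = []
    if isinstance(value, dict):
        for k, v in value.items():
            if not isinstance(k, str):
                # Document keys are expected to be strings.
                found.append((f"{path}[{k!r}] (key)", type(k).__name__))
            found.extend(find_suspect_values(v, f"{path}[{k!r}]"))
    elif isinstance(value, (list, tuple)):
        for i, v in enumerate(value):
            found.extend(find_suspect_values(v, f"{path}[{i}]"))
    elif not isinstance(value, SAFE_TYPES):
        found.append((path, type(value).__name__))
    return found
```

Running it on the item right before the insert and logging the result should point at the specific array entries worth inspecting.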

Hope this helps.

BSON cannot encode object
