I am trying to compute the IDF of every word that appears in a set of documents I scraped. I have stored all the information in the following format:
{
    _id: 1245236476,
    url: https://something1.com,
    words: {
        doctor: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    }
},
{
    _id: 12346376,
    url: https://something2.com,
    words: {
        admin: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        student: {
            count: 89,
            idf: 0.0
        },
        .......
    }
},
.........
{
    _id: 57856376,
    url: https://something3.com,
    words: {
        ads: {
            count: 14,
            idf: 0.0
        },
        web: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    }
}
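For context, the `idf` field is meant to hold log(N / df), where N is the total number of documents and df is the number of documents containing the term (the same formula my code uses below). A minimal sketch with made-up document frequencies for a toy corpus of 3 documents:

```python
from math import log

total_docs = 3  # toy corpus size, not the real collection
# document frequency: how many of the 3 documents contain each term
document_frequency = {"doctor": 1, "boss": 2, "teacher": 2}

idfs = {term: log(total_docs / df) for term, df in document_frequency.items()}
# idfs["doctor"] == log(3.0); a term present in every document gets idf 0.0
```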
I am trying to count, for every word, the number of documents in the collection that contain it. The collection is over 3.5 GB. To check that my implementation is correct, I wrote the following code against a sample of 1000 documents from my collection:
from pymongo import MongoClient
from math import log

def merge(x, y):
    # Increment the document frequency of every word that appears in y.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
pipeline = [
    {"$project": {"_id": 1, "words": 1}},
    {"$limit": 1000}
]
data = list(client['db']['collection'].aggregate(pipeline=pipeline))

document_frequency = {}
for item in data:
    document_frequency = merge(document_frequency, item['words'])

documents = len(data)
idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(documents / value)
This code produces the IDF of every word in those 1000 documents in about a minute. But when I remove the "$limit" stage from the pipeline and try to compute the IDF values over all the documents, I get a memory error. How can I solve this with the pymongo API, or perhaps with the MongoDB aggregation framework? Is there a better way to approach this problem?
Answer 0 (score: 0)
Iterating over a cursor in Python, instead of materializing the whole result set with `list()`, reduces memory usage:
from pymongo import MongoClient
from math import log

def merge(x, y):
    # Increment the document frequency of every word that appears in y.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
total_doc = 0
document_frequency = {}
# find() returns a cursor, so documents are fetched in batches and only
# the running counts are kept in memory.
for doc in client['db']['collection'].find():
    total_doc += 1
    document_frequency = merge(document_frequency, doc['words'])

idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(total_doc / value)
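Alternatively, the document-frequency count can be pushed to the server with the aggregation framework, so the client only receives one small record per distinct word rather than every document. A sketch under the assumption that the `words` sub-document has the layout shown in the question (the `$objectToArray` operator requires MongoDB 3.4.4+, and `idf_from_rows` is a hypothetical helper name):

```python
from math import log

# Server-side document-frequency pipeline: $objectToArray turns the words
# sub-document into an array of {k: <word>, v: {count, idf}} pairs, $unwind
# emits one record per (document, word) pair, and $group counts how many
# documents each word occurs in.
df_pipeline = [
    {"$project": {"terms": {"$objectToArray": "$words"}}},
    {"$unwind": "$terms"},
    {"$group": {"_id": "$terms.k", "df": {"$sum": 1}}},
]

def idf_from_rows(rows, total_doc):
    # rows: the output of collection.aggregate(df_pipeline, allowDiskUse=True),
    # i.e. dicts of the form {"_id": <word>, "df": <document frequency>}
    return {row["_id"]: log(total_doc / row["df"]) for row in rows}
```

Against the real collection this would run as `idf_from_rows(coll.aggregate(df_pipeline, allowDiskUse=True), coll.count_documents({}))`; `allowDiskUse=True` lets the `$group` stage spill to disk, which matters for a 3.5 GB collection.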