I am trying to compute the IDF of every word that appears in a set of documents I scraped. I have stored all the information in the following format:
{
    _id: 1245236476,
    url: https://something1.com,
    words: {
        doctor: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    }
},
{
    _id: 12346376,
    url: https://something2.com,
    words: {
        admin: {
            count: 14,
            idf: 0.0
        },
        boss: {
            count: 43,
            idf: 0.0
        },
        student: {
            count: 89,
            idf: 0.0
        },
        .......
    }
},
.........
{
    _id: 57856376,
    url: https://something3.com,
    words: {
        ads: {
            count: 14,
            idf: 0.0
        },
        web: {
            count: 43,
            idf: 0.0
        },
        teacher: {
            count: 89,
            idf: 0.0
        },
        .......
    }
}
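For context, the `idf` field is meant to hold log(N / df), where N is the total number of documents and df is the number of documents containing the term (the same formula my code uses below). A minimal sketch with made-up document frequencies for a toy corpus of 3 documents:

```python
from math import log

total_docs = 3  # toy corpus size, not the real collection
# document frequency: how many of the 3 documents contain each term
document_frequency = {"doctor": 1, "boss": 2, "teacher": 2}

idfs = {term: log(total_docs / df) for term, df in document_frequency.items()}
# idfs["doctor"] == log(3.0); a term present in every document gets idf 0.0
```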
I am trying to count, for every word, the number of documents in the collection that contain it. The collection is over 3.5 GB. To check that my implementation is correct, I wrote the following code against a sample of 1000 documents from my collection:
from pymongo import MongoClient
from math import log

def merge(x, y):
    # Increment the document frequency of every word that appears in y.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
pipeline = [
    {"$project": {"_id": 1, "words": 1}},
    {"$limit": 1000}
]
data = list(client['db']['collection'].aggregate(pipeline=pipeline))

document_frequency = {}
for item in data:
    document_frequency = merge(document_frequency, item['words'])

documents = len(data)
idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(documents / value)
This code produces the IDF of every word in those 1000 documents in about a minute. But when I remove the "$limit" stage from the pipeline and try to compute the IDF values over all the documents, I get a memory error. How can I solve this with the pymongo API, or perhaps with the MongoDB aggregation framework? Is there a better way to approach this problem?
Answer 0 (score: 0)
Iterating over a cursor in Python, instead of materializing the whole result set with `list()`, reduces memory usage:
from pymongo import MongoClient
from math import log

def merge(x, y):
    # Increment the document frequency of every word that appears in y.
    for key in y:
        x[key] = x.get(key, 0) + 1
    return x

client = MongoClient('mongodb-uri')
total_doc = 0
document_frequency = {}
# find() returns a cursor, so documents are fetched in batches and only
# the running counts are kept in memory.
for doc in client['db']['collection'].find():
    total_doc += 1
    document_frequency = merge(document_frequency, doc['words'])

idfs = {}
for key, value in document_frequency.items():
    idfs[key] = log(total_doc / value)
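Alternatively, the document-frequency count can be pushed to the server with the aggregation framework, so the client only receives one small record per distinct word rather than every document. A sketch under the assumption that the `words` sub-document has the layout shown in the question (the `$objectToArray` operator requires MongoDB 3.4.4+, and `idf_from_rows` is a hypothetical helper name):

```python
from math import log

# Server-side document-frequency pipeline: $objectToArray turns the words
# sub-document into an array of {k: <word>, v: {count, idf}} pairs, $unwind
# emits one record per (document, word) pair, and $group counts how many
# documents each word occurs in.
df_pipeline = [
    {"$project": {"terms": {"$objectToArray": "$words"}}},
    {"$unwind": "$terms"},
    {"$group": {"_id": "$terms.k", "df": {"$sum": 1}}},
]

def idf_from_rows(rows, total_doc):
    # rows: the output of collection.aggregate(df_pipeline, allowDiskUse=True),
    # i.e. dicts of the form {"_id": <word>, "df": <document frequency>}
    return {row["_id"]: log(total_doc / row["df"]) for row in rows}
```

Against the real collection this would run as `idf_from_rows(coll.aggregate(df_pipeline, allowDiskUse=True), coll.count_documents({}))`; `allowDiskUse=True` lets the `$group` stage spill to disk, which matters for a 3.5 GB collection.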