I am trying to figure out whether MongoDB can be used to help with our storage and processing problem. The idea is that the computation will be done on each node in a multiprocessed fashion and written to MongoDB with a unique ObjectId. The data structure looks like this:
{a: {b: {c: [100, 200, 300]}}}
a, b and c are integer keys.
When the computation is done and all records have been written to Mongo, the documents must be combined so that we group by the top-level a, then by b, then by c. Two documents might therefore contain (Example A):
document1:{24: {67: {12: [100, 200]}}}
document2:{24: {68: {12: [100, 200]}}}
We then merge them:
merged: {24: {67: {12: [100, 200]}, 68: {12: [100, 200]}}}
If instead we have another two documents (Example B):
document1:{24: {67: {12: [100, 200]}}}
document2:{24: {67: {12: [300, 400]}}}
merged: {24: {67: {12: [100, 200, 300, 400]}}}
What is the best way to combine these nested structures? I could manually loop over each document and do this in Python, but is there a smarter way? I need to preserve the underlying data structure.
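To be concrete, the kind of manual loop I have in mind is roughly the sketch below (merge_into is just an illustrative name, not code we actually run):

def merge_into(target, doc):
    # Recursively merge one nested document into the accumulated target.
    for key, value in doc.items():
        if isinstance(value, dict):
            merge_into(target.setdefault(key, {}), value)
        else:
            # Leaf level: extend the list of values for this key.
            target.setdefault(key, []).extend(value)

merged = {}
merge_into(merged, {24: {67: {12: [100, 200]}}})
merge_into(merged, {24: {67: {12: [300, 400]}}})
# merged is now {24: {67: {12: [100, 200, 300, 400]}}}, as in Example B.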
Answer 0 (score: 2)
What is not smart about aggregating with Python? Consider the following function:
def aggregate(documents, base_document=None, unique=True):
    # Use unique=False to keep all values in the lists, even if repeated,
    # like [100, 100, 200, 300]; leave it True otherwise.
    for doc in documents:
        if isinstance(doc, list):
            # Leaf level: collect the values, optionally de-duplicate, then sort.
            if base_document is None:
                base_document = []
            for d in doc:
                base_document.append(d)
            if unique:
                base_document = set(base_document)
            base_document = sorted(base_document)
        else:
            # Dict level: recurse into each key, merging with any existing subtree.
            if base_document is None:
                base_document = {}
            for d in doc:
                b = base_document[d] if d in base_document \
                    else [] if isinstance(doc[d], list) else {}
                base_document[d] = aggregate([doc[d]], base_document=b)
    return base_document
Testing it with the following set of documents, it produces this aggregate:
documents = [{20: {55: { 7: [100, 200]}}},
             {20: {68: {12: [100, 200]}}},
             {20: {68: {12: [500, 200]}}},
             {23: {67: {12: [100, 200]}}},
             {23: {68: {12: [100, 200]}}},
             {24: {67: {12: [300, 400]}}},
             {24: {67: {12: [100, 200]}}},
             {24: {67: {12: [100, 200]}}},
             {24: {67: {12: [300, 500]}}},
             {24: {67: {13: [600, 400]}}},
             {24: {67: {13: [700, 900]}}},
             {24: {68: {12: [100, 200]}}},
             {25: {67: {12: [100, 200]}}},
             {25: {67: {12: [300, 400]}}}, ]
from pprint import pprint
pprint(aggregate(documents))
'''
{20: {55: {7: [100, 200]}, 68: {12: [100, 200, 500]}},
 23: {67: {12: [100, 200]}, 68: {12: [100, 200]}},
 24: {67: {12: [100, 200, 300, 400, 500], 13: [400, 600, 700, 900]},
      68: {12: [100, 200]}},
 25: {67: {12: [100, 200, 300, 400]}}}
'''
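For example, with unique=False the repeated values are kept rather than de-duplicated; a quick sketch using the same function:

docs = [{24: {67: {12: [100, 200]}}},
        {24: {67: {12: [100, 200]}}}]
pprint(aggregate(docs, unique=False))
'''
{24: {67: {12: [100, 100, 200, 200]}}}
'''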
Answer 1 (score: 0)
Building on @chapelo's answer:
## Import the Python MongoDB driver:
import pymongo

## @chapelo's aggregation function, copied from the answer above:
def aggregate(documents, base_document=None, unique=True):
    # use unique=False to keep all values in the lists, even if repeated
    # like [100, 100, 200, 300], leave it True otherwise
    for doc in documents:
        if isinstance(doc, list):
            if base_document is None:
                base_document = []
            for d in doc:
                base_document.append(d)
            if unique:
                base_document = set(base_document)
            base_document = sorted(base_document)
        else:
            if base_document is None:
                base_document = {}
            for d in doc:
                b = base_document[d] if d in base_document \
                    else [] if isinstance(doc[d], list) else {}
                base_document[d] = aggregate([doc[d]], base_document=b)
    return base_document
## Connect to MongoDB (MongoClient returns a client, not a single database):
client = pymongo.MongoClient()

## Query the old documents, excluding their ObjectIds:
old_docs = client.old.collection.find({}, {"_id": 0})

## Run the old documents through the aggregation function:
new_dict = aggregate(old_docs)

## BSON only allows string keys, so convert the integer keys before inserting:
def stringify_keys(value):
    if isinstance(value, dict):
        return {str(k): stringify_keys(v) for k, v in value.items()}
    return value

## Insert one aggregated document per top-level key into the new collection:
for key in new_dict:
    client.new.collection.insert_one({str(key): stringify_keys(new_dict[key])})

## Close the MongoDB connection:
client.close()