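I have several MongoDB collections. Each collection contains 100,000 documents, and each document has 10,000 columns (fields). A Python script runs aggregation queries in multiple threads, with each thread calling aggregate on a separate collection. Here is the corresponding code: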
import pymongo
import time
import threading
import sys
mongocli = pymongo.MongoClient(host="192.168.99.100", username="admin", password="admin123", \
authSource="analyse_db")
db = mongocli['analyse_db']
collection = sys.argv[1]
column = sys.argv[2]
threaded = int(sys.argv[3])
# Start times of the time column in each collection.
# There are 7 collections, one for each day of the week.
start_times = [1587288313, 1587374713, 1587461113, 1587547513, 1587633514, 1587719914, 1587806314]
interval = 86000
time_column = "timestamp_EP"
col = db[collection]
group_query = {"$group": {"_id": "", "result": {"$avg": "$" + column}}}
def aggregate_thread(local_col, start_time, end_time, thread_id):
    local_mongocli = pymongo.MongoClient(host="192.168.99.100", username="admin", password="admin123", \
                                         authSource="analyse_db")
    local_db = local_mongocli['analyse_db']
    local_col_obj = local_db[local_col]
    #cursor = local_col.aggregate([{"$match": {time_column: \
    #           {"$gte": start_time, "$lt": end_time}}}, group_query])
    thr_start_time = time.time()
    cursor = local_col_obj.aggregate([group_query])
    try:
        print(cursor.next())
    except Exception:
        print("no data")
    print("thread " + str(thread_id) + " completion time: ", time.time() - thr_start_time)
def aggregate_threaded():
    thr_list = []
    nthreads = len(start_times)
    for i in range(0, nthreads):
        ftime = start_times[i]
        ttime = ftime + interval
        local_col = collection + "_day" + str(i)
        thr = threading.Thread(target=aggregate_thread, args=[local_col, ftime, ttime, i], daemon=True)
        thr_list.append(thr)
        thr.start()
    for thr in thr_list:
        _res = thr.join()
proc_start = time.time()
aggregate_threaded()
print(time.time() - proc_start)
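For reference, the commented-out $match stage in aggregate_thread would restrict each aggregation to one day's time window before averaging. A minimal sketch of how that pipeline could be put together, assuming a helper called build_pipeline (a made-up name for illustration, not part of pymongo):

```python
def build_pipeline(column, time_column, start_time, end_time):
    """Build a pipeline that filters by time window, then averages one column.

    build_pipeline is a hypothetical helper; it only builds plain Python
    dicts and never contacts MongoDB.
    """
    # Keep only documents whose timestamp lies in [start_time, end_time).
    match_stage = {"$match": {time_column: {"$gte": start_time, "$lt": end_time}}}
    # A constant _id puts every matched document into a single group.
    group_stage = {"$group": {"_id": "", "result": {"$avg": "$" + column}}}
    return [match_stage, group_stage]


if __name__ == "__main__":
    # Values taken from the script above: first start time plus the interval.
    print(build_pipeline("tag1", "timestamp_EP", 1587288313, 1587288313 + 86000))
```

In the script, the returned list would take the place of [group_query] in the call to local_col_obj.aggregate.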
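Now, when I run the script, the time taken to complete the aggregation is proportional to the number of threads. That is, latency increases linearly with the number of queries executing concurrently. Here is the script's output (7 threads running aggregation queries on different collections):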
C:\Users\AJINKYA\itanta\live_data>python perf_test3.py Archive-LiveDataLog5 tag1 1
{'_id': '', 'result': 50.054371460796965}
thread 3 completion time: 140.73034977912903
{'_id': '', 'result': 50.20921849745933}
thread 5 completion time: 146.46782159805298
{'_id': '', 'result': 50.064871338705366}
thread 4 completion time: 147.66157269477844
{'_id': '', 'result': 50.17267241078592}
thread 1 completion time: 151.08023619651794
{'_id': '', 'result': 49.85328077580493}
thread 2 completion time: 151.3344430923462
{'_id': '', 'result': 49.993023336937945}
thread 0 completion time: 151.4148395061493
{'_id': '', 'result': 49.89189660585342}
thread 6 completion time: 151.54917550086975
151.57819509506226
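So the 7 threads took 151 seconds to complete the aggregation.
On the other hand, if only a single thread runs the aggregation on a single collection, it takes much less time. If start_times in the script above has only one element, the result is as follows: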
C:\Users\AJINKYA\itanta\live_data>python perf_test3.py Archive-LiveDataLog5 tag1 1
{'_id': '', 'result': 49.993023336937945}
thread 0 completion time: 43.36866021156311
43.37059950828552
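To put the two runs side by side, here is a rough calculation that uses only the wall-clock totals printed above. The variable names are made up for illustration, and the serial estimate assumes every collection would take about as long as the single run, which the output above does not confirm.

```python
# Wall-clock totals copied from the two runs above (seconds).
seven_thread_total = 151.57819509506226
single_thread_total = 43.37059950828552
num_threads = 7

# How many times longer the 7-thread run took than the single-thread run.
slowdown = seven_thread_total / single_thread_total

# Assumption: running the 7 aggregations one after another would cost
# about 7 times the single run.
serial_estimate = num_threads * single_thread_total

print(f"slowdown vs. single thread: {slowdown:.2f}x")
print(f"estimated serial time: {serial_estimate:.1f} s")
```

By this estimate the 7-thread run is about 3.5 times slower than the single query, and takes roughly half the time a fully serial run might need.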
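With only one aggregation query, it takes just 43 seconds. My expectation was that the multithreaded queries would take roughly the same time as the single-threaded query. It now seems clear that multiple queries are not executed concurrently in MongoDB. Is this a known limitation of MongoDB? Is there any configuration parameter in MongoDB that controls concurrency?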
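One way to see how the timings change with concurrency would be to run the same aggregation at several thread counts. The sketch below is only an illustration under stated assumptions: time_call, run_with_workers and make_mongo_average are hypothetical names, the connection details are copied from the script above, and a single MongoClient is shared because pymongo clients are thread-safe and pool their connections.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_call(func, arg):
    """Run func(arg) and return (arg, result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(arg)
    return arg, result, time.perf_counter() - started


def run_with_workers(func, args, max_workers):
    """Run func over args with at most max_workers threads.

    Returns (per-call timings in input order, total wall-clock seconds).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        timings = list(pool.map(lambda a: time_call(func, a), args))
    return timings, time.perf_counter() - started


def make_mongo_average(column):
    """Return a function that averages `column` over one collection."""
    import pymongo  # imported lazily so the timing helpers work without it

    # One shared client for all threads (connection details from the script above).
    client = pymongo.MongoClient(host="192.168.99.100", username="admin",
                                 password="admin123", authSource="analyse_db")
    db = client["analyse_db"]
    group = {"$group": {"_id": "", "result": {"$avg": "$" + column}}}

    def average(collection_name):
        # Returns the single result document, or None if the collection is empty.
        return next(db[collection_name].aggregate([group]), None)

    return average


# Example (assumes the collections from the script above exist):
# average = make_mongo_average("tag1")
# names = ["Archive-LiveDataLog5_day" + str(i) for i in range(7)]
# for workers in (1, 2, 4, 7):
#     _, total = run_with_workers(average, names, workers)
#     print(workers, "workers:", total)
```

Comparing the totals for 1, 2, 4 and 7 workers would show whether the time grows with every added query or only past some point.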