I have a MongoDB database with a collection named documents. The collection has 436333 entries, and I have to run some operations on each of them. My code is:
import logging

import numpy as np
import pandas as pd
from pymongo import MongoClient

# POS_TAGS and _match_values are not shown here.
client = MongoClient()
db = client['database']
documents = db.documents.find()
# One DataFrame per text field; rows are POS tags, columns are document ids.
bias = {
    'title': pd.DataFrame(index=POS_TAGS, dtype=np.int32),
    'abstract': pd.DataFrame(index=POS_TAGS, dtype=np.int32),
    'title_abstract': pd.DataFrame(index=POS_TAGS, dtype=np.int32)}
for i, doc in enumerate(documents):
    pos_tags = doc['agg_pos_tagging']
    for k, v in pos_tags.items():
        # Each document adds one new column to the matching DataFrame.
        if len(v) != 0:
            bias[k].loc[:, doc["id"]] = \
                pd.Series(_match_values(v), index=POS_TAGS)
        else:
            bias[k].loc[:, doc["id"]] = \
                pd.Series([0] * len(POS_TAGS), index=POS_TAGS)
    if i % 10000 == 0:
        logging.info("{0} docs added".format(i))
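I am running CentOS 6.7 with MongoDB 2.6. I use Python 3.4, pymongo 3.2.2 (with the C extensions installed) and pandas 0.18.0. The program and the server run on the same machine. The collection is indexed on its id.
The problem is that the time needed to retrieve each batch of 10000 entries grows with the number of rows already retrieved. Here is the log: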
2016-06-12 16:11:17,016 : INFO : 0 docs added
2016-06-12 16:13:45,553 : INFO : 10000 docs added
2016-06-12 16:17:47,117 : INFO : 20000 docs added
2016-06-12 16:23:14,786 : INFO : 30000 docs added
2016-06-12 16:29:40,412 : INFO : 40000 docs added
2016-06-12 16:38:11,807 : INFO : 50000 docs added
2016-06-12 16:50:38,987 : INFO : 60000 docs added
2016-06-12 17:04:19,188 : INFO : 70000 docs added
2016-06-12 17:18:21,669 : INFO : 80000 docs added
2016-06-12 17:34:18,687 : INFO : 90000 docs added
2016-06-12 17:53:11,497 : INFO : 100000 docs added
2016-06-12 18:18:57,503 : INFO : 110000 docs added
2016-06-12 18:51:33,503 : INFO : 120000 docs added
2016-06-12 19:24:47,799 : INFO : 130000 docs added
2016-06-12 19:57:40,690 : INFO : 140000 docs added
2016-06-12 20:31:44,103 : INFO : 150000 docs added
2016-06-12 21:10:24,900 : INFO : 160000 docs added
2016-06-12 22:02:46,849 : INFO : 170000 docs added
2016-06-12 22:57:50,108 : INFO : 180000 docs added
2016-06-12 23:55:52,541 : INFO : 190000 docs added
2016-06-13 00:56:43,676 : INFO : 200000 docs added
2016-06-13 02:07:39,460 : INFO : 210000 docs added
2016-06-13 03:22:43,074 : INFO : 220000 docs added
2016-06-13 04:40:00,819 : INFO : 230000 docs added
2016-06-13 06:05:09,572 : INFO : 240000 docs added
2016-06-13 07:27:09,148 : INFO : 250000 docs added
2016-06-13 08:58:45,093 : INFO : 260000 docs added
2016-06-13 10:13:26,832 : INFO : 270000 docs added
2016-06-13 11:30:29,821 : INFO : 280000 docs added
2016-06-13 12:53:54,008 : INFO : 290000 docs added
2016-06-13 14:20:45,617 : INFO : 300000 docs added
2016-06-13 16:01:00,446 : INFO : 310000 docs added
2016-06-13 17:40:26,558 : INFO : 320000 docs added
2016-06-13 19:30:14,056 : INFO : 330000 docs added
2016-06-13 21:19:33,698 : INFO : 340000 docs added
2016-06-13 23:10:49,665 : INFO : 350000 docs added
Adding documents 340000 to 350000 took nearly two hours, but adding the first 10000 took only about 2 minutes.
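To make the comparison concrete, the two durations can be computed directly from the log timestamps. This is only a small sketch using the standard library; the helper names `parse_timestamp` and `seconds_between` are made up for illustration, and the four log lines are copied from the log above.

```python
from datetime import datetime

# First two and last two entries, copied from the log above.
LOG_LINES = [
    "2016-06-12 16:11:17,016 : INFO : 0 docs added",
    "2016-06-12 16:13:45,553 : INFO : 10000 docs added",
    "2016-06-13 21:19:33,698 : INFO : 340000 docs added",
    "2016-06-13 23:10:49,665 : INFO : 350000 docs added",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' part of a log line."""
    stamp = line.split(" : ")[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(earlier, later):
    """Elapsed seconds between two log lines."""
    return (parse_timestamp(later) - parse_timestamp(earlier)).total_seconds()


first_batch = seconds_between(LOG_LINES[0], LOG_LINES[1])
last_batch = seconds_between(LOG_LINES[2], LOG_LINES[3])
print("first 10000 docs: {:.0f} s".format(first_batch))
print("docs 340000-350000: {:.0f} s".format(last_batch))
```

This gives roughly 149 s for the first batch and roughly 6676 s (about 1 h 51 min) for the last one shown, so the per-batch time grew by a factor of about 45.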
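I don't think the computation is the problem: it should take roughly the same time for every batch of 10000 items. I looked at the database's log file, and it does not show all of the getMore operations (I summed all the nreturns values, and the total does not match the number of documents retrieved), so I don't know whether I should trust it.
I tried changing cursor_type to EXHAUST, but that did not change the result.
What can I do to speed this up?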
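One hypothesis that might be worth testing is that adding a new column to an ever wider DataFrame on each iteration is what gets slower, rather than the cursor. A minimal sketch of an alternative is to collect the per-document counts in plain dicts and build each DataFrame once at the end. Everything here is an assumption for illustration: the three-tag `POS_TAGS`, the `count_tags` stand-in for `_match_values`, the shape of `agg_pos_tagging` (lists of tag strings), and the sample documents.

```python
from collections import defaultdict

# Hypothetical stand-ins: the real POS_TAGS and _match_values are not shown.
POS_TAGS = ["NOUN", "VERB", "ADJ"]


def count_tags(tags):
    """Assumed behaviour of _match_values: one count per entry of POS_TAGS."""
    return [tags.count(tag) for tag in POS_TAGS]


def collect_columns(docs):
    """Gather one list of counts per document in plain dicts (no pandas yet)."""
    columns = defaultdict(dict)
    for doc in docs:
        for field, tags in doc["agg_pos_tagging"].items():
            if tags:
                columns[field][doc["id"]] = count_tags(tags)
            else:
                columns[field][doc["id"]] = [0] * len(POS_TAGS)
    return columns


def to_frames(columns):
    """Build each DataFrame in a single call once all columns are collected."""
    import pandas as pd  # imported here so the collection step runs without pandas
    return {field: pd.DataFrame(cols, index=POS_TAGS)
            for field, cols in columns.items()}


# Made-up documents in the assumed shape.
sample_docs = [
    {"id": 1, "agg_pos_tagging": {"title": ["NOUN", "NOUN"], "abstract": []}},
    {"id": 2, "agg_pos_tagging": {"title": ["VERB"], "abstract": ["ADJ", "NOUN"]}},
]
columns = collect_columns(sample_docs)
```

With the real data, `to_frames(collect_columns(db.documents.find()))` would play the role of the `bias` dict. Whether this removes the slowdown has not been measured here.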
Edit: I ran the code without any computation, and this is what I have so far:
2016-06-14 15:43:19,263 : INFO : 0 docs added
2016-06-14 15:43:23,488 : INFO : 10000 docs added
2016-06-14 15:43:27,532 : INFO : 20000 docs added
2016-06-14 15:43:32,112 : INFO : 30000 docs added
2016-06-14 15:43:36,098 : INFO : 40000 docs added
2016-06-14 15:44:13,818 : INFO : 50000 docs added
2016-06-14 15:44:52,624 : INFO : 60000 docs added
2016-06-14 15:45:48,415 : INFO : 70000 docs added
2016-06-14 15:47:10,645 : INFO : 80000 docs added
2016-06-14 15:48:26,565 : INFO : 90000 docs added
2016-06-14 15:48:57,270 : INFO : 100000 docs added
2016-06-14 15:49:52,323 : INFO : 110000 docs added
2016-06-14 15:51:22,808 : INFO : 120000 docs added
2016-06-14 15:53:30,554 : INFO : 130000 docs added
2016-06-14 15:55:31,960 : INFO : 140000 docs added
There is still some lag that grows over time.
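A fetch-only run like the one above can be timed per batch with a small helper, so that the elapsed time of each batch is logged directly instead of being read off the timestamps. This is a sketch using only the standard library; `timed_batches` is a hypothetical name, and `range(25)` merely stands in for the cursor.

```python
import time


def timed_batches(iterable, batch_size):
    """Yield (items_seen, seconds_for_this_batch) after each full batch,
    plus one final entry for a trailing partial batch."""
    start = time.perf_counter()
    seen = 0
    for seen, _item in enumerate(iterable, start=1):
        if seen % batch_size == 0:
            now = time.perf_counter()
            yield seen, now - start
            start = now
    if seen % batch_size != 0:
        yield seen, time.perf_counter() - start


# Stand-in for the cursor; any iterable works, e.g. db.documents.find().
report = list(timed_batches(range(25), batch_size=10))
for seen, seconds in report:
    print("{0} docs fetched, {1:.6f} s for this batch".format(seen, seconds))
```

Running the same helper once over the bare cursor and once over the full loop would show how much of the growth comes from fetching and how much from the per-document work.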