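I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro. I have a database of articles with their respective topics.
I am building an application that queries the database for text phrases and associates article topics with each queried phrase. Topics are assigned based on how relevant the phrase is to the articles.
The bottleneck seems to be Python socket communication with localhost.
Here is my cProfile output: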
topics_fit (PhraseVectorizer_1_1.py:668)
function called 1 times
1930698 function calls (1929630 primitive calls) in 148.209 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 286 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.224 1.224 148.209 148.209 PhraseVectorizer_1_1.py:668(topics_fit)
206272 0.193 0.000 146.780 0.001 cursor.py:1041(next)
601 0.189 0.000 146.455 0.244 cursor.py:944(_refresh)
534 0.030 0.000 146.263 0.274 cursor.py:796(__send_message)
534 0.009 0.000 141.532 0.265 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 141.484 0.265 mongo_client.py:768(_reset_on_error)
534 0.019 0.000 141.482 0.265 server.py:69(send_message_with_response)
534 0.002 0.000 141.364 0.265 pool.py:225(receive_message)
535 0.083 0.000 141.362 0.264 network.py:106(receive_message)
1070 1.202 0.001 141.278 0.132 network.py:127(_receive_data_on_socket)
3340 140.074 0.042 140.074 0.042 {method 'recv' of '_socket.socket' objects}
535 0.778 0.001 4.700 0.009 helpers.py:88(_unpack_response)
535 3.828 0.007 3.920 0.007 {bson._cbson.decode_all}
67 0.099 0.001 0.196 0.003 {method 'sort' of 'list' objects}
206187 0.096 0.000 0.096 0.000 PhraseVectorizer_1_1.py:705(<lambda>)
206187 0.096 0.000 0.096 0.000 database.py:339(_fix_outgoing)
206187 0.074 0.000 0.092 0.000 objectid.py:68(__init__)
1068 0.005 0.000 0.054 0.000 server.py:135(get_socket)
1068/534 0.010 0.000 0.041 0.000 contextlib.py:21(__exit__)
1068 0.004 0.000 0.041 0.000 pool.py:501(get_socket)
534 0.003 0.000 0.028 0.000 pool.py:208(send_message)
534 0.009 0.000 0.026 0.000 pool.py:573(return_socket)
567 0.001 0.000 0.026 0.000 socket.py:227(meth)
535 0.024 0.000 0.024 0.000 {method 'sendall' of '_socket.socket' objects}
534 0.003 0.000 0.023 0.000 topology.py:134(select_server)
206806 0.020 0.000 0.020 0.000 collection.py:249(database)
418997 0.019 0.000 0.019 0.000 {len}
449 0.001 0.000 0.018 0.000 topology.py:143(select_server_by_address)
534 0.005 0.000 0.018 0.000 topology.py:82(select_servers)
1068/534 0.001 0.000 0.018 0.000 contextlib.py:15(__enter__)
534 0.002 0.000 0.013 0.000 thread_util.py:83(release)
207307 0.010 0.000 0.011 0.000 {isinstance}
534 0.005 0.000 0.011 0.000 pool.py:538(_get_socket_no_auth)
534 0.004 0.000 0.011 0.000 thread_util.py:63(release)
534 0.001 0.000 0.011 0.000 mongo_client.py:673(_get_topology)
535 0.003 0.000 0.010 0.000 topology.py:57(open)
206187 0.008 0.000 0.008 0.000 {method 'popleft' of 'collections.deque' objects}
535 0.002 0.000 0.007 0.000 topology.py:327(_apply_selector)
536 0.003 0.000 0.007 0.000 topology.py:286(_ensure_opened)
1071 0.004 0.000 0.007 0.000 periodic_executor.py:50(open)
In particular, {method 'recv' of '_socket.socket' objects} seems to be causing the trouble.
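As a rough sanity check on the numbers above, the sketch below derives a few ratios from the first listing. The figures are copied from the profile; reading the 67 calls to list.sort as "67 queries returned at least one document" is my inference from the code (sort is only called on a non-empty result), not something the profiler states. It works out to roughly 0.26 s of recv per round trip, about 3,000 documents pulled back per non-empty query, and recv accounting for about 94% of the total time. If that reading is right, the time may be going into waiting for the server and transferring results rather than into Python's socket layer itself.

```python
# Figures copied from the first cProfile listing.
recv_seconds = 140.074       # tottime of {method 'recv' of '_socket.socket' objects}
round_trips = 534            # calls to cursor.py __send_message
documents_sorted = 206187    # calls to the sort-key lambda, one per returned document
non_empty_queries = 67       # calls to list.sort (assumed: one per non-empty result)
total_seconds = 148.209      # cumulative time of topics_fit

seconds_per_round_trip = recv_seconds / round_trips
documents_per_query = documents_sorted / float(non_empty_queries)
recv_share = recv_seconds / total_seconds

print("recv per round trip: %.3f s" % seconds_per_round_trip)
print("documents per non-empty query: %.0f" % documents_per_query)
print("share of total time spent in recv: %.1f%%" % (100 * recv_share))
```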
Following the suggestions in What can I do to improve socket performance in Python 3?, I tried gevent.
I added this snippet at the very beginning of my script (before importing anything else):
from gevent import monkey
monkey.patch_all()
This made performance even slower...
*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times
1956879 function calls (1951292 primitive calls) in 158.260 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 427 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 158.170 158.170 hub.py:358(run)
1 0.000 0.000 158.170 158.170 {method 'run' of 'gevent.core.loop' objects}
2/1 1.286 0.643 158.166 158.166 PhraseVectorizer_1_1.py:671(topics_fit)
206272 0.198 0.000 156.670 0.001 cursor.py:1041(next)
601 0.192 0.000 156.203 0.260 cursor.py:944(_refresh)
534 0.029 0.000 156.008 0.292 cursor.py:796(__send_message)
534 0.012 0.000 150.514 0.282 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 150.439 0.282 mongo_client.py:768(_reset_on_error)
534 0.017 0.000 150.437 0.282 server.py:69(send_message_with_response)
551/535 0.002 0.000 150.316 0.281 pool.py:225(receive_message)
552/536 0.079 0.000 150.314 0.280 network.py:106(receive_message)
1104/1072 0.815 0.001 150.234 0.140 network.py:127(_receive_data_on_socket)
2427/2395 0.019 0.000 149.418 0.062 socket.py:381(recv)
608/592 0.003 0.000 48.541 0.082 socket.py:284(_wait)
552 0.885 0.002 5.464 0.010 helpers.py:88(_unpack_response)
552 4.475 0.008 4.577 0.008 {bson._cbson.decode_all}
3033 2.021 0.001 2.021 0.001 {method 'recv' of '_socket.socket' objects}
7/4 0.000 0.000 0.221 0.055 hub.py:189(_import)
4 0.127 0.032 0.221 0.055 {__import__}
67 0.104 0.002 0.202 0.003 {method 'sort' of 'list' objects}
536/535 0.003 0.000 0.142 0.000 topology.py:57(open)
537/536 0.002 0.000 0.139 0.000 topology.py:286(_ensure_opened)
1072/1071 0.003 0.000 0.138 0.000 periodic_executor.py:50(open)
537/536 0.001 0.000 0.136 0.000 server.py:33(open)
537/536 0.001 0.000 0.135 0.000 monitor.py:69(open)
20/19 0.000 0.000 0.132 0.007 topology.py:342(_update_servers)
4 0.000 0.000 0.131 0.033 hub.py:418(_get_resolver)
1 0.000 0.000 0.122 0.122 resolver_thread.py:13(__init__)
1 0.000 0.000 0.122 0.122 hub.py:433(_get_threadpool)
206187 0.081 0.000 0.101 0.000 objectid.py:68(__init__)
206187 0.100 0.000 0.100 0.000 database.py:339(_fix_outgoing)
206187 0.098 0.000 0.098 0.000 PhraseVectorizer_1_1.py:708(<lambda>)
1 0.073 0.073 0.093 0.093 threadpool.py:2(<module>)
2037 0.003 0.000 0.092 0.000 hub.py:159(get_hub)
2 0.000 0.000 0.090 0.045 thread.py:39(start_new_thread)
2 0.000 0.000 0.090 0.045 greenlet.py:195(spawn)
2 0.000 0.000 0.090 0.045 greenlet.py:74(__init__)
1 0.000 0.000 0.090 0.090 hub.py:259(__init__)
1102 0.004 0.000 0.078 0.000 pool.py:501(get_socket)
1068 0.005 0.000 0.074 0.000 server.py:135(get_socket)
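This performance is not acceptable for my application. I need it to be much faster: the timing and profiling above are for a subset of 20 documents, and I need to process tens of thousands.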
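To put that in perspective, here is a back-of-the-envelope extrapolation from the first profile (148.209 s for 20 documents, about 7.4 s per document). It assumes the run time scales linearly with the number of documents, which I have not verified; the target sizes are illustrative.

```python
def estimated_hours(measured_seconds, measured_docs, target_docs):
    """Linear extrapolation of run time, in hours (assumes constant cost per document)."""
    seconds_per_doc = measured_seconds / float(measured_docs)
    return seconds_per_doc * target_docs / 3600.0


# Illustrative target sizes; the measured values come from the first profile.
for target in (10000, 20000, 50000):
    hours = estimated_hours(148.209, 20, target)
    print("%6d documents: about %.1f hours" % (target, hours))
```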
Any ideas on how to speed this up?
Many thanks.
Edit: the code I profiled:
# also tried monkey patching all here, see profiler
from collections import OrderedDict

from pymongo import MongoClient

def topics_fit(self):
    client = MongoClient()
    # tried motor for multithreading - also slow
    #client = motor.motor_tornado.MotorClient()
    # initialize DB cursors
    db_wiki = client.wiki
    # initialize topic feature dictionary
    self.topics = OrderedDict()
    self.topic_mapping = OrderedDict()
    vocabulary_keys = self.vocabulary.keys()
    num_categories = 0
    for phrase in vocabulary_keys:
        phrase_tokens = phrase.split()
        if len(phrase_tokens) > 1:
            # query for current phrase (quoted, so it is matched as an exact phrase)
            AND_phrase = "\"" + phrase + "\""
            cursor = db_wiki.categories.find({ "$text" : { "$search": AND_phrase } },{ "score": { "$meta": "textScore" } })
            # pull every matching document into memory
            cursor = list(cursor)
            if cursor:
                # best text score first; only the top match is used
                cursor.sort(key=lambda k: k["score"], reverse = True)
                added_categories = cursor[0]["category_ids"]
                for added_category in added_categories:
                    if not (added_category in self.topics):
                        # first time this category is seen: assign it a new index
                        self.topics[added_category] = num_categories
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                        num_categories+=1
                    else:
                        # category already known: reuse its index
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
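Since the loop above only ever uses cursor[0] after sorting, one variation worth trying is to let the server sort by text score and return just the top document, with only the fields that are actually read. The sketch below is a hypothetical rewrite of the query step, not something I have measured: the helper names top_match_query and top_category_ids are made up for illustration, and it relies on PyMongo's cursor sort() accepting a {"$meta": "textScore"} sort key and on limit(). The server still has to score every match, so this may only cut the transfer and BSON decoding cost. Ties in score may also be broken differently from the client-side sort.

```python
def top_match_query(phrase):
    """Build the filter, projection and sort spec for the best exact-phrase match."""
    text_score = {"$meta": "textScore"}
    # Quoting the phrase asks the $text operator for an exact phrase match.
    query_filter = {"$text": {"$search": '"' + phrase + '"'}}
    # Return only the score and the field the caller reads.
    projection = {"score": text_score, "category_ids": 1, "_id": 0}
    sort_spec = [("score", text_score)]
    return query_filter, projection, sort_spec


def top_category_ids(collection, phrase):
    """Return category_ids of the best-scoring match, or [] if nothing matches."""
    query_filter, projection, sort_spec = top_match_query(phrase)
    cursor = collection.find(query_filter, projection).sort(sort_spec).limit(1)
    for doc in cursor:
        return doc.get("category_ids", [])
    return []
```

Inside topics_fit this would replace the find / list / sort steps, for example added_categories = top_category_ids(db_wiki.categories, phrase), followed by the same bookkeeping when the returned list is non-empty.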
Edit 2: output of index_information():
{u'_id_':
{u'ns': u'wiki.categories', u'key': [(u'_id', 1)], u'v': 1},
u'article_title_text_article_body_text_category_names_text': {u'default_language': u'english', u'weights': SON([(u'article_body', 1), (u'article_title', 1), (u'category_names', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'ns': u'wiki.categories', u'textIndexVersion': 2}}