Question

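I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro. I have a database of articles, each with its associated topics.

I am building an application that queries the database for text phrases and associates article topics with each queried phrase. Topics are assigned based on how relevant the phrase is to each article.

The bottleneck appears to be Python socket communication with localhost.

Here is my cProfile output: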

    topics_fit (PhraseVectorizer_1_1.py:668)
    function called 1 times

         1930698 function calls (1929630 primitive calls) in 148.209 seconds

    Ordered by: cumulative time, internal time, call count
    List reduced from 286 to 40 due to restriction <40>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.224    1.224  148.209  148.209 PhraseVectorizer_1_1.py:668(topics_fit)
    206272    0.193    0.000  146.780    0.001 cursor.py:1041(next)
      601    0.189    0.000  146.455    0.244 cursor.py:944(_refresh)
      534    0.030    0.000  146.263    0.274 cursor.py:796(__send_message)
      534    0.009    0.000  141.532    0.265 mongo_client.py:725(_send_message_with_response)
      534    0.002    0.000  141.484    0.265 mongo_client.py:768(_reset_on_error)
      534    0.019    0.000  141.482    0.265 server.py:69(send_message_with_response)
      534    0.002    0.000  141.364    0.265 pool.py:225(receive_message)
      535    0.083    0.000  141.362    0.264 network.py:106(receive_message)
     1070    1.202    0.001  141.278    0.132 network.py:127(_receive_data_on_socket)
     3340  140.074    0.042  140.074    0.042 {method 'recv' of '_socket.socket' objects}
      535    0.778    0.001    4.700    0.009 helpers.py:88(_unpack_response)
      535    3.828    0.007    3.920    0.007 {bson._cbson.decode_all}
       67    0.099    0.001    0.196    0.003 {method 'sort' of 'list' objects}
    206187    0.096    0.000    0.096    0.000 PhraseVectorizer_1_1.py:705(<lambda>)
    206187    0.096    0.000    0.096    0.000 database.py:339(_fix_outgoing)
    206187    0.074    0.000    0.092    0.000 objectid.py:68(__init__)
     1068    0.005    0.000    0.054    0.000 server.py:135(get_socket)
  1068/534    0.010    0.000    0.041    0.000 contextlib.py:21(__exit__)
      1068    0.004    0.000    0.041    0.000 pool.py:501(get_socket)
       534    0.003    0.000    0.028    0.000 pool.py:208(send_message)
       534    0.009    0.000    0.026    0.000 pool.py:573(return_socket)
       567    0.001    0.000    0.026    0.000 socket.py:227(meth)
      535    0.024    0.000    0.024    0.000 {method 'sendall' of '_socket.socket' objects}
      534    0.003    0.000    0.023    0.000 topology.py:134(select_server)
   206806    0.020    0.000    0.020    0.000 collection.py:249(database)
   418997    0.019    0.000    0.019    0.000 {len}
      449    0.001    0.000    0.018    0.000 topology.py:143(select_server_by_address)
      534    0.005    0.000    0.018    0.000 topology.py:82(select_servers)
     1068/534    0.001    0.000    0.018    0.000 contextlib.py:15(__enter__)
      534    0.002    0.000    0.013    0.000 thread_util.py:83(release)
   207307    0.010    0.000    0.011    0.000 {isinstance}
      534    0.005    0.000    0.011    0.000 pool.py:538(_get_socket_no_auth)
      534    0.004    0.000    0.011    0.000 thread_util.py:63(release)
      534    0.001    0.000    0.011    0.000 mongo_client.py:673(_get_topology)
      535    0.003    0.000    0.010    0.000 topology.py:57(open)
   206187    0.008    0.000    0.008    0.000 {method 'popleft' of 'collections.deque' objects}
      535    0.002    0.000    0.007    0.000 topology.py:327(_apply_selector)
      536    0.003    0.000    0.007    0.000 topology.py:286(_ensure_opened)
     1071    0.004    0.000    0.007    0.000 periodic_executor.py:50(open)

In particular, {method 'recv' of '_socket.socket' objects} seems to be causing the trouble.
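As a rough reading of the first profile above, the figures can be turned into per-query numbers. This is only arithmetic on the listed values, under one assumption: `list.sort` is called once per phrase that returned results (as in the code later in this post), so its call count of 67 stands in for the number of non-empty queries.

```python
# Figures copied from the first cProfile listing.
recv_seconds = 140.074      # tottime of {method 'recv' of '_socket.socket' objects}
total_seconds = 148.209     # cumulative time of topics_fit
documents = 206187          # calls to the sort-key lambda, one per fetched document
server_round_trips = 534    # calls to cursor.py __send_message
# Assumption: one list.sort call per phrase that returned at least one document.
queries_with_results = 67

recv_share = recv_seconds / total_seconds
docs_per_query = documents / float(queries_with_results)
round_trips_per_query = server_round_trips / float(queries_with_results)
seconds_per_query = recv_seconds / queries_with_results

print("share of run time spent in recv: %.1f%%" % (100 * recv_share))
print("documents fetched per query:     %.0f" % docs_per_query)
print("server round trips per query:    %.1f" % round_trips_per_query)
print("seconds in recv per query:       %.2f" % seconds_per_query)
```

Read this way, each query pulls a few thousand documents over the socket in several batches, so the time in `recv` may reflect waiting on the text search and the volume of data returned rather than socket overhead as such.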

Following the suggestion in "What can I do to improve socket performance in Python 3?", I tried gevent.

I added this snippet at the very top of my script (before importing anything else):

from gevent import monkey
monkey.patch_all()

This made performance even slower:

*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times

         1956879 function calls (1951292 primitive calls) in 158.260 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 427 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  158.170  158.170 hub.py:358(run)
        1    0.000    0.000  158.170  158.170 {method 'run' of 'gevent.core.loop' objects}
      2/1    1.286    0.643  158.166  158.166 PhraseVectorizer_1_1.py:671(topics_fit)
   206272    0.198    0.000  156.670    0.001 cursor.py:1041(next)
      601    0.192    0.000  156.203    0.260 cursor.py:944(_refresh)
      534    0.029    0.000  156.008    0.292 cursor.py:796(__send_message)
      534    0.012    0.000  150.514    0.282 mongo_client.py:725(_send_message_with_response)
      534    0.002    0.000  150.439    0.282 mongo_client.py:768(_reset_on_error)
      534    0.017    0.000  150.437    0.282 server.py:69(send_message_with_response)
  551/535    0.002    0.000  150.316    0.281 pool.py:225(receive_message)
  552/536    0.079    0.000  150.314    0.280 network.py:106(receive_message)
1104/1072    0.815    0.001  150.234    0.140 network.py:127(_receive_data_on_socket)
2427/2395    0.019    0.000  149.418    0.062 socket.py:381(recv)
  608/592    0.003    0.000   48.541    0.082 socket.py:284(_wait)
      552    0.885    0.002    5.464    0.010 helpers.py:88(_unpack_response)
      552    4.475    0.008    4.577    0.008 {bson._cbson.decode_all}
     3033    2.021    0.001    2.021    0.001 {method 'recv' of '_socket.socket' objects}
      7/4    0.000    0.000    0.221    0.055 hub.py:189(_import)
        4    0.127    0.032    0.221    0.055 {__import__}
       67    0.104    0.002    0.202    0.003 {method 'sort' of 'list' objects}
  536/535    0.003    0.000    0.142    0.000 topology.py:57(open)
  537/536    0.002    0.000    0.139    0.000 topology.py:286(_ensure_opened)
1072/1071    0.003    0.000    0.138    0.000 periodic_executor.py:50(open)
  537/536    0.001    0.000    0.136    0.000 server.py:33(open)
  537/536    0.001    0.000    0.135    0.000 monitor.py:69(open)
    20/19    0.000    0.000    0.132    0.007 topology.py:342(_update_servers)
        4    0.000    0.000    0.131    0.033 hub.py:418(_get_resolver)
        1    0.000    0.000    0.122    0.122 resolver_thread.py:13(__init__)
        1    0.000    0.000    0.122    0.122 hub.py:433(_get_threadpool)
   206187    0.081    0.000    0.101    0.000 objectid.py:68(__init__)
   206187    0.100    0.000    0.100    0.000 database.py:339(_fix_outgoing)
   206187    0.098    0.000    0.098    0.000 PhraseVectorizer_1_1.py:708(<lambda>)
        1    0.073    0.073    0.093    0.093 threadpool.py:2(<module>)
     2037    0.003    0.000    0.092    0.000 hub.py:159(get_hub)
        2    0.000    0.000    0.090    0.045 thread.py:39(start_new_thread)
        2    0.000    0.000    0.090    0.045 greenlet.py:195(spawn)
        2    0.000    0.000    0.090    0.045 greenlet.py:74(__init__)
        1    0.000    0.000    0.090    0.090 hub.py:259(__init__)
     1102    0.004    0.000    0.078    0.000 pool.py:501(get_socket)
     1068    0.005    0.000    0.074    0.000 server.py:135(get_socket)
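Monkey patching makes sockets cooperative, but the loop still issues one query at a time, so there is nothing for gevent to overlap. A minimal sketch of issuing the per-phrase queries concurrently instead, assuming the queries are independent of each other. The names `lookup_all`, `lookup`, and `workers` are hypothetical; `ThreadPool` comes from the standard library, and `MongoClient` is documented as thread-safe with built-in connection pooling.

```python
from multiprocessing.pool import ThreadPool


def lookup_all(phrases, lookup, workers=8):
    """Run lookup(phrase) for every phrase on a small thread pool.

    `lookup` is a hypothetical callable that performs one database query
    and returns its result; `workers` is an arbitrary illustrative value.
    Returns a list of (phrase, result) pairs in the original phrase order.
    """
    phrases = list(phrases)
    pool = ThreadPool(workers)
    try:
        # map() blocks until all lookups finish and preserves input order.
        results = pool.map(lookup, phrases)
    finally:
        pool.close()
        pool.join()
    return list(zip(phrases, results))
```

The dictionary updates would then run in the main thread over the returned pairs. Whether this helps depends on where the time goes: if the server itself is busy evaluating the text search, concurrent clients may gain little.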

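The performance shown in these profiles is unacceptable for my application; I need it to be much faster (the timing and profiling are for a subset of 20 documents, and I need to process tens of thousands).

Any ideas on how to speed this up?

Many thanks.

Edit: the code segment that I profiled: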

# also tried monkey patching all here, see profiler

from collections import OrderedDict

from pymongo import MongoClient

def topics_fit(self):

    client = MongoClient()
    # tried motor for multithreading - also slow
    #client = motor.motor_tornado.MotorClient()

    # initialize DB cursors
    db_wiki = client.wiki

    # initialize topic feature dictionary
    self.topics = OrderedDict()
    self.topic_mapping = OrderedDict()

    vocabulary_keys = self.vocabulary.keys()

    num_categories = 0

    for phrase in vocabulary_keys:

        phrase_tokens = phrase.split()

        if len(phrase_tokens) > 1:

            # query for current phrase
            AND_phrase = "\"" + phrase + "\""

            cursor = db_wiki.categories.find({ "$text" : { "$search": AND_phrase } },{ "score": { "$meta": "textScore" } })
            cursor = list(cursor)

            if cursor:
                cursor.sort(key=lambda k: k["score"], reverse = True)
                added_categories = cursor[0]["category_ids"]
                for added_category in added_categories:
                    if not (added_category in self.topics):
                        self.topics[added_category] = num_categories
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                        num_categories+=1
                    else:
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
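The loop above pulls every matching document to the client, sorts the list in Python, and then reads only `category_ids` from the first element. A minimal sketch of a variant that asks the server for just that one document, assuming only the top-scoring match is needed. The helper name `top_category_ids` is hypothetical; `find`, `sort`, `limit`, and sorting on `{"$meta": "textScore"}` are standard PyMongo/MongoDB features.

```python
def top_category_ids(collection, phrase):
    """Return category_ids of the best text-search match for an exact phrase.

    `collection` is expected to behave like a PyMongo collection that has a
    text index. Returns an empty list when nothing matches.
    """
    # Quoting the phrase requests an exact-phrase match, as in the original.
    query = {"$text": {"$search": '"' + phrase + '"'}}
    # Return only the score and the field that is actually used.
    projection = {"score": {"$meta": "textScore"}, "category_ids": 1, "_id": 0}
    cursor = (collection.find(query, projection)
              .sort([("score", {"$meta": "textScore"})])  # best score first
              .limit(1))                                   # one document only
    for doc in cursor:
        return doc.get("category_ids", [])
    return []
```

Inside `topics_fit` this would stand in for the `find`/`list`/`sort` lines, for example `added_categories = top_category_ids(db_wiki.categories, phrase)`. It is untested against this data set, and documents with equal scores may be ordered differently than by the client-side sort.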

Edit 2: output of index_information():

{u'_id_': 
    {u'ns': u'wiki.categories', u'key': [(u'_id', 1)], u'v': 1},  
    u'article_title_text_article_body_text_category_names_text': {u'default_language': u'english', u'weights': SON([(u'article_body', 1), (u'article_title', 1), (u'category_names', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'ns': u'wiki.categories', u'textIndexVersion': 2}}

Improving MongoDB client (socket) performance
