Is CountVectorizer thread-safe?

Time: 2018-11-25 08:50:41

Tags: python scikit-learn thread-safety countvectorizer

I am using scikit-learn's CountVectorizer (with 4-word n-grams) in my project: I load an already-fitted model and use it to transform text. The methods I call are transform, inverse_transform and vocabulary_.keys().
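
Roughly, the usage looks like this (a minimal sketch; the joblib loading and the file name are just placeholders for however the fitted model is actually persisted):

import joblib  # assumption: the fitted vectorizer was saved with joblib; the file name below is made up

# load the already-fitted 4-gram CountVectorizer
vectorizer = joblib.load("ngram_vectorizer.joblib")

X = vectorizer.transform(["some input text"])      # sparse document-term matrix
terms = vectorizer.inverse_transform(X)[0]         # the model's n-grams that occur in the text
vocab = set(vectorizer.vocabulary_.keys())         # all n-grams the model was fitted on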

While using it I noticed that inverse_transform takes most of the time (about 2.7 seconds per instance), while the whole process takes about 3 seconds. I then plugged the model into a class that uses many threads (about 30) so it can run alongside other models. Checking only the time taken by this same process, I found it can climb to 15 seconds.
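
To be concrete, the multi-threaded usage is roughly shaped like this (a simplified sketch, not my real class; vectorizer and texts come from the surrounding project):

from concurrent.futures import ThreadPoolExecutor
import time

def run_prediction(text):
    # the same two calls that dominate the single-threaded timing
    X = vectorizer.transform([text])
    return vectorizer.inverse_transform(X)[0]

start = time.time()
# roughly the ~30 threads mentioned above; the real class also runs other models
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(run_prediction, texts))
print("elapsed: %s seconds" % (time.time() - start))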

My questions are:

  1. Why does the process take so much longer inside the class?

  2. Is CountVectorizer thread-safe? I could not find a clear answer to this.

  3. If I replaced the inverse_transform call with another approach, would that make the process inside the class faster? (For example, something like the sketch below.)
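
This is the kind of replacement I have in mind (an untested sketch: it builds a term lookup from vocabulary_ and reads the non-zero columns of the transformed row directly instead of calling inverse_transform):

import numpy as np

# build an index -> n-gram lookup once from the fitted vocabulary
feature_names = np.empty(len(vectorizer.vocabulary_), dtype=object)
for term, idx in vectorizer.vocabulary_.items():
    feature_names[idx] = term

X = vectorizer.transform([text])    # a single-row sparse (CSR) matrix
terms = feature_names[X.indices]    # n-grams with non-zero counts, i.e. what inverse_transform(X)[0] returns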

Thanks

EDIT

Here is the code:

import time  # used for the timing prints below

# vectorizer (the already-fitted 4-gram CountVectorizer), labels, texts, start_time,
# remove_religious_stopwords and max_non_religious are defined elsewhere in the project
for true_label, text in zip(labels, texts):
   splitted_text = text.split(" ")  # split the text into words so matched features can be removed later
   boolean_text = [False] * len(splitted_text)  # boolean list where the ith element marks whether the ith word is covered by a matched n-gram

   ngrams = list(map(lambda x: ' '.join(x), zip(splitted_text, splitted_text[1:], splitted_text[2:], splitted_text[3:])))
   doc_vocab = set(ngrams)
   non_religious_tokens = doc_vocab - set(vectorizer.vocabulary_.keys())
   print("non religious tokens:", non_religious_tokens)

   if len(splitted_text) > 3 and non_religious_tokens == set():
       print("Religious")
       print("--- prediction time: %s seconds ---" % (time.time() - start_time))
       continue

   print("******* 2nd Filter ********")
   # transform text
   vectorized_data = vectorizer.transform([text])
   terms = vectorizer.inverse_transform(vectorized_data)  # get features of text
   terms = terms[0]

   maybe_religious = not (len(terms) == 0)
   # for each feature
   for term in terms:

        splitted_term = term.split(" ")  # split the term (four words in our case)

       # find 1st word index in text
       # match the other 3 words
        # set their indices to True
       for i, word in enumerate(splitted_text):
           try:
                # find the span of text that matches this feature and mark all its words as True
                # (so they're removed later)
               if splitted_term[0] == splitted_text[i] and splitted_term[1] == splitted_text[i+1] and \
                    splitted_term[2] == splitted_text[i+2] and splitted_term[3] == splitted_text[i+3]:

                   boolean_text[i] = True
                   boolean_text[i+1] = True
                   boolean_text[i+2] = True
                   boolean_text[i+3] = True
           except IndexError:
               pass



   non_religious_text = ""
   # store non religious text (didn't match any feature in the text)
   for i, b in enumerate(boolean_text):
       if b is False:
           non_religious_text += " " + splitted_text[i]

   non_religious_text = remove_religious_stopwords(non_religious_text)

   # if all the features are removed, it's religious!
   if maybe_religious and (non_religious_text == "" or non_religious_text is None):
       print("Religious")
   # if the number of words remaining are lower than the limit
   # it's religious
   elif maybe_religious and (len(non_religious_text.split(" "))-1 <= max_non_religious):
       print("Religious")
    # otherwise it's non religious
   else:
       print("Non Religious")


   print("--- prediction time: %s seconds ---" % (time.time() - start_time))

0 Answers:

No answers yet