Is CountVectorizer thread-safe?

Time: 2018-11-25 08:50:41

Tags: python scikit-learn thread-safety countvectorizer

I am using scikit-learn's CountVectorizer (with 4-word n-grams) in my project: I load an already-fitted model and use it to transform text. The methods I call are transform, inverse_transform and vocabulary_.keys().
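
Roughly, the usage looks like this (a minimal sketch; the joblib loading and the file name are just placeholders for however the fitted model is actually persisted):

import joblib  # assumption: the fitted vectorizer was saved with joblib; the file name below is made up

# load the already-fitted 4-gram CountVectorizer
vectorizer = joblib.load("ngram_vectorizer.joblib")

X = vectorizer.transform(["some input text"])      # sparse document-term matrix
terms = vectorizer.inverse_transform(X)[0]         # the model's n-grams that occur in the text
vocab = set(vectorizer.vocabulary_.keys())         # all n-grams the model was fitted on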

While using it I noticed that inverse_transform takes most of the time (about 2.7 seconds per instance), while the whole process takes about 3 seconds. I then plugged the model into a class that uses many threads (about 30) so it can run alongside other models. Checking only the time taken by this same process, I found it can climb to 15 seconds.
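
To be concrete, the multi-threaded usage is roughly shaped like this (a simplified sketch, not my real class; vectorizer and texts come from the surrounding project):

from concurrent.futures import ThreadPoolExecutor
import time

def run_prediction(text):
    # the same two calls that dominate the single-threaded timing
    X = vectorizer.transform([text])
    return vectorizer.inverse_transform(X)[0]

start = time.time()
# roughly the ~30 threads mentioned above; the real class also runs other models
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(run_prediction, texts))
print("elapsed: %s seconds" % (time.time() - start))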

My questions are:

  1. Why does the process take so much longer inside the class?

  2. Is CountVectorizer thread-safe? I could not find a clear answer to this.

  3. If I replaced the inverse_transform call with another approach, would that make the process inside the class faster? (For example, something like the sketch below.)
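
This is the kind of replacement I have in mind (an untested sketch: it builds a term lookup from vocabulary_ and reads the non-zero columns of the transformed row directly instead of calling inverse_transform):

import numpy as np

# build an index -> n-gram lookup once from the fitted vocabulary
feature_names = np.empty(len(vectorizer.vocabulary_), dtype=object)
for term, idx in vectorizer.vocabulary_.items():
    feature_names[idx] = term

X = vectorizer.transform([text])    # a single-row sparse (CSR) matrix
terms = feature_names[X.indices]    # n-grams with non-zero counts, i.e. what inverse_transform(X)[0] returns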

Thanks

EDIT

Here is the code:

import time  # used for the timing prints below

# vectorizer (the already-fitted 4-gram CountVectorizer), labels, texts, start_time,
# remove_religious_stopwords and max_non_religious are defined elsewhere in the project
for true_label, text in zip(labels, texts):
   splitted_text = text.split(" ")  # split the text into words so matched features can be removed later
   boolean_text = [False] * len(splitted_text)  # boolean list where the ith element marks whether the ith word is covered by a matched n-gram

   ngrams = list(map(lambda x: ' '.join(x), zip(splitted_text, splitted_text[1:], splitted_text[2:], splitted_text[3:])))
   doc_vocab = set(ngrams)
   non_religious_tokens = doc_vocab - set(vectorizer.vocabulary_.keys())
   print("non religious tokens:", non_religious_tokens)

   if len(splitted_text) > 3 and non_religious_tokens == set():
       print("Religious")
       print("--- prediction time: %s seconds ---" % (time.time() - start_time))
       continue

   print("******* 2nd Filter ********")
   # transform text
   vectorized_data = vectorizer.transform([text])
   terms = vectorizer.inverse_transform(vectorized_data)  # get features of text
   terms = terms[0]

   maybe_religious = not (len(terms) == 0)
   # for each feature
   for term in terms:

        splitted_term = term.split(" ")  # split the term (four words in our case)

       # find 1st word index in text
       # match the other 3 words
        # set their indices to True
       for i, word in enumerate(splitted_text):
           try:
                # find the span of text that matches this feature and mark all its words as True
                # (so they're removed later)
               if splitted_term[0] == splitted_text[i] and splitted_term[1] == splitted_text[i+1] and \
                    splitted_term[2] == splitted_text[i+2] and splitted_term[3] == splitted_text[i+3]:

                   boolean_text[i] = True
                   boolean_text[i+1] = True
                   boolean_text[i+2] = True
                   boolean_text[i+3] = True
           except IndexError:
               pass



   non_religious_text = ""
   # store non religious text (didn't match any feature in the text)
   for i, b in enumerate(boolean_text):
       if b is False:
           non_religious_text += " " + splitted_text[i]

   non_religious_text = remove_religious_stopwords(non_religious_text)

   # if all the features are removed, it's religious!
   if maybe_religious and (non_religious_text == "" or non_religious_text is None):
       print("Religious")
   # if the number of words remaining are lower than the limit
   # it's religious
   elif maybe_religious and (len(non_religious_text.split(" "))-1 <= max_non_religious):
       print("Religious")
    # otherwise it's non religious
   else:
       print("Non Religious")


   print("--- prediction time: %s seconds ---" % (time.time() - start_time))

0 Answers:

No answers yet