Why is a local path saved into the object's pickle file

Time: 2019-04-15 12:25:53

Tags: python pickle rasa-nlu

I am saving an object of a class to a *.pkl file via cloudpickle, but in the saved *.pkl file I found the full path of one of my local files embedded in the binary bytes.

I noticed this while working on the RASA_NLU open-source platform; my Python version is 3.5.6. I have tried googling and digging through the RASA_NLU source code, but could not find the root cause.
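
The behavior is easy to reproduce outside rasa. Here is a minimal sketch (class and attribute names are hypothetical; run it as a standalone script so cloudpickle has to serialize the class by value): give an object an attribute that is a bound method, dump it with cloudpickle, and the path of the defining file shows up in the bytes.

import os
import cloudpickle

class Holder:
    def tokenize(self, text):
        return text.split()

    def train(self):
        # Store a bound method on the instance, just like
        # tokenizer=self._tokenizer does in the rasa code below.
        self.callback = self.tokenize

obj = Holder()
obj.train()
payload = cloudpickle.dumps(obj)

# The script's own file name is embedded in the serialized bytes.
print(os.path.basename(__file__).encode() in payload)  # True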

The CountVectorsFeaturizer class is defined in /home//rasa_nlu/rasa_nlu/featurizers/count_vectors_featurizer.py:

22 class CountVectorsFeaturizer(Featurizer):
........
........
138     def _tokenizer(self, text):
139         """Override tokenizer in CountVectorizer"""
140         text = re.sub(r'\b[0-9]+\b', '__NUMBER__', text)
141
142         token_pattern = re.compile(self.token_pattern)
143         tokens = token_pattern.findall(text)
144
145         if self.OOV_token:
146             if hasattr(self.vect, 'vocabulary_'):
147                 # CountVectorizer is trained, process for prediction
148                 if self.OOV_token in self.vect.vocabulary_:
149                     tokens = [
150                         t if t in self.vect.vocabulary_.keys()
151                         else self.OOV_token for t in tokens
152                     ]
153             elif self.OOV_words:
154                 # CountVectorizer is not trained, process for train
155                 tokens = [
156                     self.OOV_token if t in self.OOV_words else t
157                     for t in tokens
158                 ]
159
160         return tokens
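
To see what the number-masking and token_pattern steps do in isolation, here is a standalone sketch using the default pattern (?u)\b\w\w+\b (the same one that appears in the pickled attributes further down):

import re

token_pattern = r"(?u)\b\w\w+\b"

text = re.sub(r"\b[0-9]+\b", "__NUMBER__", "call 911 now")
print(text)                                     # call __NUMBER__ now
print(re.compile(token_pattern).findall(text))  # ['call', '__NUMBER__', 'now']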


183     def train(self, training_data, cfg=None, **kwargs):
184         # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
185         """Take parameters from config and
186             construct a new count vectorizer using the sklearn framework."""
187         from sklearn.feature_extraction.text import CountVectorizer
188
189         spacy_nlp = kwargs.get("spacy_nlp")
190         if spacy_nlp is not None:
191             # create spacy lemma_ for OOV_words
192             self.OOV_words = [t.lemma_
193                               for w in self.OOV_words
194                               for t in spacy_nlp(w)]
195
196         self.vect = CountVectorizer(token_pattern=self.token_pattern,
197                                     strip_accents=self.strip_accents,
198                                     lowercase=self.lowercase,
199                                     stop_words=self.stop_words,
200                                     ngram_range=(self.min_ngram,
201                                                  self.max_ngram),
202                                     max_df=self.max_df,
203                                     min_df=self.min_df,
204                                     max_features=self.max_features,
205                                     tokenizer=self._tokenizer)
206
207         lem_exs = [self._get_message_text(example)
208                    for example in training_data.intent_examples]
209
210         self._check_OOV_present(lem_exs)
211
212         try:
213             # noinspection PyPep8Naming
214             X = self.vect.fit_transform(lem_exs).toarray()
215         except ValueError:
216             self.vect = None
217             return
218
219         for i, example in enumerate(training_data.intent_examples):
220             # create bag for each example
221             example.set("text_features",
222                   self._combine_with_existing_text_features(example, X[i]))

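Note that line 205 hands CountVectorizer a bound method, so after train() the fitted vectorizer holds a reference back to the whole featurizer object. A quick check (featurizer below stands for a hypothetical trained instance):

# `featurizer` is a hypothetical trained CountVectorsFeaturizer instance.
print(featurizer.vect.tokenizer)
# <bound method CountVectorsFeaturizer._tokenizer of <...CountVectorsFeaturizer object at 0x...>>
print(featurizer.vect.tokenizer.__self__ is featurizer)  # True: the method carries `self` with it
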
The class above customizes sklearn's existing CountVectorizer with a few changes, such as the tokenizer you can see at line 205. After training, the instance object of this class is saved to *.pkl:

239     def persist(self, model_dir):
240         # type: (Text) -> Dict[Text, Any]
241         """Persist this model into the passed directory.
242         Returns the metadata necessary to load the model again."""
243
244         featurizer_file = os.path.join(model_dir, self.name + ".pkl")
245         utils.pycloud_pickle(featurizer_file, self)
246         return {"featurizer_file": self.name + ".pkl"}
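
utils.pycloud_pickle here is a rasa helper. Assuming it is just a thin wrapper around cloudpickle.dump (an assumption, not verified against the rasa source), it would look roughly like:

import io
import cloudpickle

def pycloud_pickle(file_name, obj):
    # Sketch of the assumed helper: serialize obj by value with
    # cloudpickle and write the bytes to disk.
    with io.open(file_name, "wb") as f:
        cloudpickle.dump(obj, f)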

But in the generated *.pkl I found the full path of a local file saved inside it:

  4 MethodType~T~E~TR~Th,~L^N_fill_function~T~S~T(h,~L^O_make_skel_func~T~S~Th.~L^HCodeType~T~E~TR~T(K^BK^@K^DK^DK^CCtt^@j^Ad^Ad^B|^A~C^C}^At^@j^B~H^@j^C~C^A}^B|^Bj^D|^A~C^A}^C~H^@j^Erpt^F~H^@j^Gd^C~C^BrX~H^@j^E~H^@j^Gj^Hk^Frp~G^@f^Ad^Dd^E~D^H|^CD^@~C^A}^Cn^X~H^@j    rp~G^@f^Ad^Fd^E~D^H|^CD^@~C^A}^C|^CS^@~T(~L%Override tokenizer in CountVectorizer~T~L
  5 \b[0-9]+\b~T~L
  6 __NUMBER__~T~L^Kvocabulary_~Th8(K^AK^@K^BK^DK^SC&g^@|^@]^^}^A|^A~H^@j^@j^Aj^B~C^@k^Fr^\|^An^D~H^@j^C~Q^Bq^DS^@~T)(h^^h=~L^Dkeys~Th^Xt~T~L^B.0~T~L^At~T~F~T~L~Y/home/<my_local_path>/rasa_nlu/rasa_nlu/featurizers/count_vectors_featurizer.py~T~L
  7 <listcomp>~TK~VC^B^F^A~T~L^Dself~T~E~T)t~TR~T~L5CountVectorsFeaturizer._tokenizer.<locals>.<listcomp>~Th8(K^AK^@K^BK^DK^SC g^@|^@]^X}^A|^A~H^@j^@k^Fr^X~H^@j^An    ^B|^A~Q^Bq^DS^@~T)h^Yh^X~F~ThAhB~F~ThDhEK~\C^B^F^A~ThG~E~T)t~TR~Tt~T(~L^Bre~T~L^Csub~T~L^Gcompile~Th^G~L^Gfindall~Th^X~L^Ghasattr~Th^^h=h^Yt~T(hG~L^Dtext~Th^G~L^Ftokens~Tt~ThD~L
  8 _tokenizer~TK~JC^X^@^B^N^B^L^A
  9 ^B^F^A^L^B^N^B

I tried printing the contents of the generated *.pkl; here it is:

{'OOV_token': None,
 'OOV_words': [],
 'component_config': {'OOV_token': None,
                      'OOV_words': [],
                      'lowercase': True,
                      'max_df': 1.0,
                      'max_features': None,
                      'max_ngram': 2,
                      'min_df': 0.0,
                      'min_ngram': 1,
                      'name': 'intent_featurizer_count_vectors',
                      'stop_words': ['how',
                                     'what',
                                     'hows',
                                     'is',
                                     'the',
                                     'whats'],
                      'strip_accents': None,
                      'token_pattern': '(?u)\\b\\w\\w+\\b'},
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'max_ngram': 2,
 'min_df': 0.0,
 'min_ngram': 1,
 'stop_words': ['how', 'what', 'hows', 'is', 'the', 'whats'],
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=['how', 'what', 'hows', 'is', 'the', 'whats'],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method CountVectorsFeaturizer._tokenizer of <rasa_nlu.featurizers.count_vectors_featurizer.CountVectorsFeaturizer object at 0x7ffff67f96a0>>,
        vocabulary=None)}
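
Notice the tokenizer=<bound method ...> entry above. To confirm where the path lives in the artifact, the pickle bytes can be scanned or disassembled directly; the standard-library pickletools module works on cloudpickle output because it is an ordinary pickle stream:

import pickletools

with open("intent_featurizer_count_vectors.pkl", "rb") as f:
    data = f.read()

# Crude check: the source file's name is embedded verbatim.
print(b"count_vectors_featurizer.py" in data)  # True

# Full opcode disassembly: look for the unicode opcodes carrying
# the co_filename strings of the serialized code objects.
pickletools.dis(data)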

I am trying to understand why the local path gets saved here. My guess is that it is caused by the tokenizer callable at line 205, but I don't know why.
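
For context on that guess: every Python function's code object records the file it was defined in (co_filename), and when cloudpickle cannot import a function by name it serializes the function by value, code object included. That matches the _fill_function / _make_skel_func / CodeType markers visible in the raw dump above, which are cloudpickle's by-value function serialization helpers. A small demo (run as a standalone script, where the function is only reachable via __main__ and is therefore pickled by value):

import cloudpickle

def some_func(x):
    return x + 1

# The code object records where the function was defined.
print(some_func.__code__.co_filename)  # e.g. /home/user/demo.py

# cloudpickle embeds the code object, co_filename and all.
payload = cloudpickle.dumps(some_func)
print(some_func.__code__.co_filename.encode() in payload)  # True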

Hope someone can help me, thanks.

1 Answer:

Answer 0: (score: 0)

This problem has been fixed in newer versions of rasa (see here for the code). So please consider upgrading to Rasa 1.x, e.g. by doing pip install rasa
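
For reference, a common shape for this kind of fix (hypothetical helper names; not necessarily the exact change rasa made) is to persist only the learned vocabulary, a plain dict that contains no code objects, and rebuild the vectorizer with a live tokenizer on load:

import pickle
from sklearn.feature_extraction.text import CountVectorizer

def persist_vocab(vect, path):
    # The fitted vocabulary is a plain {token: index} dict, so no
    # source path or code object ends up in the artifact.
    with open(path, "wb") as f:
        pickle.dump(vect.vocabulary_, f)

def load_vect(path, tokenizer):
    with open(path, "rb") as f:
        vocab = pickle.load(f)
    # The tokenizer comes from live, imported code instead of the pickle.
    return CountVectorizer(vocabulary=vocab, tokenizer=tokenizer)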