I am saving a class instance to a *.pkl file with cloudpickle, but inside the saved *.pkl I found the full path of one of my local files embedded in the binary bytes.
I found this issue while working on the RASA_NLU open-source platform; my Python version is 3.5.6. I have tried googling and digging through the RASA_NLU source code, but could not find the root cause.
The CountVectorsFeaturizer class is defined in /home//rasa_nlu/rasa_nlu/featurizers/count_vectors_featurizer.py:

22 class CountVectorsFeaturizer(Featurizer):
........
........
138     def _tokenizer(self, text):
139         """Override tokenizer in CountVectorizer"""
140         text = re.sub(r'\b[0-9]+\b', '__NUMBER__', text)
141
142         token_pattern = re.compile(self.token_pattern)
143         tokens = token_pattern.findall(text)
144
145         if self.OOV_token:
146             if hasattr(self.vect, 'vocabulary_'):
147                 # CountVectorizer is trained, process for prediction
148                 if self.OOV_token in self.vect.vocabulary_:
149                     tokens = [
150                         t if t in self.vect.vocabulary_.keys()
151                         else self.OOV_token for t in tokens
152                     ]
153             elif self.OOV_words:
154                 # CountVectorizer is not trained, process for train
155                 tokens = [
156                     self.OOV_token if t in self.OOV_words else t
157                     for t in tokens
158                 ]
159
160         return tokens
183     def train(self, training_data, cfg=None, **kwargs):
184         # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
185         """Take parameters from config and
186         construct a new count vectorizer using the sklearn framework."""
187         from sklearn.feature_extraction.text import CountVectorizer
188
189         spacy_nlp = kwargs.get("spacy_nlp")
190         if spacy_nlp is not None:
191             # create spacy lemma_ for OOV_words
192             self.OOV_words = [t.lemma_
193                               for w in self.OOV_words
194                               for t in spacy_nlp(w)]
195
196         self.vect = CountVectorizer(token_pattern=self.token_pattern,
197                                     strip_accents=self.strip_accents,
198                                     lowercase=self.lowercase,
199                                     stop_words=self.stop_words,
200                                     ngram_range=(self.min_ngram,
201                                                  self.max_ngram),
202                                     max_df=self.max_df,
203                                     min_df=self.min_df,
204                                     max_features=self.max_features,
205                                     tokenizer=self._tokenizer)

207         lem_exs = [self._get_message_text(example)
208                    for example in training_data.intent_examples]
209
210         self._check_OOV_present(lem_exs)
211
212         try:
213             # noinspection PyPep8Naming
214             X = self.vect.fit_transform(lem_exs).toarray()
215         except ValueError:
216             self.vect = None
217             return
218
219         for i, example in enumerate(training_data.intent_examples):
220             # create bag for each example
221             example.set("text_features",
222                         self._combine_with_existing_text_features(example, X[i]))
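As a standalone sketch, the number-masking behavior of the overridden `_tokenizer` above (standard-library `re` only, without the OOV handling) works like this:

```python
import re

def tokenize(text, token_pattern=r"(?u)\b\w\w+\b"):
    # Collapse standalone numbers to a placeholder, then extract tokens
    text = re.sub(r"\b[0-9]+\b", "__NUMBER__", text)
    return re.compile(token_pattern).findall(text)

print(tokenize("order 42 pizzas"))  # ['order', '__NUMBER__', 'pizzas']
```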
The class above wraps sklearn's existing CountVectorizer with a few changes, such as the custom tokenizer you can see at line 205. After training, the instance of this class is saved to *.pkl:
239     def persist(self, model_dir):
240         # type: (Text) -> Dict[Text, Any]
241         """Persist this model into the passed directory.
242         Returns the metadata necessary to load the model again."""
243
244         featurizer_file = os.path.join(model_dir, self.name + ".pkl")
245         utils.pycloud_pickle(featurizer_file, self)
246         return {"featurizer_file": self.name + ".pkl"}
But in the generated *.pkl I found that the full path of my local file is stored inside:
4 MethodType~T~E~TR~Th,~L^N_fill_function~T~S~T(h,~L^O_make_skel_func~T~S~Th.~L^HCodeType~T~E~TR~T(K^BK^@K^DK^DK^CCtt^@j^Ad^Ad^B|^A~C^C}^At^@j^B~H^@j^C~C^A}^B|^Bj^D|^A~C^A}^C~H^@j^Erpt^F~H^@j^Gd^C~C^BrX~H^@j^E~H^@j^Gj^Hk^Frp~G^@f^Ad^Dd^E~D^H|^CD^@~C^A}^Cn^X~H^@j rp~G^@f^Ad^Fd^E~D^H|^CD^@~C^A}^C|^CS^@~T(~L%Override tokenizer in CountVectorizer~T~L
5 \b[0-9]+\b~T~L
6 __NUMBER__~T~L^Kvocabulary_~Th8(K^AK^@K^BK^DK^SC&g^@|^@]^^}^A|^A~H^@j^@j^Aj^B~C^@k^Fr^\|^An^D~H^@j^C~Q^Bq^DS^@~T)(h^^h=~L^Dkeys~Th^Xt~T~L^B.0~T~L^At~T~F~T~L~Y/home/<my_local_path>/rasa_nlu/rasa_nlu/featurizers/count_vectors_featurizer.py~T~L
7 <listcomp>~TK~VC^B^F^A~T~L^Dself~T~E~T)t~TR~T~L5CountVectorsFeaturizer._tokenizer.<locals>.<listcomp>~Th8(K^AK^@K^BK^DK^SC g^@|^@]^X}^A|^A~H^@j^@k^Fr^X~H^@j^An ^B|^A~Q^Bq^DS^@~T)h^Yh^X~F~ThAhB~F~ThDhEK~\C^B^F^A~ThG~E~T)t~TR~Tt~T(~L^Bre~T~L^Csub~T~L^Gcompile~Th^G~L^Gfindall~Th^X~L^Ghasattr~Th^^h=h^Yt~T(hG~L^Dtext~Th^G~L^Ftokens~Tt~ThD~L
8 _tokenizer~TK~JC^X^@^B^N^B^L^A
9 ^B^F^A^L^B^N^B
I tried printing the contents of the generated *.pkl; here it is:
{'OOV_token': None,
'OOV_words': [],
'component_config': {'OOV_token': None,
'OOV_words': [],
'lowercase': True,
'max_df': 1.0,
'max_features': None,
'max_ngram': 2,
'min_df': 0.0,
'min_ngram': 1,
'name': 'intent_featurizer_count_vectors',
'stop_words': ['how',
'what',
'hows',
'is',
'the',
'whats'],
'strip_accents': None,
'token_pattern': '(?u)\\b\\w\\w+\\b'},
'lowercase': True,
'max_df': 1.0,
'max_features': None,
'max_ngram': 2,
'min_df': 0.0,
'min_ngram': 1,
'stop_words': ['how', 'what', 'hows', 'is', 'the', 'whats'],
'strip_accents': None,
'token_pattern': '(?u)\\b\\w\\w+\\b',
'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
ngram_range=(1, 2), preprocessor=None,
stop_words=['how', 'what', 'hows', 'is', 'the', 'whats'],
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=<bound method CountVectorsFeaturizer._tokenizer of <rasa_nlu.featurizers.count_vectors_featurizer.CountVectorsFeaturizer object at 0x7ffff67f96a0>>,
vocabulary=None)}
I am trying to understand why the local path is saved in there. My guess is that it is caused by the callable passed as "tokenizer" at line 205, but I do not know why.
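This guess can be checked with the standard library alone. Every Python function carries a code object whose `co_filename` attribute records the file it was defined in, and cloudpickle serializes functions by value (unlike plain pickle, which stores only a reference), so that code object, filename included, ends up in the .pkl bytes. `sample_tokenizer` below is a hypothetical stand-in for the bound `_tokenizer` method:

```python
import re

def sample_tokenizer(text):
    # hypothetical stand-in for CountVectorsFeaturizer._tokenizer
    return re.findall(r"(?u)\b\w\w+\b", text)

# The code object records the path of the file the function was defined in;
# cloudpickle embeds the whole code object when pickling a function by value.
print(sample_tokenizer.__code__.co_filename)
```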
I hope someone can help me, thanks.
Answer 0 (score: 0)
This issue is fixed in newer versions of rasa (see here for the code). So please consider upgrading to Rasa 1.x, e.g. by running pip install rasa.
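Independent of upgrading, a generic workaround for this class of problem (a sketch only, not Rasa's actual fix; the class and attribute names here are made up) is to keep the callable out of the pickled state via `__getstate__`/`__setstate__` and re-attach it after loading, so no function or code object is ever serialized:

```python
import pickle
import re

class SketchFeaturizer:
    """Hypothetical featurizer that stores a callable on the instance."""

    def __init__(self):
        self.tokenizer = self._tokenize  # bound method held as an attribute

    def _tokenize(self, text):
        return re.findall(r"(?u)\b\w\w+\b", text)

    def __getstate__(self):
        # Drop the callable so no function/code object is pickled
        state = self.__dict__.copy()
        state.pop("tokenizer", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.tokenizer = self._tokenize  # re-attach after load

restored = pickle.loads(pickle.dumps(SketchFeaturizer()))
print(restored.tokenizer("hello world 42"))  # ['hello', 'world', '42']
```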