Question

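I am running into an inconsistency with CountVectorizer from the ml feature package. When I reproduce the CountVectorizer result and its accompanying vocabulary, I get different results.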

The underlying problem is that I get different LDA results when I run the same model with the same seed.

# Import packages
from pyspark.ml.feature import CountVectorizer, IDF

# Fit the first model
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=2.0)
model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = model.transform(tokenized_stopwords_sample_df)
vocabArray = model.vocabulary

# Fit a second model with identical parameters on the same DataFrame
countVectors_new = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=2.0)
model_new_cv = countVectors_new.fit(tokenized_stopwords_sample_df)
result_tf_new = model_new_cv.transform(tokenized_stopwords_sample_df)
vocabArray_new = model_new_cv.vocabulary

# Check whether both vocabularies are the same
set(vocabArray_new) == set(vocabArray)
# Result: False

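From this result I conclude that the ml package's CountVectorizer does not produce stable, reproducible results even though the input column is identical. Can anyone help, or suggest an alternative way to compute count vectors in PySpark?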
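One possible explanation, not verified against this data, is that several terms share the same count at the vocabSize cutoff, so which of them make it into the vocabulary depends on the order in which partitions are processed. Under that assumption, a workaround is to build the vocabulary yourself with an explicit tie-break and hand it to a fixed model. The sketch below is plain Python; `build_vocabulary` is a hypothetical helper, not part of PySpark, and it assumes the token lists fit in driver memory. It ranks terms by total corpus count and breaks ties alphabetically.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size=5000, min_df=2):
    """Return a reproducible vocabulary from an iterable of token lists.

    Terms must appear in at least `min_df` documents. They are ranked by
    total count across the corpus (descending); ties are broken
    alphabetically so the result does not depend on input order.
    """
    term_counts = Counter()  # total occurrences of each term
    doc_freq = Counter()     # number of documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_freq.update(set(tokens))

    eligible = [t for t in term_counts if doc_freq[t] >= min_df]
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]


# Hypothetical usage with the DataFrame from the question (PySpark 2.4+ assumed):
# from pyspark.ml.feature import CountVectorizerModel
# col = "requester_instruction_words_filtered_complete"
# docs = [row[0] for row in tokenized_stopwords_sample_df.select(col).collect()]
# fixed_model = CountVectorizerModel.from_vocabulary(
#     build_vocabulary(docs), inputCol=col, outputCol="raw_features")
# result_tf = fixed_model.transform(tokenized_stopwords_sample_df)
```

Because the vocabulary list is fixed before the model is created, the term-to-index mapping stays the same across runs, which should also remove one source of variation in the downstream LDA step.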

Inconsistent results when reproducing CountVectorizer

0 answers: