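I have run into an inconsistency with CountVectorizer from the pyspark.ml.feature package. When I fit CountVectorizer a second time on the same data, the result and the accompanying vocabulary are different.
This is the root of a larger problem: when I run the same LDA model (with the same seed), I get different results.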
# Import packages
from pyspark.ml.feature import CountVectorizer, IDF
# Compute first model
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=2.0)
model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = model.transform(tokenized_stopwords_sample_df)
vocabArray = model.vocabulary
# Compute new model with identical settings
countVectors_new = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=2.0)
model_new_cv = countVectors_new.fit(tokenized_stopwords_sample_df)
result_tf_new = model_new_cv.transform(tokenized_stopwords_sample_df)
vocabArray_new = model_new_cv.vocabulary
# Check whether both vocabularies contain the same terms
set(vocabArray_new) == set(vocabArray)
# Result: False
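From this result I conclude that, even though the input column is the same, the CountVectorizer in the ml package does not produce stable, reproducible results. Can anyone help, or suggest an alternative way to compute count vectors in pyspark?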
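One possible explanation, offered as an assumption rather than something confirmed by the output above: if several terms have the same count at the vocabSize cutoff, which of them make it into the vocabulary may depend on the order in which the data is processed. A way to rule this out is to build the vocabulary yourself with an explicit tie-break. The sketch below is plain Python and only illustrates the idea; the function name `deterministic_vocabulary` and the toy documents are hypothetical and not part of the original code.

```python
from collections import Counter

def deterministic_vocabulary(docs, vocab_size=5000, min_df=2):
    """Pick the top terms by total count, breaking ties alphabetically."""
    term_counts = Counter()  # total occurrences of each term
    doc_counts = Counter()   # number of documents containing each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    # Keep terms that appear in at least min_df documents.
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    # Sort by descending count, then by the term itself, so equal counts
    # always come out in the same order regardless of input order.
    eligible.sort(key=lambda t: (-term_counts[t], t))
    return eligible[:vocab_size]

# Toy data: "a" and "c" occur 3 times each, "b" and "d" twice each,
# so there is a tie exactly at the cutoff when vocab_size=3.
docs = [["a", "b", "c"], ["b", "c", "a"], ["c", "d"], ["d", "a"]]
vocab_1 = deterministic_vocabulary(docs, vocab_size=3)
vocab_2 = deterministic_vocabulary(list(reversed(docs)), vocab_size=3)
print(vocab_1 == vocab_2)

# Assumption: on Spark 2.4 or later, a fixed list like this could be used
# to build the model directly instead of calling fit():
#   from pyspark.ml.feature import CountVectorizerModel
#   model = CountVectorizerModel.from_vocabulary(
#       vocab_1,
#       inputCol="requester_instruction_words_filtered_complete",
#       outputCol="raw_features")
```

Collecting tokens on the driver like this is only practical for a sample. For a full dataset the same counting and sorting would have to be done with DataFrame aggregations, but the principle is the same: a fixed, explicitly ordered vocabulary gives the same term indices on every run, which LDA needs in order to be comparable across runs.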