"string indices must be integers" with TF-IDF vectorizer

Time: 2018-10-02 15:37:24

Tags: python-3.x pandas scikit-learn tfidfvectorizer

I am using scikit-learn for TF-IDF, and this code ran fine until I upgraded scikit-learn to version 2.0.0. Now I get the following error:

TypeError: string indices must be integers

I get this even though I have not changed anything in my code!

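import time

import numpy as np
from nltk.corpus import stopwords
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion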
print("\n[TF-IDF] Term Frequency Inverse Document Frequency Stage")
english_stop = set(stopwords.words("english"))

tfidf_para = {
    "stop_words": english_stop,
    "analyzer": "word",
    "token_pattern": r'\w{1,}',
    "sublinear_tf": True,
    "dtype": np.float32,
    "norm": "l2",
    #"min_df":5,
    #"max_df":.9,
    #"use_idf ":False,
    "smooth_idf":False
}
# Build a preprocessor that pulls one field out of a record dict
def get_col(col_name):
    return lambda x: x[col_name]
vectorizer = FeatureUnion([
        ("description", TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=16000,
            **tfidf_para,
            use_idf=False,
            preprocessor=get_col("description"))),
        ("title", TfidfVectorizer(
            ngram_range=(1, 2),
            **tfidf_para,
            use_idf=False,
            #max_features=7000,
            preprocessor=get_col("title")))
    ])

start_vect=time.time()
vectorizer.fit(df.loc[df.index,:].to_dict("records"))
ready_df = vectorizer.transform(df.to_dict("records"))
tfvocab = vectorizer.get_feature_names()
print("Vectorization Runtime: %0.2f Minutes"%((time.time() - start_vect)/60))

Here is a sample of my data in dictionary format:

[{'title': 'title1',
  'description': 'description1'},
 {'title': 'title2 ',
  'description': 'description2'}]
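
For reference, the error message itself can be reproduced without scikit-learn. The sketch below reuses the get_col helper from the code above and calls it first on a record dict and then on a plain string. The string input is a hypothetical: nothing in this post confirms that the upgraded library actually passes plain strings to the preprocessor, so treat this only as an illustration of when the helper raises a TypeError.

```python
def get_col(col_name):
    # Same helper as in the question: returns a function that indexes by key
    return lambda x: x[col_name]


record = {"title": "title1", "description": "description1"}
get_title = get_col("title")

# Called on a dict record, the helper returns the field as intended
title_from_dict = get_title(record)
print(title_from_dict)

# Called on a plain string (hypothetical input), indexing with a str key fails
error_type = None
try:
    get_title("some plain string")
except TypeError as exc:
    error_type = type(exc).__name__
    print("TypeError:", exc)
```

If something like this is what happens after the upgrade, a helper that returns string inputs unchanged would avoid the exception, but that is an untested guess.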

Does anyone have any insight into what I am missing here? Thanks! :)

0 answers:

No answers yet