Question

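I am using sklearn's TfidfVectorizer for text classification.

I know this vectorizer expects raw text as input, but passing a list of strings works (see input1).

However, if I pass a list of lists (or a list of sets), I get the AttributeError shown below.

Does anyone know how to fix this? Thanks in advance!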

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    input1 = ["This", "is", "a", "test"]
    input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

    print(vectorizer.fit_transform(input1)) #works
    print(vectorizer.fit_transform(input2)) #gives Attribute error

input 1:
  (3, 0)    1.0

input 2:

    Traceback (most recent call last):
      File "", line 1, in
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
        self.fixed_vocabulary_)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
        for feature in analyze(doc):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'

Answer 1

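Note that input1 works, but each element of the list (each string) is treated as a separate document to vectorize. That is why the output contains a single entry at row 3: only the fourth "document", "test", survives the English stop-word filter.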
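To make this concrete, here is a small sketch that inspects what the vectorizer learned from input1. The variable name `X1` is introduced only for illustration; the rest is taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]

# Each string in the list is one document, so this fits four documents.
X1 = vectorizer.fit_transform(input1)

print(X1.shape)                        # (4, 1): four documents, one term
print(sorted(vectorizer.vocabulary_))  # ['test']: the other words are stop words
```

The three all-zero rows correspond to the documents "This", "is" and "a", which are removed as English stop words.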

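In the case of input2, I assume you want to vectorize each "sentence" (sublist). Each document must be a string, so one solution is to join the tokens of every sublist with a list comprehension:

    input2_corrected = [" ".join(x) for x in input2]

which produces

    ['This is a test', 'It is raining today']

and no longer raises the AttributeError when passed to fit_transform.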
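Putting it together, the sketch below shows the join-based fix end to end, plus an alternative that keeps the documents as token lists by passing a callable as `analyzer`. The names `docs`, `identity_analyzer`, `vec_joined` and `vec_tokens` are made up for this example. Note the assumption behind the second option: with a callable analyzer, the vectorizer does no lowercasing, tokenizing or stop-word removal of its own, so the tokens are used exactly as given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# Option 1: join each token list into one string per document.
# This also works for sets, since word order does not matter for unigrams.
docs = [" ".join(tokens) for tokens in input2]
vec_joined = TfidfVectorizer(min_df=1, stop_words="english")
X_joined = vec_joined.fit_transform(docs)
print(X_joined.shape)  # (2, 3): terms are raining, test, today

# Option 2: keep the documents pre-tokenized and hand them through unchanged.
def identity_analyzer(tokens):
    """Return the document's tokens as-is (hypothetical helper)."""
    return tokens

vec_tokens = TfidfVectorizer(analyzer=identity_analyzer)
X_tokens = vec_tokens.fit_transform(input2)
print(X_tokens.shape)  # (2, 7): no stop-word filtering or lowercasing applied
```

Option 1 is the simpler choice when the default preprocessing is wanted. Option 2 may be preferable when the tokens were already cleaned elsewhere and should not be altered.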

How to use a list of lists or a list of sets with TfidfVectorizer?
