Question

I'm working on several clustering projects. In my current clustering project, I'm getting an error related to my tfidf_vectorizer code.

Here are the documents I imported:

description_1 = open('description1.txt',
                     encoding="utf8").read().lower().split('\n')
description_2 = open('description2.txt',
                     encoding="utf8").read().lower().split('\n')
description_3 = open('description3.txt',
                     encoding="utf8").read().lower().split('\n')
description_4 = open('description4.txt',
                     encoding="utf8").read().lower().split('\n')
description_5 = open('description5.txt',
                     encoding="utf8").read().lower().split('\n')
description_6 = open('description6.txt',
                     encoding="utf8").read().lower().split('\n')
description_7 = open('description7.txt',
                     encoding="utf8").read().lower().split('\n')

Then I combined the documents:

descriptions_on = (description_1, description_2, description_3, 
description_4, description_5, description_6, description_7)

descriptions = []

for i in range(len(descriptions_on)):
    item = descriptions_on[i]
    descriptions.append(item)

The problem occurs somewhere in these lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
#from warnings import filterwarnings
#filterwarnings('ignore')

final_stopwords_list = list(fr_stop) + list(en_stop)

tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
                                   min_df=0.10, stop_words=final_stopwords_list,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))


%time tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)

tokenizer = tokenize_and_stem refers to a function I wrote, "tokenize_and_stem", which is not included in the code listing for this question.

This is the error message I get from the code above:

AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1611         """
   1612         self._check_params()
-> 1613         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1614         self._tfidf.fit(X)
   1615         # X is already a transformed view of raw_documents so

D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1029 
   1030         vocabulary, X = self._count_vocab(raw_documents,
-> 1031                                           self.fixed_vocabulary_)
   1032 
   1033         if self.binary:

D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    941         for doc in raw_documents:
    942             feature_counter = {}
--> 943             for feature in analyze(doc):
    944                 try:
    945                     feature_idx = vocabulary[feature]

D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    327                                                tokenize)
    328             return lambda doc: self._word_ngrams(
--> 329                 tokenize(preprocess(self.decode(doc))), stop_words)
    330 
    331         else:

D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
    255 
    256         if self.lowercase:
--> 257             return lambda x: strip_accents(x.lower())
    258         else:
    259             return strip_accents

AttributeError: 'list' object has no attribute 'lower'

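Obviously, I would like the code to run without errors. I have tried several ways to fix this. I even read 6-7 posts on Stack Overflow about the same problem, but each of them differed from mine in some way...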

Any help would be greatly appreciated. Thanks!

Edit:

print(descriptions)

Result:

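[['\ufefffounded in montreal in 1991, dex is a true pioneer of affordable luxury.', 'sashana sanderson, guest, a second-year journalism student at the caribbean institute of media and communication (carimac), located at the university of the west indies...', 'diamentis offers a simple and reliable solution for personalized medicine. more precisely, diamentis is developing the first diagnostic tool for mental health care, which will give clinicians more...', 'dactacte authorized training provider. create online courses on a cloud-based portal within minutes.', 'di-o-matic: discover the technology behind your favorite cg characters', 'the talent and quality of professionals, and assures you', 'dk-spec participe au congrès de montréal sur le bois, march 20 to 22, at the fairmont le reine elizabeth. venez nous rencontrer!', 'the goal of do networks limited is to be a one-stop service provider for customers of businesses with expertise in optical communication lines; the entire donetworks team is dedicated to...', 'douglas consultants inc. douglas consultants, structural consultants since 1999.', 'cooper is currently a board member of dream, dream office real estate investment trust, dream global real estate investment trust, dream industrial real estate investment trust and el financial corporation limited. ms. p. jane gavan is president of asset management at dream and has more than 30 years of experience in the real estate industry.', "dromadaire géo-innovations est firme spécialisée en géomatique. au service de l'environnement, l'entreprise uses géo-localisation pour faire...", 'experts, experts, experts, experts, experts and experts...', 'dundee sustainable technologies (dst) is engaged in the development and commercialization of environmentally friendly technologies for processing materials in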

Answer 1

Your documents are lists of lines, not strings.

So they are not in the right format for the tokenizer.
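To make the format problem concrete, here is a minimal sketch that needs no scikit-learn. It builds one document the way the question does (read, lower, split on newlines) and then calls .lower() on it by hand, which is the same call the traceback shows at line 257 of text.py. The sample text and the name error_message are made up for illustration.

```python
# Stand-in for one of the asker's documents: read().lower().split('\n')
# returns a list of lines, not a single string (sample text is made up).
document = "First line\nSecond line".lower().split("\n")
descriptions = [document]

# Each element of descriptions is itself a list.
print(type(descriptions[0]).__name__)

# The preprocessing step shown in the traceback calls x.lower() on each
# document; doing the same by hand reproduces the failure.
error_message = None
try:
    descriptions[0].lower()
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If descriptions[0] were a plain string, the .lower() call would succeed and no exception would be raised.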

That is what the error is telling you: a 'list' object has no attribute 'lower'.
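One possible way to get a flat list of strings is sketched below. It assumes the same nested shape as descriptions_on in the question; the sample text and the names line_documents and file_documents are hypothetical. Which option fits depends on whether each line or each file should count as one document, which the question does not state.

```python
from itertools import chain

# Stand-in for the asker's data: each file was read and split into lines,
# so every element is a list of strings (sample text is made up).
descriptions_on = (
    ["first company description.", "second company description."],
    ["third company description."],
)

# Option A: treat every line as its own document (flatten one level).
line_documents = list(chain.from_iterable(descriptions_on))

# Option B: treat every file as a single document (join its lines).
file_documents = [" ".join(lines) for lines in descriptions_on]

# Either result is a flat list of strings rather than a list of lists.
assert all(isinstance(doc, str) for doc in line_documents)
assert all(isinstance(doc, str) for doc in file_documents)

print(len(line_documents), len(file_documents))
```

Either list could then be passed to tfidf_vectorizer.fit_transform(...) in place of descriptions. With Option A, empty strings produced by trailing newlines may be worth filtering out first.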
