AttributeError: 'list' object has no attribute 'lower': clustering

Time: 2018-07-24 11:41:01

Tags: python python-3.x pandas scikit-learn

I am trying to do clustering. I am working with pandas and sklearn.

import pandas
import pprint
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pandas.read_csv('text.csv', encoding='utf-8')

dataset_list = dataset.values.tolist()

vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)

clusters_number = 20

model = KMeans(n_clusters=clusters_number, init='k-means++', max_iter=300, n_init=1)

model.fit(X)

centers = model.cluster_centers_
labels = model.labels_

# Group the comments by their assigned cluster label
clusters = {}
for comment, label in zip(dataset_list, labels):
    print('Comment:', comment)
    print('Label:', label)

    try:
        clusters[str(label)].append(comment)
    except KeyError:
        clusters[str(label)] = [comment]
pprint.pprint(clusters)

However, even though I never call lower(), I get the following error:

  File "clustering.py", line 19, in <module>
    X = vetorizer.fit_transform(dataset_list)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

I don't understand: my text (text.csv) is already lowercase, and I never call lower().

Data:

  

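hello wish to cancel order thank you confirmation
hello would like to cancel order made today store house world
dimensions bed not compatible would like to know how to pass cancellation refund send today cordially
hello possible to cancel order cordially
hello want to cancel order request refund
hello wish to cancel this order possible cordially being processed
hello date delivery after would like to cancel order thank you
hello would like to cancel order delivery date matching n 111111
hello would like to cancel this order
hello ordered product store cancel doublon thank you in advance
hello wish to cancel order thank you refund
hello possible to cancel order thank you in advance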

1 Answer:

Answer 0: (score: 1)

The error is in this line:

dataset_list = dataset.values.tolist()

dataset is a pandas DataFrame, so dataset.values is a 2-dimensional array of shape (n_rows, 1), even though there is only one column. Calling tolist() on it then produces a list of lists, like this:

print(dataset_list)

[['hello wish to cancel order thank you confirmation'],
 ['hello would like to cancel order made today store house world'],
 ['dimensions bed not compatible would like to know how to pass cancellation refund send today cordially'],
 ...]

As you can see, there are two levels of square brackets.
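The nesting can be reproduced without the CSV file. A minimal sketch, assuming a hypothetical single-column DataFrame (the column name text and the two rows are made up for illustration, standing in for text.csv):

```python
import pandas as pd

# Hypothetical stand-in for text.csv: a DataFrame with a single column.
df = pd.DataFrame({"text": ["hello wish to cancel order",
                            "hello would like to cancel order"]})

values = df.values            # 2-D NumPy array, one row per CSV row
nested = values.tolist()      # list of one-element lists
flat = values.ravel().tolist()  # flat list of strings

print(values.shape)
print(nested)
print(flat)
```

Each element of nested is itself a list, while each element of flat is a plain string.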

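TfidfVectorizer expects a flat list of sentences (strings), not a list of lists. By default (lowercase=True) it lowercases every document before tokenizing, which is where the lower() call in the traceback comes from even though you never call it yourself. It assumes each inner item is a sentence, but here each one is a list, hence the error.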
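To see why the message mentions lower(), here is a rough illustration in plain Python. The preprocess function below is a hypothetical, simplified stand-in for the vectorizer's lowercasing step, not scikit-learn's actual code:

```python
def preprocess(doc):
    # Simplified stand-in: lowercase one document, as the vectorizer
    # does by default before tokenizing.
    return doc.lower()


nested = [["hello wish to cancel order"], ["hello would like to cancel order"]]
flat = [row[0] for row in nested]

# Works: every document is a str.
lowered = [preprocess(doc) for doc in flat]
print(lowered)

# Fails: every "document" is a list, which has no lower() method.
try:
    preprocess(nested[0])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second call fails in the same way as the original script, because the item handed to the lowercasing step is a list rather than a string.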

So you just need to do this:

# Use ravel to convert 2-d to 1-d array
dataset_list = dataset.values.ravel().tolist()

OR

# Replace `column_name` with your actual column header;
# selecting a single column gives a Series instead of a DataFrame
dataset_list = dataset['column_name'].values.tolist()
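Putting it together, the corrected pipeline might look like the sketch below. It assumes a small in-memory DataFrame with a hypothetical column named text in place of text.csv, and it uses 2 clusters with a fixed random_state only because the sample is tiny; neither choice comes from the question.

```python
from collections import defaultdict

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for text.csv: one column, here named "text".
dataset = pd.DataFrame({"text": [
    "hello wish to cancel order thank you confirmation",
    "hello would like to cancel order made today",
    "dimensions bed not compatible would like refund",
    "hello want to cancel order request refund",
]})

# Selecting the column yields a Series, so tolist() gives a flat list of str.
documents = dataset["text"].tolist()

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# 2 clusters and random_state=0 are illustrative choices for this tiny sample.
model = KMeans(n_clusters=2, init="k-means++", max_iter=300,
               n_init=1, random_state=0)
model.fit(X)

# Group the comments by cluster label; defaultdict avoids the try/except.
clusters = defaultdict(list)
for comment, label in zip(documents, model.labels_):
    clusters[int(label)].append(comment)

print(dict(clusters))
```

With the real file, replace the in-memory DataFrame with the read_csv call from the question and use the actual column header.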