Clustering a column with similar descriptions in Python

Date: 2017-09-06 21:05:37

Tags: python-2.7 pandas scikit-learn nltk cluster-analysis

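One column in my dataset contains descriptions of car accidents. Many of the descriptions are worded inconsistently but mean the same thing. For example, here are the first 7 rows of the variable Descriptions (my actual dataset has more than 17,000 rows):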

Descriptions
CLMT REAR ENDED IV
claimant REAR ENDED IV
CLM'R EAR ENDED IV
4 way stop sgn
CLM'T  rear-ended IV
IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T
IV Stopped at red light when IV was R/E by OV

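Here, CLMT REAR ENDED IV and claimant REAR ENDED IV mean the same thing but are spelled slightly differently. I want to generate a variable that groups them into the same category. The end goal looks like this: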

Descriptions                                    clusterGroup
CLMT REAR ENDED IV                                cluster1
claimant REAR ENDED IV                            cluster1
CLM'R EAR ENDED IV                                cluster1
4 way stop sgn                                    cluster2
CLM'T  rear-ended IV                              cluster1
IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T   cluster3
IV Stopped at red light when IV was R/E by OV     cluster3

I know the following is wrong. I don't know how to use scikit-learn's KMeans, or how to turn each row into an nltk sentence and then cluster:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import nltk

df = pd.read_csv('dataset.csv')
documents = df['Descriptions'].apply(nltk.sent_tokenize)    

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 50
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
predict=model.predict(X)   
df['clusterGroup'] = Series(predict, index=X.index)

When I run the script above, I get the following error:

AttributeError: 'list' object has no attribute 'lower'

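Given that each row of the Descriptions pandas column is a single sentence, how can I use nltk to break these into sentences so that I can run k-means or some other clustering algorithm on them? Any help or guidance would be greatly appreciated.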

1 Answer:

Answer 0 (score: 0)

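Unsupervised methods like k-means will always perform poorly on this task.

That is because the task is all about understanding language, and that understanding will not come from such a small amount of data alone. You can do impressive things with statistical analysis of language (see Google Assistant), but that takes billions of documents for training. Even then, it probably involves a large amount of labeled training data.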
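With that caveat in mind, the AttributeError in the question has a mechanical cause: TfidfVectorizer expects an iterable of strings, but applying nltk.sent_tokenize turns every row into a list, so the vectorizer ends up calling .lower() on a list. Each row is already one sentence, so the column can be passed in directly. The script also uses Series without importing it, and X is a sparse matrix with no .index attribute. Below is a minimal sketch of a rough baseline, not a definitive solution. It assumes a hand-built abbreviation map (ABBREVIATIONS is hypothetical and would need to come from inspecting the real data), swaps in character n-grams so that near-identical spellings share features, and uses 3 clusters only because the 7 sample rows suggest 3 groups.

```python
import re

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abbreviation map, for illustration only. Keys are written in
# the form they take AFTER punctuation has been replaced by spaces.
ABBREVIATIONS = {
    "clmt": "claimant",
    "clm t": "claimant",
    "r e": "rear ended",
    "sgn": "sign",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    for short, full in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text


def cluster_descriptions(descriptions, n_clusters):
    """Return one integer cluster label per description."""
    # A plain list of strings is what TfidfVectorizer expects.
    cleaned = [normalize(d) for d in descriptions]
    # Character n-grams tolerate small spelling differences better than words.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vectorizer.fit_transform(cleaned)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(X)


df = pd.DataFrame({"Descriptions": [
    "CLMT REAR ENDED IV",
    "claimant REAR ENDED IV",
    "CLM'R EAR ENDED IV",
    "4 way stop sgn",
    "CLM'T  rear-ended IV",
    "IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T",
    "IV Stopped at red light when IV was R/E by OV",
]})

# fit_predict returns an array in row order, so it can be assigned directly.
df["clusterGroup"] = cluster_descriptions(df["Descriptions"], n_clusters=3)
print(df)
```

The labels are arbitrary integers, and nothing guarantees they will match the grouping in the question. On the full 17,000 rows the number of clusters would have to be chosen by inspection, and descriptions that share words but differ in meaning may still land together, which is the limitation described above.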