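One column in my dataset holds descriptions of car accidents. Many of the descriptions are worded inconsistently but mean the same thing. For example, consider the first 7 rows of the variable labeled Descriptions (my actual dataset has more than 17,000 rows):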
Descriptions
CLMT REAR ENDED IV
claimant REAR ENDED IV
CLM'R EAR ENDED IV
4 way stop sgn
CLM'T rear-ended IV
IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T
IV Stopped at red light when IV was R/E by OV
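Here CLMT REAR ENDED IV and claimant REAR ENDED IV mean the same thing, but they are spelled slightly differently. I would like to generate a variable that groups them into the same category. The end goal looks like this: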
Descriptions                                      clusterGroup
CLMT REAR ENDED IV                                cluster1
claimant REAR ENDED IV                            cluster1
CLM'R EAR ENDED IV                                cluster1
4 way stop sgn                                    cluster2
CLM'T rear-ended IV                               cluster1
IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T   cluster3
IV Stopped at red light when IV was R/E by OV     cluster3
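I know the following is wrong. I do not know how to use scikit-learn's KMeans here, or how to turn each row into an nltk sentence and then cluster: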
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import nltk
df = pd.read_csv('dataset.csv')
documents = df['Descriptions'].apply(nltk.sent_tokenize)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 50
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
predict=model.predict(X)
df['clusterGroup'] = Series(predict, index=X.index)
When I run the script above, I get the following error:
AttributeError: 'list' object has no attribute 'lower'
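Given that each row of the Descriptions pandas column is a sentence, how can I use nltk to break these into sentences so that I can run k-means or some other clustering algorithm? Any help or guidance would be greatly appreciated.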
Answer 0 (score: 0)
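Unsupervised methods such as k-means will always perform poorly on this task, because it is entirely about understanding language, and that will not come out of such a small amount of data alone. You can do impressive things with statistical analysis of language (see Google Assistant), but that requires billions of documents for training. Even then, it probably involves a large amount of labeled training data.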
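Separately from the question of cluster quality, the AttributeError itself has a simple cause: nltk.sent_tokenize returns a list for every row, and TfidfVectorizer calls .lower() on each document, so it needs plain strings. The sketch below only removes that error and writes the labels back by row position. It is not a claim that the resulting clusters are good. The inline DataFrame stands in for dataset.csv, and true_k = 3 and random_state=0 are assumptions chosen for the 7-row sample.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# The seven sample rows from the question stand in for dataset.csv.
df = pd.DataFrame({"Descriptions": [
    "CLMT REAR ENDED IV",
    "claimant REAR ENDED IV",
    "CLM'R EAR ENDED IV",
    "4 way stop sgn",
    "CLM'T rear-ended IV",
    "IV STOPPED AT RED LIGHT WAS REAR ENDED BY CLM'T",
    "IV Stopped at red light when IV was R/E by OV",
]})

# Each row is already one sentence, so pass the raw strings through.
# sent_tokenize would turn every row into a list, which triggers
# "'list' object has no attribute 'lower'" inside the vectorizer.
documents = df["Descriptions"].astype(str)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Assumption: 3 clusters for this tiny sample (the question used 50).
true_k = 3
model = KMeans(n_clusters=true_k, init="k-means++", max_iter=100,
               n_init=1, random_state=0)

# fit_predict returns one integer label per row, in row order, so it can
# be assigned directly; the sparse matrix X has no .index attribute.
df["clusterGroup"] = model.fit_predict(X)
print(df)
```

The labels are arbitrary integers rather than names like cluster1, and spelling variants such as CLMT, CLM'T and claimant are still treated as unrelated tokens, which is the limitation described above.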