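I am using Python 3.6 and have run into a problem. I have a DataFrame named test_data_sample with two columns, "User" and "Text". There are several different users, and some users have written more than one text. Here is a sample: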
User Text
user1 legit thank later
user1 I dont care
user2 Fried eggs
User3 it should be ok
User4 I do not like his assumptions
User4 I hate rugby
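I have a model with 3 cluster centroids, and I want to compute the distance between each "Text" and the centroids. The code so far works, but the problem I am facing is getting the average of these distances for each user. The current output looks like this: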
user1 legit thank later
Distance to cluster 0.3
Distance to cluster 0.6
Distance to cluster 0.4
user1 I dont care
Distance to cluster 0.1
Distance to cluster 0.9
Distance to cluster 0.80
user2 Fried eggs
Distance to cluster 0.4
Distance to cluster 0.4
Distance to cluster 0.33
User3 it should be ok
Distance to cluster 0.4
Distance to cluster 0.54
Distance to cluster 0.6
User4 I do not like his assumptions
Distance to cluster 0.3
Distance to cluster 0.34
Distance to cluster 0.1
User4 I hate rugby
Distance to cluster 0.6
Distance to cluster 0.4
Distance to cluster 0.5
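For reference, per-text distances like the ones listed above could be computed along the following lines. This is only a sketch: the post does not show the model or how the texts are vectorized, so `text_vectors` and `centroids` are made-up placeholder arrays, `distances_to_centroids` is a hypothetical helper, and the choice of Euclidean distance is an assumption.

```python
import numpy as np


def distances_to_centroids(text_vectors, centroids):
    """Return an (n_texts, n_clusters) array of Euclidean distances.

    Hypothetical helper: assumes each text is already a numeric vector
    with the same number of features as the centroids.
    """
    text_vectors = np.asarray(text_vectors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Broadcast to shape (n_texts, n_clusters, n_features), then take the
    # norm over the feature axis.
    diffs = text_vectors[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.linalg.norm(diffs, axis=2)


# Made-up 2-D vectors for two texts and three centroids.
text_vectors = [[0.0, 0.0], [3.0, 4.0]]
centroids = [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]]
dist = distances_to_centroids(text_vectors, centroids)
```

Each row of `dist` corresponds to one text and each column to one cluster, which is a convenient shape for averaging per user afterwards.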
Ideally, I would like the output for a given user to look like this:
user1 legit thank later
Distance to cluster 0.2
Distance to cluster 0.75
Distance to cluster 0.6
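These are the averages of each distance. User 1 has two texts, so the sum of the distances to each cluster is divided by 2. User 4's distances are also divided by 2, while users 2 and 3 stay as they are.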
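The averaging described here can be sketched with pandas `groupby`. This assumes the distances have first been collected into a DataFrame with one row per text. The column names `dist_0`, `dist_1` and `dist_2` are hypothetical, and the values are copied from the sample output in this post.

```python
import pandas as pd

# One row per text; distances copied from the sample output in the post.
# The dist_* column names are hypothetical.
distances = pd.DataFrame({
    "User": ["user1", "user1", "user2", "User3", "User4", "User4"],
    "Text": [
        "legit thank later",
        "I dont care",
        "Fried eggs",
        "it should be ok",
        "I do not like his assumptions",
        "I hate rugby",
    ],
    "dist_0": [0.3, 0.1, 0.4, 0.4, 0.3, 0.6],
    "dist_1": [0.6, 0.9, 0.4, 0.54, 0.34, 0.4],
    "dist_2": [0.4, 0.80, 0.33, 0.6, 0.1, 0.5],
})

# Average each distance column per user. Users with a single text keep
# their original values, because the mean of one row is that row.
dist_cols = ["dist_0", "dist_1", "dist_2"]
per_user = distances.groupby("User", sort=False)[dist_cols].mean()
print(per_user.round(3))
```

Passing `sort=False` keeps the users in their order of first appearance. Note that `groupby` treats "User4" and "user4" as different keys, so the labels may need normalizing first if the casing is inconsistent in the real data.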
Looking forward to your replies.