I am trying to compute the cosine similarity between food-amount vectors for different students. I have a CSV file containing the following:
Student food amount
John apple 15
John banana 20
John orange 1
John grape 3
Ben apple 2
Ben orange 4
Ben strawberry 8
Andrew apple 10
Andrew watermelon 3
The following code:
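import csv
from collections import defaultdict

# Map each student to a {food: amount} dictionary.
data = defaultdict(dict)
with open('data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        # csv yields strings, so convert the amount to an int.
        data[row['Student']][row['food']] = int(row['amount'])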
gives me a structure like this:
{'John': {'apple': 15, 'banana': 20, 'orange': 1, 'grape': 3},
'Ben': {'apple': 2, 'orange': 4, 'strawberry': 8}, #etc.
}
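I want to turn these dictionaries into vectors whose length is the number of unique foods, with any food a student does not eat defaulting to 0, so that:
for John: [15,20,1,3,0,0] corresponds to [apple,banana,orange,grape,strawberry,watermelon]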
for Ben: [2,0,4,0,8,0] corresponds to [apple,banana,orange,grape,strawberry,watermelon] #etc
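I would then output a cosine similarity matrix across all students. Thank you for taking the time to read this. Any help would be greatly appreciated.
Answer 0 (score: 0)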
>>> D = {'John': {'apple': 15, 'banana': 20, 'orange': 1, 'grape': 3},
... 'Ben': {'apple': 2, 'orange': 4, 'strawberry': 8}, #etc.
... }
First, build a sorted list of all the unique keys
>>> all_keys = sorted({k for i in D for k in D[i]})
>>> all_keys
['apple', 'banana', 'grape', 'orange', 'strawberry']
Now you can loop over those keys for each person, using 0 for any food that person does not have
>>> {k:[D[k].get(i, 0) for i in all_keys] for k in D}
{'John': [15, 20, 3, 1, 0], 'Ben': [2, 0, 0, 4, 8]}
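The question also asks for a cosine similarity matrix, which the vectors above make straightforward. The following is a minimal sketch using only the standard library, not a definitive implementation. The names `vectors`, `cosine_similarity`, and `similarity` are illustrative, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

# Vectors as produced by the dict comprehension above (sample data only).
vectors = {'John': [15, 20, 3, 1, 0], 'Ben': [2, 0, 0, 4, 8]}

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a student with no recorded food is similar to no one.
        return 0.0
    return dot / (norm_u * norm_v)

# Nested dict acting as the matrix: similarity[a][b] compares students a and b.
names = list(vectors)
similarity = {a: {b: cosine_similarity(vectors[a], vectors[b]) for b in names}
              for a in names}

# Print one row per student, rounded for readability.
for a in names:
    print(a, [round(similarity[a][b], 3) for b in names])
```

With the two sample students, the John/Ben entry comes out at roughly 0.147 and the diagonal at 1.0. For larger data sets a NumPy or scikit-learn based approach would probably be faster, but the logic is the same.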