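I want to build a DataFrame of pairwise similarities between elements. Each element has a list of tags in the shelves column of the DataFrame, and the similarity score is based on those tags.
Here is data: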
   isbn        title                  author            year  shelves
0  0380795272  Krondor: The Betrayal  Raymond E. Feist  1998  [fantasy, raymond-e-feist, feist, epic-fantasy...
1  1416949658  The Dark Is Rising     Susan Cooper      1973  [young-adult, fantasy, ya, childrens, series, ...
2  1857231082  The Black Unicorn      Terry Brooks      1987  [fantasy, sci-fi-fantasy, series, magic, lando...
3  0553803700  I, Robot               Isaac Asimov      1950  [science-fiction, sci-fi, classics, scifi, sho...
4  080213825X  Four Blondes           Candace Bushnell  2000  [chick-lit, chicklit, chic-lit, contemporary, ...
5  0375913750  Love, Stargirl         Jerry Spinelli    2007  [young-adult, ya, realistic-fiction, romance, ...
6  074349671X  The Tenth Circle       Jodi Picoult      2006  [jodi-picoult, contemporary, chick-lit, adult-...
7  0743454553  Vanishing Acts         Jodi Picoult      2005  [jodi-picoult, chick-lit, contemporary, drama,...
8  0765317508  Aztec                  Gary Jennings     1980  [historical-fiction, historical, history, mexi...
9  0142501085  Marlfox                Brian Jacques     1998  [fantasy, redwall, young-adult, childrens, chi...
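However, while I know how to compute a score, I don't know how to collect the scores into a pairwise similarity DataFrame. I tried the following, using the Jaccard / Tanimoto score (index):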
import pandas as pd

def tanimoto_score(shelves_1, shelves_2):
    # number of tags in shelves_1 that also appear in shelves_2
    intersection_tanimoto = len([x for x in shelves_1 if x in shelves_2])
    union_tanimoto = len(shelves_1) + len(shelves_2)
    return intersection_tanimoto / union_tanimoto

similarity_df = pd.DataFrame()
# for each pair of rows, compare the tags
for index_1, row_1 in data.iterrows():
    for index_2, row_2 in data.iterrows():
        similarity_score = tanimoto_score(data.at[index_1, 'shelves'], data.at[index_2, 'shelves'])
        similarity_df.loc[data.at[index_1, 'title'], data.at[index_2, 'title']] = similarity_score
It returns the pairwise similarity_df, but the result looks wrong, because the similarity of an element with itself is only 0.5:
                       Krondor: The Betrayal  The Dark Is Rising  The Black Unicorn  I, Robot  Four Blondes  Love, Stargirl  The Tenth Circle  Vanishing Acts     Aztec   Marlfox
Krondor: The Betrayal               0.500000            0.119318           0.244318  0.164773      0.090909        0.068182          0.113636        0.096591  0.096591  0.136364
The Dark Is Rising                  0.119318            0.500000           0.164773  0.159091      0.045455        0.232955          0.102273        0.085227  0.073864  0.238636
...
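A likely reason for the 0.5 on the diagonal: tanimoto_score divides by len(shelves_1) + len(shelves_2), which counts every shared tag twice, so a list compared with itself gives n / (2n) = 0.5. The Jaccard index divides by the size of the set union instead. Below is a minimal sketch of that variant, assuming each shelves value is a list of strings; jaccard_score_sets is a hypothetical name and the two tag lists are made up for illustration.

```python
def jaccard_score_sets(tags_1, tags_2):
    """Jaccard index of two tag collections: |A & B| / |A | B|."""
    set_1, set_2 = set(tags_1), set(tags_2)
    union = set_1 | set_2
    if not union:
        # Both tag lists are empty; returning 0.0 here is an arbitrary convention.
        return 0.0
    return len(set_1 & set_2) / len(union)


# Made-up tag lists, for illustration only.
tags_a = ["fantasy", "series", "magic"]
tags_b = ["fantasy", "series", "young-adult", "ya"]

self_score = jaccard_score_sets(tags_a, tags_a)  # identical sets
pair_score = jaccard_score_sets(tags_a, tags_b)  # 2 shared tags out of 5 distinct tags
print(self_score, pair_score)
```

Converting to sets also removes duplicate tags within one list, which the list-based count in tanimoto_score would otherwise count more than once.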
I would like something like this:
                       Krondor: The Betrayal  The Dark Is Rising  ...
Krondor: The Betrayal                      1                0.75
The Dark Is Rising                      0.75                   1
...
So: how can I create a pairwise similarity matrix from the text tags in the rows of a DataFrame?
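One possible way to build such a matrix, sketched under assumptions: the scores are computed once per unordered pair into a NumPy array, mirrored across the diagonal, and wrapped in a DataFrame labelled by title. The toy frame below is a made-up stand-in for data (same title and shelves column names, invented values), and the diagonal is set to 1.0 on the assumption that no tag list is empty.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy stand-in for `data`: made-up titles and tags, same column names as the question.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "shelves": [
        ["fantasy", "series", "magic"],
        ["fantasy", "series", "young-adult", "ya"],
        ["sci-fi", "classics"],
    ],
})

titles = toy["title"].tolist()
tag_sets = [set(tags) for tags in toy["shelves"]]
n = len(tag_sets)

# Start from the identity matrix: self-similarity is 1.0 (assumes non-empty tag lists).
scores = np.eye(n)
for i, j in combinations(range(n), 2):
    union = tag_sets[i] | tag_sets[j]
    score = len(tag_sets[i] & tag_sets[j]) / len(union) if union else 0.0
    # The Jaccard index is symmetric, so fill both triangles at once.
    scores[i, j] = scores[j, i] = score

similarity_toy = pd.DataFrame(scores, index=titles, columns=titles)
print(similarity_toy)
```

Filling an array and constructing the DataFrame once avoids growing similarity_df cell by cell with .loc inside a double loop, and visiting each unordered pair only once roughly halves the number of comparisons. Titles are used as labels here, so duplicate titles would produce duplicate row and column labels.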