Creating a pairwise similarity matrix from text tags in DataFrame rows

Time: 2019-05-20 08:29:31

Tags: python python-3.x pandas dataframe similarity

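I want to build a DataFrame of pairwise similarities between items. Each item has a list of tags in the DataFrame's shelves column, and the similarity score is based on those tags.

Here is data: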

    isbn        title                   author              year    shelves
0   0380795272  Krondor: The Betrayal   Raymond E. Feist    1998    [fantasy, raymond-e-feist, feist, epic-fantasy...
1   1416949658  The Dark Is Rising      Susan Cooper    1973    [young-adult, fantasy, ya, childrens, series, ...
2   1857231082  The Black Unicorn       Terry Brooks    1987    [fantasy, sci-fi-fantasy, series, magic, lando...
3   0553803700  I, Robot                Isaac Asimov    1950    [science-fiction, sci-fi, classics, scifi, sho...
4   080213825X  Four Blondes            Candace Bushnell    2000    [chick-lit, chicklit, chic-lit, contemporary, ...
5   0375913750  Love, Stargirl          Jerry Spinelli  2007    [young-adult, ya, realistic-fiction, romance, ...
6   074349671X  The Tenth Circle        Jodi Picoult    2006    [jodi-picoult, contemporary, chick-lit, adult-...
7   0743454553  Vanishing Acts          Jodi Picoult    2005    [jodi-picoult, chick-lit, contemporary, drama,...
8   0765317508  Aztec                   Gary Jennings   1980    [historical-fiction, historical, history, mexi...
9   0142501085  Marlfox                 Brian Jacques   1998    [fantasy, redwall, young-adult, childrens, chi...

I know how to compute a score, but I don't know how to fill a pairwise similarity DataFrame with the scores. I tried the following, using the Jaccard / Tanimoto score (index):

import pandas as pd

def tanimoto_score(shelves_1, shelves_2):
    # number of tags in shelves_1 that also appear in shelves_2
    intersection_tanimoto = len([x for x in shelves_1 if x in shelves_2])
    # denominator: sum of the lengths of both tag lists
    union_tanimoto = len(shelves_1) + len(shelves_2)
    return intersection_tanimoto / union_tanimoto

similarity_df = pd.DataFrame()
# for each pair of rows, compare the tags
for index_1, row_1 in data.iterrows():
    for index_2, row_2 in data.iterrows():
        similarity_score = tanimoto_score(data.at[index_1, 'shelves'], data.at[index_2, 'shelves'])
        similarity_df.loc[data.at[index_1, 'title'], data.at[index_2, 'title']] = similarity_score

It returns a pairwise similarity_df, but the results look wrong, because the similarity of an item with itself is only 0.5:

                       Krondor: The Betrayal  The Dark Is Rising  The Black Unicorn  I, Robot  Four Blondes  Love, Stargirl  The Tenth Circle  Vanishing Acts     Aztec   Marlfox
Krondor: The Betrayal               0.500000            0.119318           0.244318  0.164773      0.090909        0.068182          0.113636        0.096591  0.096591  0.136364
The Dark Is Rising                  0.119318            0.500000           0.164773  0.159091      0.045455        0.232955          0.102273        0.085227  0.073864  0.238636
...
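
A likely cause of the 0.5 on the diagonal is the denominator. The Jaccard / Tanimoto index is |A ∩ B| / |A ∪ B|, but len(shelves_1) + len(shelves_2) counts the shared tags twice, so a list of n distinct tags compared with itself gives n / 2n = 0.5. The sketch below is one possible correction that works on sets. The tag lists a and b are made up for illustration, and returning 0.0 when both lists are empty is an assumed convention, not something stated in the question.

```python
def tanimoto_score(shelves_1, shelves_2):
    """Jaccard / Tanimoto index: |A & B| / |A | B| on the sets of tags."""
    set_1, set_2 = set(shelves_1), set(shelves_2)
    union = set_1 | set_2
    if not union:
        # both tag lists are empty; 0.0 is an assumed convention
        return 0.0
    return len(set_1 & set_2) / len(union)

# Hypothetical tag lists, for illustration only
a = ["fantasy", "series", "magic"]
b = ["fantasy", "series", "ya", "childrens"]

self_score = tanimoto_score(a, a)    # 3 shared tags / 3 tags in the union
cross_score = tanimoto_score(a, b)   # 2 shared tags / 5 tags in the union
```

Converting to sets also removes duplicate tags within one list, which the list-based version would count more than once.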

I would like something like this:

                      Krondor: The Betrayal   The Dark Is Rising ...
Krondor: The Betrayal                     1                 0.75
The Dark Is Rising                     0.75                    1
...

So: how can I create a pairwise similarity matrix from the text tags in DataFrame rows?
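
One way the matrix could be built, as a minimal sketch rather than a definitive answer: compute all scores into a nested list first, then construct the DataFrame once with the titles as both index and columns. This avoids growing similarity_df cell by cell with .loc. The small data frame and the jaccard helper below are hypothetical stand-ins. The sketch assumes that shelves holds real Python lists (not their string representation) and that titles are unique.

```python
import pandas as pd

def jaccard(tags_1, tags_2):
    """Jaccard index on two tag lists; 0.0 for two empty lists (assumed convention)."""
    s1, s2 = set(tags_1), set(tags_2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Hypothetical stand-in for the `data` frame in the question
data = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "shelves": [
        ["fantasy", "series", "magic"],
        ["fantasy", "series", "ya", "childrens"],
        ["chick-lit", "contemporary"],
    ],
})

titles = data["title"].tolist()
shelves = data["shelves"].tolist()

# Score every row against every row, then build the frame in one step
scores = [[jaccard(s1, s2) for s2 in shelves] for s1 in shelves]
similarity_df = pd.DataFrame(scores, index=titles, columns=titles)

print(similarity_df.round(2))
```

Because the index is symmetric, only the upper triangle strictly needs to be computed. For ten rows the full double loop is simpler and the cost is negligible.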

0 answers:

No answers yet