Similarity measure/matrix for data (recommender system) - Python

Time: 2016-11-13 04:35:19

Tags: python numpy matrix machine-learning similarity

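I'm new to machine learning and am trying to work through the following problem. The input is two arrays of descriptions with the same length, and the output is an array of similarity scores: the first string of the first array compared with the first string of the second array, and so on.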

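Each item in the arrays (numpy arrays) is a description string. Could you write a function that measures how similar two strings are by counting how many identical, co-occurring word IDs they share, and assigns a score accordingly (one possible weighting could be based on the co-occurrence frequency relative to the sum of the frequencies of the individual word IDs)? The function would then be applied to the two arrays to get an array of scores. If there are other approaches I should consider, please let me know as well. Thanks!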

Data:

array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
       '18/19/20/21/22/23/24/25',
       '26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
       '5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
       '57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
       '70/71/72/73/74/75/76/77',
       '78/79/80/81/82/83/84/85/86/87/88/89/90/91',
       '33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
       '104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
       '117/118/119/120/121/12/122/123/124/125',
       '14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
       '137/138/139/140/141/142',
       '143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
       '160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
       '172/173/174/175/176/177/73/178/104/179/180/179/181/173',
       '182/144/183/179/73',
       '184/163/68/185/163/8/186/187/188/54/189/190/191',
       '181/192/0/1/193/194/22/195',
       '113/196/197/198/68/199/68/200/201/202/203/201',
       '204/205/206/207/208/209/68/200',
       '163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
       '220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
       '214/228/5/6/5/215/228/228/229',
       '230/231/232/233/122/215/128/214/128/234/234',
       '235/236/191/237/92/93/238/239',
       '13/14/44/44/240/241/242/49/54/243/244/245/55/56',
       '220/21/246/38/247/201/248/73/160/249/250/203/201',
       '214/49/251/252/253/254/255/256/257/258'], 
      dtype='|S127')

array(['151/308/309/310/311/215/312/160/313/214/49/12',
       '314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
       '324/325/62/220/326/194/327/328/218/76/241/329',
       '330/29/22/103/331/314/68/80/49',
       '78/332/85/96/97/227/333/4/334/188',
       '57/335/336/34/187/337/21/338/212/213/339/340',
       '341/342/167/343/8/254/154/61/344',
       '2/292/345/346/42/347/348/348/100/349/202/161/263',
       '283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
       '137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
       '23/363/10/364/289/68/123/354/355',
       '188/28/365/149/366/98/367/368/369/370/371/372/368',
       '373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
       '179/376/377/378/179/87/88/379/20',
       '380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
       '389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
       '397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
       '77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
       '129/295/90/259/38/39/119/414/415/416/14/318/417/418',
       '419/420/421/422/423/23/424/241/421/425/58',
       '426/244/427/5/428/49/76/429/430/431',
       '257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
       '439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
       '385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'], 
      dtype='|S127')

1 answer:

Answer 0 (score: 1)

The following code provides the required functionality in Python 3.x:

import numpy as np
from collections import Counter

def jaccardSim(c1, c2):
    # Multiset Jaccard on Counters: | keeps the max count per word ID,
    # & keeps the min count per word ID
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    # Decode each byte string and count the occurrences of its word IDs
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]

    # Assuming both 'b1' and 'b2' have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]

    return cSim # List of similarity scores

This implementation uses the Jaccard similarity score. You can choose another score if you prefer, such as cosine or Hamming.
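As an illustration of swapping in another score, here is a minimal sketch of cosine similarity over the same kind of word-ID counts. The name cosine_sim is hypothetical and not part of the answer's code, and the inputs are assumed to be already-decoded strings rather than the byte strings in the data above.

```python
from collections import Counter
from math import sqrt

def cosine_sim(c1, c2):
    """Cosine similarity between two Counters of word-ID frequencies."""
    # Dot product over the word IDs of c1; a Counter returns 0 for missing keys
    dot = sum(count * c2[word] for word, count in c1.items())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty description has no direction to compare
    return dot / (norm1 * norm2)

# Hypothetical toy inputs: two descriptions sharing the IDs 92 and 93
x = Counter("33/34/92/93".split("/"))
y = Counter("92/93/68/80".split("/"))
print(cosine_sim(x, y))  # dot = 2, both norms = 2, so 2 / (2 * 2) = 0.5
```

To use it in byteArraySim, the call to jaccardSim would be replaced by cosine_sim; the rest of the function could stay as it is.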

Assuming the arrays are stored in the variables a and b, the call byteArraySim(a, b) outputs the following similarity scores:

[0.0,
 0.0,
 0.0,
 0.038461538461538464,
 0.0,
 0.041666666666666664,
 0.0,
 0.0,
 0.0,
 0.08,
 0.0,
 0.05555555555555555,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.058823529411764705,
 0.0,
 0.0,
 0.0,
 0.05555555555555555,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]
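The weighting suggested in the question (co-occurrence frequency relative to the sum of the frequencies of the individual word IDs) can be read as a Dice-style score. That reading is an assumption, and the sketch below is one possible interpretation rather than the answer's method. The name dice_sim is hypothetical, and plain strings are used instead of byte strings to keep the example self-contained.

```python
from collections import Counter

def dice_sim(s1, s2, sep="/"):
    """Dice-style score: 2 * shared count / (tokens in s1 + tokens in s2)."""
    c1, c2 = Counter(s1.split(sep)), Counter(s2.split(sep))
    shared = sum((c1 & c2).values())             # min count per common word ID
    total = sum(c1.values()) + sum(c2.values())  # all tokens in both strings
    return 2 * shared / total

# The pair at index 3 of the data above (Jaccard score 1/26 in the output list)
first = "26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41"
second = "330/29/22/103/331/314/68/80/49"
print(dice_sim(first, second))  # one shared ID (29) over 18 + 9 tokens: 2 * 1 / 27
```

For this pair the two scores rank similarly; the Dice-style value is larger because its denominator counts each string's tokens separately instead of taking the multiset union.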