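Finding similar products by attributes

Time: 2017-03-09 03:01:38

Tags: algorithm

I have heard that K nearest neighbours finds the category an item belongs to, but I am wondering whether there is an algorithm that returns a list of items based on their attributes.

For example, given a movie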

[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"]

the result would return

[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"]

rather than

[director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"]

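because the former result matches on more attributes, "Hill Thompson" and "Will Smith", whereas the latter has only one match, "Hill Thompson".

Is cosine similarity a good way to solve this problem?

1 answer:

Answer 0: (score: 2)

Is cosine similarity a good way to solve this problem?

Yes. It would work well, but use it together with TF-IDF.

The most commonly used similarity measures are Jaccard Similarity and Cosine Similarity. In the scenario given, you can use Jaccard Similarity directly and get the desired result.

Say,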

A = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}

The Jaccard Similarity would be:

J(A,B) = 2 / 4 = 0.5
J(A,C) = 1 / 5 = 0.2
J(C,D) = 1 / 5 = 0.2

As J(A,B) > J(A,C), the K nearest neighbour method will pick B first and then C. In this case Jaccard Similarity captures the intuition well.
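The ranking described above can be sketched in a few lines of Python. This is only an illustration under one assumption: each movie is modelled as the set of its attribute values, and the helper names `jaccard` and `nearest` are made up for this sketch.

```python
# Assumption for this sketch: a movie is the set of its attribute values
# (attribute names such as "director" are ignored).
movies = {
    "A": {"Hill Thompson", "Will Smith", "Dec 1776"},
    "B": {"Hill Thompson", "Will Smith", "Jan 1996"},
    "C": {"Hill Thompson", "Poop Jenkins", "Sept 1822"},
    "D": {"Foo Bar", "Poop Jenkins", "Some date"},
}

def jaccard(x, y):
    """Size of the intersection divided by the size of the union."""
    return len(x & y) / len(x | y)

def nearest(query, k):
    """Return the names of the k movies most similar to `query`, best first."""
    others = [name for name in movies if name != query]
    others.sort(key=lambda name: jaccard(movies[query], movies[name]), reverse=True)
    return others[:k]

print(jaccard(movies["A"], movies["B"]))  # 0.5
print(jaccard(movies["A"], movies["C"]))  # 0.2
print(nearest("A", 2))                    # ['B', 'C']
```

Sorting every other movie by similarity is fine for a handful of items; a larger catalogue would likely need an index rather than a full scan.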

To demonstrate a case where Cosine Similarity does better, add one more attribute:

A = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {place filmed : "A", director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {place filmed : "A", director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}


J(A,B) = 3 / 5 = 0.6
J(A,C) = 2 / 6 = 0.33
J(C,D) = 2 / 6 = 0.33

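Note that J(C,A) = J(C,D).

This is the wrong intuition.

Why? Because place A seems to be a common place for filming movies. Just because two movies were filmed in the same place, we cannot conclude that they are similar. Ideally it should be Sim(C,D) > Sim(C,A). Situations like this arise where Jaccard Similarity fails to capture the intuition and Cosine Similarity with TF-IDF outperforms Jaccard Similarity.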

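The problem with Cosine Similarity in this case is the implementation. Cosine Similarity is defined on vectors, and when the data is not numeric it is hard to create vectors.

One way to create the vectors is a boolean vector over [A, HillThompson, FooBar, WillSmith, Poop Jenkins, Dec 1776, Jan 1996, Sept 1822, Some date], with one position per distinct attribute value.

For example, the vectors would be: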

A = {1,1,0,1,0,1,0,0,0}
C = {1,1,0,0,1,0,0,1,0}
D = {1,0,1,0,1,0,0,0,1}
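As a rough sketch of how such boolean vectors could be built and compared, the snippet below uses the vocabulary order given in the text. The names `to_boolean_vector` and `cosine` are hypothetical helpers introduced only for this illustration.

```python
import math

# Vocabulary order taken from the text: one position per distinct value.
vocab = ["A", "HillThompson", "FooBar", "WillSmith", "Poop Jenkins",
         "Dec 1776", "Jan 1996", "Sept 1822", "Some date"]

movies = {
    "A": {"A", "HillThompson", "WillSmith", "Dec 1776"},
    "C": {"A", "HillThompson", "Poop Jenkins", "Sept 1822"},
    "D": {"A", "FooBar", "Poop Jenkins", "Some date"},
}

def to_boolean_vector(values):
    """1 where the movie has the vocabulary value, 0 elsewhere."""
    return [1 if term in values else 0 for term in vocab]

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = {name: to_boolean_vector(values) for name, values in movies.items()}
print(vectors["A"])  # [1, 1, 0, 1, 0, 1, 0, 0, 0]
print(cosine(vectors["C"], vectors["A"]), cosine(vectors["C"], vectors["D"]))
```

C has exactly two non-zero positions in common with each of A and D, so the two printed cosines are equal on unweighted boolean vectors.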

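Note that Cosine Similarity will still capture the wrong intuition if TF-IDF is not applied. On the boolean vectors, C has two non-zero positions in common with A and two in common with D, and every vector has norm 2:

Cosine(C,A) = 2 / ( 2 * 2 ) = 0.5
Cosine(C,D) = 2 / ( 2 * 2 ) = 0.5

If TF-IDF is applied, the IDF values over the four movies A, B, C and D (base-10 logarithm) are:

IDF(A) = log( 1 + 4 / 4) = 0.30
IDF(HillThompson) = log( 1 + 4 / 3) = 0.37
IDF(FooBar) = log( 1 + 4 / 1) = 0.70
IDF(WillSmith) = log( 1 + 4 / 2) = 0.48
IDF(Poop Jenkins) = log( 1 + 4 / 2) = 0.48
IDF(Dec 1776) = log( 1 + 4 / 1) = 0.70
IDF(Jan 1996) = log( 1 + 4 / 1) = 0.70
IDF(Sept 1822) = log( 1 + 4 / 1) = 0.70
IDF(Some date) = log( 1 + 4 / 1) = 0.70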
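The IDF values can be reproduced with a short script. This is a minimal sketch assuming the formula log10(1 + N / df) that the numbers above imply, where N is the number of movies and df is the number of movies containing the value; the variable names are illustrative.

```python
import math

movies = {
    "A": ["A", "HillThompson", "WillSmith", "Dec 1776"],
    "B": ["A", "HillThompson", "WillSmith", "Jan 1996"],
    "C": ["A", "HillThompson", "Poop Jenkins", "Sept 1822"],
    "D": ["A", "FooBar", "Poop Jenkins", "Some date"],
}

n_docs = len(movies)

# Document frequency: in how many movies each value occurs.
df = {}
for values in movies.values():
    for term in set(values):
        df[term] = df.get(term, 0) + 1

# IDF variant assumed from the figures in the text: log10(1 + N / df).
idf = {term: math.log10(1 + n_docs / count) for term, count in df.items()}

for term in ["A", "HillThompson", "WillSmith", "FooBar"]:
    print(term, round(idf[term], 2))
```

The value shared by all four movies (place "A") receives the lowest weight, which is what lets TF-IDF discount it later.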

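Now calculating TF-IDF (each movie has 4 attribute values, each occurring once, so every term frequency is 1/4):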

A = {0.30/4, 0.37/4, 0,      0.48/4, 0,       0.70/4, 0, 0,      0}
C = {0.30/4, 0.37/4, 0,      0,      0.48/4,  0,      0, 0.70/4, 0}
D = {0.30/4,      0, 0.70/4, 0,      0.48/4,  0,      0, 0,      0.70/4}

The TF-IDF vectors are now:

A = {0.075,  0.0925, 0,      0.12,   0,       0.175,  0, 0,     0 }
C = {0.075,  0.0925, 0,      0,      0.12,    0,      0, 0.175, 0 }
D = {0.075,  0,      0.175,  0,      0.12,    0,      0, 0,     0.175 }

|A| = 0.2433
|C| = 0.2433
|D| = 0.2850
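The weighting and the vector lengths can be checked with the sketch below. It assumes the rounded IDF values from the text, listed in vocabulary order, and a term frequency of 1 divided by the number of values a movie has; `tfidf` and `norm` are hypothetical helper names.

```python
import math

# Rounded IDF values from the text, in vocabulary order:
# A, HillThompson, FooBar, WillSmith, Poop Jenkins,
# Dec 1776, Jan 1996, Sept 1822, Some date
idf = [0.30, 0.37, 0.70, 0.48, 0.48, 0.70, 0.70, 0.70, 0.70]

boolean = {
    "A": [1, 1, 0, 1, 0, 1, 0, 0, 0],
    "C": [1, 1, 0, 0, 1, 0, 0, 1, 0],
    "D": [1, 0, 1, 0, 1, 0, 0, 0, 1],
}

def tfidf(bits):
    """Weight each present value by TF * IDF, with TF = 1 / (number of values)."""
    tf = 1 / sum(bits)
    return [bit * tf * weight for bit, weight in zip(bits, idf)]

def norm(v):
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in v))

vectors = {name: tfidf(bits) for name, bits in boolean.items()}
for name, v in vectors.items():
    print(name, [round(x, 4) for x in v], round(norm(v), 4))
```

Small differences in the last decimal place compared with the hand calculation are possible, depending on whether intermediate values are rounded or truncated.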

Calculating the cosine similarity:

Cosine(A,C) = 0.01418 / ( 0.2433 * 0.2433 ) = 0.2395
Cosine(C,D) = 0.0200  / ( 0.2433 * 0.2850 ) = 0.2884
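The same comparison can be run in code. The sketch below copies the TF-IDF vectors from the text and uses a plain-Python `cosine` helper, which is an illustrative name rather than a library function.

```python
import math

# TF-IDF vectors copied from the text.
A = [0.075, 0.0925, 0, 0.12, 0, 0.175, 0, 0, 0]
C = [0.075, 0.0925, 0, 0, 0.12, 0, 0, 0.175, 0]
D = [0.075, 0, 0.175, 0, 0.12, 0, 0, 0, 0.175]

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(C, A), 3))
print(round(cosine(C, D), 3))
print(cosine(C, D) > cosine(C, A))  # True
```

With these vectors the two similarities come out at roughly 0.24 and 0.29, so the shared actor now counts for more than the shared director.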

Thus, Cosine Similarity with TF-IDF captures the intuition that C and D are more similar than C and A. It therefore outperforms Jaccard Similarity here.

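Note that I have shown the calculations because I did them on a PC rather than with a scientific calculator, so there may be errors. If you find one, please correct it.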