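I've heard that K nearest neighbours finds the category an item belongs to, but I'm wondering whether there is an algorithm that returns a list of items based on their attributes.
For example, given a movie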
[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"]
the result would return
[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"]
rather than
[director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"]
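because the former result matches on more attributes, "Hill Thompson" and "Will Smith", while the latter has only one match, Hill Thompson.
Is cosine similarity a good way to solve this problem?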
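Answer 0 (score: 2)
Is cosine similarity a good way to solve this problem?
Yes, it would work well, but use it together with TF-IDF.
The most commonly used similarity measures are Jaccard Similarity and Cosine similarity.
In the scenario you describe, you can use Jaccard Similarity directly and get the desired result.
Say,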
A = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}
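Treating each movie as the set of its attribute values, Jaccard Similarity is the size of the intersection divided by the size of the union. The values would be: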
J(A,B) = 2 / 4 = 0.5
J(A,C) = 1 / 5 = 0.2
J(C,D) = 1 / 5 = 0.2
Since J(A,B) > J(A,C), the K nearest neighbour approach picks B first and then C.
In this case, Jaccard similarity captures the intuition well.
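The ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `movies`, `jaccard` and `nearest` are made up for this example, and each movie is treated as a set of (attribute, value) pairs.

```python
# Hypothetical data layout: one dict of attributes per movie.
movies = {
    "A": {"director": "Hill Thompson", "starring-actor": "Will Smith", "release-date": "Dec 1776"},
    "B": {"director": "Hill Thompson", "starring-actor": "Will Smith", "release-date": "Jan 1996"},
    "C": {"director": "Hill Thompson", "starring-actor": "Poop Jenkins", "release-date": "Sept 1822"},
    "D": {"director": "Foo Bar", "starring-actor": "Poop Jenkins", "release-date": "Some date"},
}


def jaccard(a: dict, b: dict) -> float:
    """Jaccard similarity: |intersection| / |union| of (attribute, value) pairs."""
    set_a, set_b = set(a.items()), set(b.items())
    return len(set_a & set_b) / len(set_a | set_b)


def nearest(query: str, k: int) -> list:
    """Return the names of the k movies most similar to `query`, best first."""
    others = [name for name in movies if name != query]
    ranked = sorted(others, key=lambda name: jaccard(movies[query], movies[name]), reverse=True)
    return ranked[:k]


print(jaccard(movies["A"], movies["B"]))  # 0.5
print(jaccard(movies["A"], movies["C"]))  # 0.2
print(nearest("A", 2))                    # ['B', 'C']
```

Sorting every candidate is fine for a handful of items; for a large catalogue you would probably want an index rather than a full scan.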
To show where Cosine Similarity does better, add one more attribute:
A = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {place filmed : "A", director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {place filmed : "A", director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}
J(A,B) = 3 / 5 = 0.6
J(A,C) = 2 / 6 = 0.33
J(C,D) = 2 / 6 = 0.33
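Note that J(C,A) = J(C,D), which goes against intuition.
Why?
Because place A appears to be a common place to film movies. Just because two movies were filmed in the same place, we cannot conclude that they are similar. Ideally it should be Sim(C,D) > Sim(C,A). This is the kind of case where Jaccard Similarity fails to capture the intuition and Cosine similarity with TF-IDF does better than Jaccard Similarity.
The problem with Cosine similarity in this case is the implementation. Cosine similarity is defined on vectors, and when the data is not numeric it is hard to create vectors.
One way to create the vectors is to build a boolean vector over vector = [A,HillThompson,FooBar,WillSmith,Poop Jenkins,Dec 1776,Jan 1996, Sept 1822, Some date], with a 1 where the movie has that value and a 0 where it does not.
For example, the vectors would be formed as: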
A = {1,1,0,1,0,1,0,0,0}
C = {1,1,0,0,1,0,0,1,0}
D = {1,0,1,0,1,0,0,0,1}
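As a rough sketch, the boolean encoding could be built like this in Python. The names `vocabulary`, `to_boolean_vector` and `cosine` are invented for illustration, and the vocabulary order simply follows the list given above. The last line also computes plain cosine similarity on the boolean vectors for comparison.

```python
import math

# Every distinct attribute value, in the order used for the vector positions.
vocabulary = ["A", "Hill Thompson", "Foo Bar", "Will Smith", "Poop Jenkins",
              "Dec 1776", "Jan 1996", "Sept 1822", "Some date"]

# Attribute values of each movie (place filmed, director, actor, release date).
movies = {
    "A": ["A", "Hill Thompson", "Will Smith", "Dec 1776"],
    "C": ["A", "Hill Thompson", "Poop Jenkins", "Sept 1822"],
    "D": ["A", "Foo Bar", "Poop Jenkins", "Some date"],
}


def to_boolean_vector(values):
    """1 where the movie has the vocabulary term, 0 otherwise."""
    present = set(values)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


vectors = {name: to_boolean_vector(values) for name, values in movies.items()}
print(vectors["A"])  # [1, 1, 0, 1, 0, 1, 0, 0, 0]
print(cosine(vectors["C"], vectors["A"]), cosine(vectors["C"], vectors["D"]))  # 0.5 0.5
```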
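Note that without TF-IDF, Cosine Similarity still captures the wrong intuition, just as Jaccard Similarity does:
Cosine(C,A) = 2 / ( 2 * 2 ) = 0.5
Cosine(C,D) = 2 / ( 2 * 2 ) = 0.5
The IDF values (log base 10, over the 4 movies) would be:
IDF(A) = log( 1 + 4 / 4) = 0.30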
IDF(HillThompson) = log( 1 + 4 / 3) = 0.37
IDF(FooBar) = log( 1 + 4 / 1) = 0.70
IDF(WillSmith) = log( 1 + 4 / 2) = 0.48
IDF(Poop Jenkins) = log( 1 + 4 / 2) = 0.48
IDF(Dec 1776) = log( 1 + 4 / 1) = 0.70
IDF(Jan 1996) = log( 1 + 4 / 1) = 0.70
IDF(Sept 1822) = log( 1 + 4 / 1) = 0.70
IDF(Some date) = log( 1 + 4 / 1) = 0.70
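The IDF figures above can be reproduced with a short Python sketch. It assumes the formula used in the list, log10(1 + N / df), where N is the number of movies and df is the number of movies containing the term; the names `movies` and `idf` are illustrative only.

```python
import math

# Attribute values of the four movies from the example.
movies = {
    "A": ["A", "Hill Thompson", "Will Smith", "Dec 1776"],
    "B": ["A", "Hill Thompson", "Will Smith", "Jan 1996"],
    "C": ["A", "Hill Thompson", "Poop Jenkins", "Sept 1822"],
    "D": ["A", "Foo Bar", "Poop Jenkins", "Some date"],
}


def idf(term: str) -> float:
    """Inverse document frequency: log10(1 + N / df)."""
    df = sum(term in values for values in movies.values())  # movies containing the term
    return math.log10(1 + len(movies) / df)


for term in ["A", "Hill Thompson", "Foo Bar", "Will Smith"]:
    print(term, round(idf(term), 2))
```

Rare values such as "Foo Bar" get a higher weight than the filming place "A", which every movie shares.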
Now compute TF-IDF. Each movie has four terms, so every term has TF = 1/4:
A = {0.30/4, 0.37/4, 0, 0.48/4, 0, 0.70/4, 0, 0, 0}
C = {0.30/4, 0.37/4, 0, 0, 0.48/4, 0, 0, 0.70/4, 0}
D = {0.30/4, 0, 0.70/4, 0, 0.48/4, 0, 0, 0, 0.70/4}
The TF-IDF vectors now are:
A = {0.075, 0.0925, 0, 0.12, 0, 0.175, 0, 0, 0 }
C = {0.075, 0.0925, 0, 0, 0.12, 0, 0, 0.175, 0 }
D = {0.075, 0, 0.175, 0, 0.12, 0, 0, 0, 0.175 }
|A| = 0.2433
|C| = 0.2433
|D| = 0.2850
Calculating the cosine similarity:
Cosine(A,C) = 0.01418 / ( 0.2433 * 0.2433 ) = 0.2395
Cosine(C,D) = 0.0200 / ( 0.2433 * 0.2850 ) = 0.2884
Thus, Cosine similarity with TF-IDF captures the intuition that C is more similar to D than to A, so it is better than Jaccard similarity here.
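For anyone who wants to check the arithmetic, here is a minimal end-to-end sketch in Python. It assumes TF = 1/4 for every term and IDF = log10(1 + N / df), as in the hand calculation; all function and variable names are made up for this example. It works at full floating-point precision, so the results can differ slightly from the hand-rounded figures.

```python
import math

movies = {
    "A": ["A", "Hill Thompson", "Will Smith", "Dec 1776"],
    "B": ["A", "Hill Thompson", "Will Smith", "Jan 1996"],
    "C": ["A", "Hill Thompson", "Poop Jenkins", "Sept 1822"],
    "D": ["A", "Foo Bar", "Poop Jenkins", "Some date"],
}

# Sorted list of every distinct value; fixes the vector positions.
vocabulary = sorted({value for values in movies.values() for value in values})


def idf(term):
    """log10(1 + N / df), with df = number of movies containing the term."""
    df = sum(term in values for values in movies.values())
    return math.log10(1 + len(movies) / df)


def tfidf_vector(values):
    """TF is 1 / (number of terms in the movie) for present terms, else 0."""
    tf = 1 / len(values)
    return [tf * idf(term) if term in values else 0.0 for term in vocabulary]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


vectors = {name: tfidf_vector(values) for name, values in movies.items()}
cos_ca = cosine(vectors["C"], vectors["A"])
cos_cd = cosine(vectors["C"], vectors["D"])
print(cos_ca, cos_cd)   # roughly 0.24 and 0.29
print(cos_cd > cos_ca)  # True: C comes out closer to D than to A
```

In practice a library vectorizer would usually replace the hand-rolled functions, though its exact IDF formula and normalisation may differ from the one assumed here.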
Note that I did these calculations by hand on a PC rather than with a scientific calculator, so there may be mistakes. If you find one, please correct it.