这是Scala中Kendall tau distance的正确实现
def distance[A : Ordering](s: Seq[A], t: Seq[A]): Int = {
assert(s.size == t.size, "Both sequences should be of the same length")
s.combinations(2).zip(t.combinations(2)).count {
case (Seq(s1, s2), Seq(t1, t2)) =>
(s1 > s2 && t1 < t2) || (s1 < s2 && t1 > t2)
}
}
问题是我没有足够的数据来测试算法,只有维基百科的几个例子。而且我不太了解算法以生成我自己的测试数据。大多数来源都是Kendall tau rank correlation coefficient,这是相关但不同的动物。也许我可以以某种方式从另一个中获得一个?
现在让我们说性能并不重要。
更新
所以,现在我有三个Kendall tau距离算法的实现。其中两个(distance1
和distance3
)给出相同的结果(见下文)。那么,哪一个是正确的?
import scala.math.Ordering.Implicits._
val permutations = Random.shuffle((0 until 5).permutations).take(100)
println("s\tt\tDist1\tDist2\tDist3")
permutations.sliding(2).foreach { case Seq(s, t) =>
println(s.mkString(",")+"\t"+t.mkString(",")+"\t"+distance1(s, t)+"\t"+distance2(s, t)+
"\t"+distance3(s, t))
}
def distance1[A : Ordering](s: Seq[A], t: Seq[A]): Int = {
assert(s.size == t.size, "Both sequences should be of the same length")
s.combinations(2).zip(t.combinations(2)).count { case (Seq(s1, s2), Seq(t1, t2)) =>
(s1 > s2 && t1 < t2) || (s1 < s2 && t1 > t2)
}
}
def distance2[A](a: Seq[A], b: Seq[A]): Int = {
val aMap = a.zipWithIndex.toMap // map of a items to their ranks
val bMap = b.zipWithIndex.toMap // map of b items to their ranks
a.combinations(2).count{case Seq(i, j) =>
val a1 = aMap.get(i).get // rank of i in A
val a2 = aMap.get(j).get // rank of j in A
val b1 = bMap.get(i).get // rank of i in B
val b2 = bMap.get(j).get // rank of j in B
a1.compare(a2) != b1.compare(b2)
}
}
def distance3(τ_1: Seq[Int], τ_2: Seq[Int]) =
(0 until τ_1.size).map { i =>
(i+1 until τ_2.size).count { j =>
(τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))
}
}.sum
以下是一些结果:
s t Dist1 Dist2 Dist3
3,0,4,2,1 1,4,3,0,2 6 6 6
1,4,3,0,2 0,4,1,2,3 3 5 3
0,4,1,2,3 4,0,1,3,2 8 2 8
4,0,1,3,2 1,2,0,4,3 4 6 4
1,2,0,4,3 2,3,1,4,0 3 5 3
2,3,1,4,0 1,0,3,2,4 8 6 8
1,0,3,2,4 1,3,2,4,0 7 3 7
1,3,2,4,0 4,3,0,1,2 6 6 6
4,3,0,1,2 1,0,2,4,3 7 7 7
1,0,2,4,3 3,4,1,2,0 8 8 8
3,4,1,2,0 1,4,2,0,3 5 5 5
1,4,2,0,3 1,0,3,4,2 8 4 8
答案 0 :(得分:1)
我不认为这是对的。这里有一些快速编写的代码,强调你所比较的是序列中项目的等级(你真的不想在你的代码中保留那些get(n).get
调用)。我也使用compare
,我认为这是有道理的:
def tauDistance[A](a: Seq[A], b: Seq[A]) = {
val aMap = a.zipWithIndex.toMap // map of a items to their ranks
val bMap = b.zipWithIndex.toMap // map of b items to their ranks
a.combinations(2).count{case Seq(i, j) =>
val a1 = aMap.get(i).get // rank of i in A
val a2 = aMap.get(j).get // rank of j in A
val b1 = bMap.get(i).get // rank of i in B
val b2 = bMap.get(j).get // rank of j in B
a1.compare(a2) != b1.compare(b2)
}
}
答案 1 :(得分:1)
所以,维基百科在这些元素的等级上定义了K:
K(τ_1,τ_2) = |{(i,j): i < j, (τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))}|
我们可以直接在Scala中实现这一点,记住输入是排名序列,而不是项目本身:
def K(τ_1: Seq[Int], τ_2: Seq[Int]) =
(0 until τ_1.size).map{i =>
(i+1 until τ_2.size).count{j =>
(τ_1(i) < τ_1(j) && τ_2(i) > τ_2(j)) || (τ_1(i) > τ_1(j) && τ_2(i) < τ_2(j))
}
}.sum
这实际上比上面的tauDistance
函数更有优势,因为该函数假设所有项都是唯一的(如果序列有重复项,则会失败),而这个函数直接在队列上工作。
有时使用组合函数是 hard ,而且通常只有单元测试才能通过。