The following code computes the Euclidean distance between two Lists in a dataset:
val user1 = List("a", "1", "3", "2", "6", "9") //> user1 : List[String] = List(a, 1, 3, 2, 6, 9)
val user2 = List("b", "1", "2", "2", "5", "9") //> user2 : List[String] = List(b, 1, 2, 2, 5, 9)
val all = List(user1, user2) //> all : List[List[String]] = List(List(a, 1, 3, 2, 6, 9), List(b, 1, 2, 2, 5, 9))
def euclDistance(userA: List[String], userB: List[String]) = {
  println("comparing " + userA(0) + " and " + userB(0))
  // pair the two lists element by element
  val zipped = userA.zip(userB)
  // drop the head pair (the user names), keeping only the numeric features
  val lastElements = zipped match {
    case (h :: t) => t
  }
  val subElements = lastElements.map(m => (m._1.toDouble - m._2.toDouble) * (m._1.toDouble - m._2.toDouble))
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)
  sqRoot
} //> euclDistance: (userA: List[String], userB: List[String])Double
all.map(m => (all.map(m2 => euclDistance(m,m2))))
//> comparing a and a
//| comparing a and b
//| comparing b and a
//| comparing b and b
//| res0: List[List[Double]] = List(List(0.0, 1.4142135623730951), List(1.4142135623730951, 0.0))
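To make the intermediate steps concrete, here is what the zip produces and what the pattern match removes (the head pair holding the user names):

user1 zip user2        // List((a,b), (1,1), (3,2), (2,2), (6,5), (9,9))
(user1 zip user2).tail // same result as the match above: drops the (a,b) name pair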
But how do I convert this into a parallel Spark Scala operation?
When I print the contents of distAll:
scala> distAll.foreach(p => p.foreach(println))
14/10/24 23:09:42 INFO SparkContext: Starting job: foreach at <console>:21
14/10/24 23:09:42 INFO DAGScheduler: Got job 2 (foreach at <console>:21) with 4 output partitions (allowLocal=false)
14/10/24 23:09:42 INFO DAGScheduler: Final stage: Stage 2(foreach at <console>:21)
14/10/24 23:09:42 INFO DAGScheduler: Parents of final stage: List()
14/10/24 23:09:42 INFO DAGScheduler: Missing parents: List()
14/10/24 23:09:42 INFO DAGScheduler: Submitting Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18), which has no missing parents
14/10/24 23:09:42 INFO MemoryStore: ensureFreeSpace(1152) called with curMem=1152, maxMem=278019440
14/10/24 23:09:42 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1152.0 B, free 265.1 MB)
14/10/24 23:09:42 INFO DAGScheduler: Submitting 4 missing tasks from Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18)
14/10/24 23:09:42 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks
14/10/24 23:09:42 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, localhost, PROCESS_LOCAL, 1419 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, localhost, PROCESS_LOCAL, 1420 bytes)
14/10/24 23:09:42 INFO Executor: Running task 0.0 in stage 2.0 (TID 8)
14/10/24 23:09:42 INFO Executor: Running task 1.0 in stage 2.0 (TID 9)
14/10/24 23:09:42 INFO Executor: Running task 3.0 in stage 2.0 (TID 11)
a14/10/24 23:09:42 INFO Executor: Running task 2.0 in stage 2.0 (TID 10)
14/10/24 23:09:42 INFO Executor: Finished task 2.0 in stage 2.0 (TID 10). 585 bytes result sent to driver
114/10/24 23:09:42 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 10) in 16 ms on localhost (1/4)
314/10/24 23:09:42 INFO Executor: Finished task 0.0 in stage 2.0 (TID 8). 585 bytes result sent to driver
214/10/24 23:09:42 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 8) in 16 ms on localhost (2/4)
6
9
14/10/24 23:09:42 INFO Executor: Finished task 1.0 in stage 2.0 (TID 9). 585 bytes result sent to driver
b14/10/24 23:09:42 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 9) in 16 ms on localhost (3/4)
1
2
2
5
9
14/10/24 23:09:42 INFO Executor: Finished task 3.0 in stage 2.0 (TID 11). 585 bytes result sent to driver
14/10/24 23:09:42 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 11) in 31 ms on localhost (4/4)
14/10/24 23:09:42 INFO DAGScheduler: Stage 2 (foreach at <console>:21) finished in 0.031 s
14/10/24 23:09:42 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/10/24 23:09:42 INFO SparkContext: Job finished: foreach at <console>:21, took 0.037641021 s
Why are the distances not populated?
Update:
To make Eugene Zhulenev's answer below work for me, I needed to make the following changes: extend UserObject with java.io.Serializable, and also rename the object User to UserObject.
Here is the updated code:
val user1 = List("a", "1", "3", "2", "6", "9") //> user1 : List[String] = List(a, 1, 3, 2, 6, 9)
val user2 = List("b", "1", "2", "2", "5", "9") //> user2 : List[String] = List(b, 1, 2, 2, 5, 9)
case class User(name: String, features: Vector[Double])

object UserObject extends java.io.Serializable {
  def fromList(list: List[String]): User = list match {
    case h :: tail => User(h, tail.map(_.toDouble).toVector)
  }
}

val all = List(UserObject.fromList(user1), UserObject.fromList(user2))

val users = sc.parallelize(all.combinations(2).toSeq.map {
  case l :: r :: Nil => (l, r)
})

def euclDistance(userA: User, userB: User) = {
  println(s"comparing ${userA.name} and ${userB.name}")
  val subElements = (userA.features zip userB.features) map {
    m => (m._1 - m._2) * (m._1 - m._2)
  }
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)
  println("value is " + sqRoot)
  sqRoot
}

users.foreach(t => euclDistance(t._1, t._2))
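As a minimal sketch (my assumption, not part of the original code: println inside foreach runs on the executors, so its output is not guaranteed to reach the driver's console), the distances can instead be computed as data and collected back to the driver for printing:

// compute the distances as values and collect them to the driver to print
val distances = users.map { case (a, b) => ((a.name, b.name), euclDistance(a, b)) }
distances.collect().foreach(println) // expected: ((a,b),1.4142135623730951)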
Update 2:
I have tried the code from maasg's answer, but I get an error:
scala> val userDistanceRdd = usersRdd.map { case (user1, user2) => {
| val data = sc.broadcast.value
| val distance = euclidDistance(data(user1), data(user2))
| ((user1, user2),distance)
| }
| }
<console>:27: error: missing arguments for method broadcast in class SparkContext;
follow this method with `_' if you want to treat it as a partially applied function
       val data = sc.broadcast.value
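(As I understand it, broadcast is a method on SparkContext that takes the value to broadcast as an argument, so the bare sc.broadcast has no arguments and .value cannot be called on it; the fix, shown further below, is to read the already-created broadcastData handle instead.)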
Here is the entire modified code:
type UserId = String
type UserData = Array[Double]

val users: List[UserId] = List("a", "b")
val data: Map[UserId, UserData] = Map(("a", Array(3.0, 4.0)),
                                      ("b", Array(3.0, 4.0)))

def combinations[T](l: List[T]): List[(T,T)] = l match {
  case Nil => Nil
  case h :: Nil => Nil
  case h :: t => t.map(x => (h, x)) ++ combinations(t)
}

val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))

val euclidDistance: (UserData, UserData) => Double = (x, y) =>
  math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

val userDistanceRdd = usersRdd.map { case (user1, user2) => {
  val data = sc.broadcast.value // this line triggers the error above
  val distance = euclidDistance(data(user1), data(user2))
  ((user1, user2), distance)
  }
}
To make maasg's code work, I needed to add a missing closing } to the userDistanceRdd function (and read the broadcast through broadcastData.value instead of sc.broadcast.value, which was causing the error above).
The code:
type UserId = String
type UserData = Array[Double]

val users: List[UserId] = List("a", "b")
val data: Map[UserId, UserData] = Map(("a", Array(3.0, 4.0)),
                                      ("b", Array(3.0, 3.0)))

def combinations[T](l: List[T]): List[(T,T)] = l match {
  case Nil => Nil
  case h :: Nil => Nil
  case h :: t => t.map(x => (h, x)) ++ combinations(t)
}

val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))

val euclidDistance: (UserData, UserData) => Double = (x, y) =>
  math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

val userDistanceRdd = usersRdd.map { case (user1, user2) => {
  val data = broadcastData.value
  val distance = euclidDistance(data(user1), data(user2))
  ((user1, user2), distance)
  }
}
userDistanceRdd.foreach(println)
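For this toy data (a = Array(3.0, 4.0), b = Array(3.0, 3.0)) the distance is sqrt((3-3)^2 + (4-3)^2) = 1.0, so the job should print ((a,b),1.0) (in local mode the executors run in the same JVM, so their output lands in the same console).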
Answer 0 (score: 2)
First of all, I suggest moving your user model out of raw lists into a well-typed class. Second, I don't think you need to compute the distance between the same user, as in (a, a) and (b, b), and there is no reason to compute each distance twice, as in (a, b) and (b, a).
import org.apache.spark.rdd.RDD

val user1 = List("a", "1", "3", "2", "6", "9")
val user2 = List("b", "1", "2", "2", "5", "9")

case class User(name: String, features: Vector[Double])

object User {
  def fromList(list: List[String]): User = list match {
    case h :: tail => User(h, tail.map(_.toDouble).toVector)
  }
}

def euclDistance(userA: User, userB: User) = {
  println(s"comparing ${userA.name} and ${userB.name}")
  val subElements = (userA.features zip userB.features) map {
    m => (m._1 - m._2) * (m._1 - m._2)
  }
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)
  sqRoot
}

val all = List(User.fromList(user1), User.fromList(user2))

val users: RDD[(User, User)] = sc.parallelize(all.combinations(2).toSeq.map {
  case l :: r :: Nil => (l, r)
})

users.foreach(t => euclDistance(t._1, t._2))
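As a side note, combinations(2) from the Scala standard library already yields each unordered pair exactly once, which is what avoids the (a, a)-style pairs and the reversed duplicates mentioned above. For example:

List("a", "b", "c").combinations(2).toList
// List(List(a, b), List(a, c), List(b, c)): no (a, a) and no reversed (b, a)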
Answer 1 (score: 1)
The actual solution will depend on the dimensions of the dataset. Assuming that the original dataset fits in memory and that you want to parallelize the computation of the Euclidean distances, I would go about it like this:
Assume that users is a list of users identified by some ID, and data holds each user's data, indexed by user ID.
// sc is the Spark Context
type UserId = String
type UserData = Array[Double]

val users: List[UserId] = ???
val data: Map[UserId, UserData] = ???

// combinations generates the unique pairs of users for which distance makes sense;
// given that euclidDistance(a,b) == euclidDistance(b,a), only (a,b) is in this set
def combinations[T](l: List[T]): List[(T,T)] = l match {
  case Nil => Nil
  case h :: Nil => Nil
  case h :: t => t.map(x => (h, x)) ++ combinations(t)
}

// broadcast the data to all workers
val broadcastData = sc.broadcast(data)

val usersRdd = sc.parallelize(combinations(users))

val euclidDistance: (UserData, UserData) => Double = (x, y) =>
  math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

val userDistanceRdd = usersRdd.map { case (user1, user2) => {
  val data = broadcastData.value
  val distance = euclidDistance(data(user1), data(user2))
  ((user1, user2), distance)
}
If the user data is too large, you could load it from external storage instead of using a broadcast variable.
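A minimal sketch of that alternative, assuming the per-user data sits in a keyed RDD (dataRdd below is an illustrative name; in practice it could be parsed from files) and is attached through pair-RDD joins instead of a broadcast:

import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.x)

// look up each user's data via two joins instead of a broadcast map
val dataRdd = sc.parallelize(data.toSeq) // RDD[(UserId, UserData)]; could come from external storage
val userDistances = usersRdd
  .join(dataRdd)                                 // (user1, (user2, data1))
  .map { case (u1, (u2, d1)) => (u2, (u1, d1)) } // re-key by user2
  .join(dataRdd)                                 // (user2, ((user1, data1), data2))
  .map { case (u2, ((u1, d1), d2)) => ((u1, u2), euclidDistance(d1, d2)) }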