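I want to check whether a new graph (call it A) is a subgraph of another graph (call it B). I wrote a small test, but it failed! I am running the demo in spark-shell only, on Spark version 1.6.1: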
import org.apache.spark.graphx._

// Build graph B
val usersB = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")),
  (2L, ("istoica", "prof"))
))
val relationshipsB = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi")
))
// Attribute used for any vertex that appears in an edge but not in the vertex list
val defaultUser = ("John Doe", "Missing")
val graphB = Graph(usersB, relationshipsB, defaultUser)
// Build graph A
val usersA = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof"))
))
val relationshipsA = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor")
))
val testGraphA = Graph(usersA, relationshipsA, defaultUser)
// Apply the mask
val maskResult = testGraphA.mask(graphB)
maskResult.edges.count
maskResult.vertices.count
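From my understanding of the API docs on the Spark website, the mask function should return all the vertices and edges that the two graphs have in common. However, only the vertex count is correct (maskResult.vertices.count = 3). The edge count should be 2, but I get maskResult.edges.count = 0.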
Answer 0 (score: 2)
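If you look at the source, you'll see that mask uses EdgeRDD.innerJoin. If you look at the documentation for innerJoin, you'll see this caveat:
Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same PartitionStrategy.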
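To make that caveat concrete, here is a toy model in plain Scala (no Spark needed). It is only a sketch of the idea, not Spark's actual implementation: `E`, `slice`, `byKey`, and `zipJoin` are names made up for this illustration, and the contiguous slicing is an assumption about how the edges end up spread over partitions when no PartitionStrategy is applied.

```scala
// Toy edge type, standing in for GraphX's Edge (hypothetical, for illustration only).
case class E(src: Long, dst: Long, attr: String)

// Assumption: without a PartitionStrategy, edges are spread over n partitions
// by their position in the input collection.
def slice(edges: Seq[E], n: Int): Seq[Seq[E]] =
  (0 until n).map(i => edges.slice(i * edges.length / n, (i + 1) * edges.length / n))

// With a shared strategy, the partition depends only on (src, dst).
def byKey(edges: Seq[E], n: Int): Seq[Seq[E]] =
  (0 until n).map(i => edges.filter(e => ((e.src * 31 + e.dst) % n).toInt == i))

// Partition-wise inner join: partition i is only compared with partition i.
def zipJoin(a: Seq[Seq[E]], b: Seq[Seq[E]]): Seq[E] =
  a.zip(b).flatMap { case (pa, pb) =>
    val keys = pb.map(e => (e.src, e.dst)).toSet
    pa.filter(e => keys((e.src, e.dst)))
  }

val edgesA = Seq(E(3, 7, "collab"), E(5, 3, "advisor"))
val edgesB = Seq(E(3, 7, "collab"), E(5, 3, "advisor"), E(2, 5, "colleague"), E(5, 7, "pi"))

// Same edges, but they sit in different partitions on each side, so nothing matches.
val unaligned = zipJoin(slice(edgesA, 4), slice(edgesB, 4))

// Same placement rule on both sides, so matching edges meet in the same partition.
val aligned = zipJoin(byKey(edgesA, 4), byKey(edgesB, 4))
```

In this model the position-based layout finds no common edges even though both edges of A exist in B, while the key-based layout finds both. That mirrors the symptom in the question, under the stated assumptions.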
You need to create a PartitionStrategy and use it on both graphs. If you do the following, you'll get the result you want (but it probably won't scale well):
object MyPartStrat extends PartitionStrategy {
  override def getPartition(s: VertexId, d: VertexId, n: PartitionID): PartitionID = {
    1 // this is just to prove the point, you'll need a real partition strategy
  }
}
Then, if you do this:
val maskResult = testGraphA.partitionBy(MyPartStrat).mask(graphB.partitionBy(MyPartStrat))
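you will get the desired result. But like I said, you will probably want to work out a better partition strategy than one that just stuffs everything into a single partition.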
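As one possible starting point, GraphX ships with built-in strategies such as PartitionStrategy.EdgePartition2D. The sketch below assumes it is pasted into the same spark-shell session, where testGraphA and graphB from the question are already defined. The names numParts, alignedA, alignedB, and maskResult2 are made up here, and the value 4 is arbitrary. I have not benchmarked this, so treat it as something to try rather than a recommendation.

```scala
import org.apache.spark.graphx.PartitionStrategy

// Hypothetical partition count; passing the same number to both graphs keeps
// the two edge RDDs at the same number of partitions.
val numParts = 4

// Repartition both graphs with the same built-in strategy.
val alignedA = testGraphA.partitionBy(PartitionStrategy.EdgePartition2D, numParts)
val alignedB = graphB.partitionBy(PartitionStrategy.EdgePartition2D, numParts)

// Mask A by B; edges are now joined partition by partition under one strategy.
val maskResult2 = alignedA.mask(alignedB)
maskResult2.edges.count
maskResult2.vertices.count
```

The point is only that both graphs go through the same strategy with the same number of partitions. Which built-in strategy fits best depends on the shape of your graph.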