Question

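I want to check whether a new graph (call it A) is a subgraph of another graph (call it B). I wrote a small test, but it failed! I ran this demo only in spark-shell, on Spark version 1.6.1: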

import org.apache.spark.graphx._

// Build graph B
val usersB = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal","postdoc")),
  (5L, ("franklin", "prof")),
  (2L, ("istoica", "prof"))
))

val relationshipsB = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi")
))

val defaultUser = ("John Doe", "Missing")

val graphB = Graph(usersB, relationshipsB, defaultUser)

// Build the initial Graph A
val usersA = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof"))
))

val relationshipsA = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor")
))

val testGraphA = Graph(usersA, relationshipsA, defaultUser)

// Apply the mask: keep only the vertices and edges of A that also exist in B
val maskResult = testGraphA.mask(graphB)
maskResult.edges.count
maskResult.vertices.count

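From my understanding of the API on the Spark website, the mask function should return all the vertices and edges that the two graphs have in common. However, only the vertex count is correct (maskResult.vertices.count = 3); the edge count should be 2, but I get maskResult.edges.count = 0.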

Answer 1

If you look at the source, you'll see that mask uses EdgeRDD.innerJoin. If you look at the documentation for innerJoin, you'll see the caveat:

Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same PartitionStrategy.
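To make that caveat concrete, here is a toy model in plain Scala (no Spark). It is not GraphX's actual implementation; the object name, the helper functions and the placement rules are all made up for illustration. It only shows that a join which pairs partition i with partition i finds common edges only when both sides placed those edges by the same rule.

```scala
object PartitionJoinSketch {
  type E = (Long, Long) // an edge as (srcId, dstId); attributes are left out

  // Put each edge into one of n buckets according to a placement rule.
  def partition(edges: Seq[E], n: Int)(place: E => Int): Vector[Set[E]] =
    Vector.tabulate(n)(i => edges.filter(e => Math.floorMod(place(e), n) == i).toSet)

  // Join bucket i on the left only with bucket i on the right,
  // and count the edges found on both sides.
  def zipJoinCount(a: Vector[Set[E]], b: Vector[Set[E]]): Int =
    a.zip(b).map { case (x, y) => (x intersect y).size }.sum

  val edgesA: Seq[E] = Seq((3L, 7L), (5L, 3L))
  val edgesB: Seq[E] = Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L))

  // Two arbitrary, hypothetical placement rules.
  val bySrc: E => Int = e => e._1.toInt
  val bySrcPlusOne: E => Int = e => e._1.toInt + 1

  // Different rules on each side: shared edges land in different buckets.
  val mismatched: Int =
    zipJoinCount(partition(edgesA, 2)(bySrc), partition(edgesB, 2)(bySrcPlusOne))

  // Same rule on both sides: shared edges meet in the same bucket.
  val aligned: Int =
    zipJoinCount(partition(edgesA, 2)(bySrc), partition(edgesB, 2)(bySrc))
}
```

In this sketch `mismatched` comes out as 0 and `aligned` as 2, which mirrors the edge counts seen in the question before and after repartitioning.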

You need to create and use a PartitionStrategy. If you do the following, it will get you the result you want (but probably won't scale well):

object MyPartStrat extends PartitionStrategy {
  override def getPartition(s: VertexId, d: VertexId, n: PartitionID) : PartitionID = {
    1     // this is just to prove the point, you'll need a real partition strategy
  }
}

Then, if you do this:

val maskResult = testGraphA.partitionBy(MyPartStrat).mask(graphB.partitionBy(MyPartStrat))

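you'll get the desired result. But as I said, you'll probably want to work out a better partitioning strategy than just stuffing everything into one partition.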
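As one possible alternative to a hand-written strategy, GraphX ships with built-in strategies such as PartitionStrategy.EdgePartition2D. The sketch below assumes that graphB and testGraphA from the question are already defined in spark-shell, and that passing the same strategy and the same partition count to both graphs is enough to satisfy the innerJoin assumption. The value of numParts and the names partitionedA, partitionedB, maskedA and isSubgraph are placeholders.

```scala
import org.apache.spark.graphx._

// Assumes graphB and testGraphA from the question already exist in the shell.
val numParts = 4 // hypothetical value; choose one that suits your cluster

// Repartition both graphs with the same strategy and the same partition count.
val partitionedA = testGraphA.partitionBy(PartitionStrategy.EdgePartition2D, numParts)
val partitionedB = graphB.partitionBy(PartitionStrategy.EdgePartition2D, numParts)

val maskedA = partitionedA.mask(partitionedB)

// If the mask dropped nothing, every vertex and edge of A also appears in B.
val isSubgraph =
  maskedA.vertices.count == testGraphA.vertices.count &&
  maskedA.edges.count == testGraphA.edges.count
```

Note that mask matches vertices and edges by their ids and keeps the attributes from the graph it is called on, so this check says nothing about whether the attributes in A and B agree.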

How do I use the mask function of a Spark graph?
