Question

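I'm building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) of each person in the database. For this purpose, assume that the first person returned when searching by id is the correct parent.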

val peopleById = peopleRDD.keyBy(f => f.id)
def findUltimateParentId(personId: String) : String = {

    if((personId == null) || (personId.length() == 0))
        return "-1"

    val personSeq = peopleById.lookup(personId)
    val person = personSeq(0)
    if(person.id == "0" || person.id == person.parentId) {

        return person.id

    }
    else {

        return findUltimateParentId(person.parentId)

    }

}

val ultimateParentIds = peopleRDD.foreach(f => findUltimateParentId(f.parentId))

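This produces the following error: "Caused by: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063."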

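From reading other similar questions, I understand that the problem is that I'm calling findUltimateParentId from inside the foreach loop. If I call the method from the shell with a person's id, it returns the correct ultimate parent id.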

However, none of the other suggested solutions work for me, or at least I can't see how to implement them in my program. Can anyone help?

Answer 1

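If I understand you correctly, here is a solution that works for input of any size (although performance might not be great). It performs N iterations over the RDD, where N is the depth of the "deepest family" in the input (the largest distance from an ancestor to a descendant):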

import org.apache.spark.rdd.RDD

// representation of input: each person has an ID and an optional parent ID
case class Person(id: Int, parentId: Option[Int])

// representation of result: each person is optionally attached its "ultimate" ancestor,
// or none if it had no parent id in the first place
case class WithAncestor(person: Person, ancestor: Option[Person]) {
  def hasGrandparent: Boolean = ancestor.exists(_.parentId.isDefined)
}

object RecursiveParentLookup {
  // requested method
  def findUltimateParent(rdd: RDD[Person]): RDD[WithAncestor] = {

    // all persons keyed by id
    // a val, not a def: build and cache the keyed RDD once instead of on every use
    val byId = rdd.keyBy(_.id).cache()

    // recursive function that "climbs" one generation at each iteration
    def climbOneGeneration(persons: RDD[WithAncestor]): RDD[WithAncestor] = {
      val cached = persons.cache()
      // find which persons can climb further up family tree
      val haveGrandparents = cached.filter(_.hasGrandparent)

      if (haveGrandparents.isEmpty()) {
        cached // we're done, return result
      } else {
        val done = cached.filter(!_.hasGrandparent) // these are done, we'll return them as-is
        // for those who can - join with persons to find the grandparent and attach it instead of parent
        val withGrandparents = haveGrandparents
          .keyBy(_.ancestor.get.parentId.get) // grandparent id
          .join(byId)
          .values
          .map({ case (withAncestor, grandparent) => WithAncestor(withAncestor.person, Some(grandparent)) })
        // call this method recursively on the result
        done ++ climbOneGeneration(withGrandparents)
      }
    }

    // call recursive method - start by treating each person as its own ancestor, if it has a parent:
    climbOneGeneration(rdd.map(p => WithAncestor(p, p.parentId.map(_ => p))))
  }

}

Here is a test that shows how this works:

/**
  * Example input tree:
  *
  *            1             5
  *            |             |
  *      ----- 2 -----       6
  *      |           |
  *      3           4
  *
  */
val person1 = Person(1, None)
val person2 = Person(2, Some(1))
val person3 = Person(3, Some(2))
val person4 = Person(4, Some(2))
val person5 = Person(5, None)
val person6 = Person(6, Some(5))

test("find ultimate parent") {
  val input = sc.parallelize(Seq(person1, person2, person3, person4, person5, person6))
  val result = RecursiveParentLookup.findUltimateParent(input).collect()
  result should contain theSameElementsAs Seq(
    WithAncestor(person1, None),
    WithAncestor(person2, Some(person1)),
    WithAncestor(person3, Some(person1)),
    WithAncestor(person4, Some(person1)),
    WithAncestor(person5, None),
    WithAncestor(person6, Some(person5))
  )
}

It should be easy to map your input into these Person objects, and to map the output WithAncestor objects into whatever you need. Note that this code assumes that if any person has parentId X, another person with that id actually exists in the input.
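Because the solution climbs the tree with an inner join, a person whose ancestor chain points at an id that is missing from the input would silently drop out of the result. A minimal sketch of a pre-check for that assumption, written against a plain Scala collection so it can be tried without a cluster (the ParentCheck object and the danglingParentIds name are made up for illustration and are not part of the answer's code):

```scala
object ParentCheck {
  // Same shape as the Person class used in the answer above.
  final case class Person(id: Int, parentId: Option[Int])

  // Parent ids that some person refers to but that no person in the input has as its own id.
  def danglingParentIds(people: Seq[Person]): Set[Int] = {
    val knownIds = people.map(_.id).toSet
    people.flatMap(_.parentId).toSet.diff(knownIds)
  }
}
```

An empty result means the assumption holds. On an RDD, roughly the same check could probably be expressed with flatMap, distinct and subtract.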

Answer 2

Fixed this by using SparkContext.broadcast:

val peopleById = peopleRDD.keyBy(f => f.id)
val broadcastedPeople = sc.broadcast(peopleById.collectAsMap())

def findUltimateParentId(personId: String) : String = {

    if((personId == null) || (personId.length() == 0))
        return "-1"

    val personOption = broadcastedPeople.value.get(personId)
    if(personOption.isEmpty) {

        return "0";

    }
    val person = personOption.get
    if(person.id == "0" || person.id == person.parentId) {

        return person.id

    }
    else {

        return findUltimateParentId(person.parentId)

    }

}

// map (not foreach) so the computed ids are kept; foreach returns Unit
val ultimateParentIds = peopleRDD.map(f => findUltimateParentId(f.parentId))

Works great now!
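Two caveats with this approach: collectAsMap pulls every person onto the driver, so it only suits data that fits in memory, and the recursion never terminates if the parent links contain a cycle. Below is a minimal sketch of a tail-recursive lookup with a cycle guard. It is not the code from this answer; the Person class here is a hypothetical two-field stand-in (the real class is not shown in the question), and the "-1" and "0" sentinels simply mirror the ones used above.

```scala
import scala.annotation.tailrec

object UltimateParent {
  // Hypothetical minimal record; the real class in the question is not shown.
  final case class Person(id: String, parentId: String)

  // `people` plays the role of broadcastedPeople.value from the answer above.
  def findUltimateParentId(people: collection.Map[String, Person], startId: String): String = {
    @tailrec
    def climb(currentId: String, seen: Set[String]): String =
      if (currentId == null || currentId.isEmpty) "-1" // no id to look up
      else people.get(currentId) match {
        case None => "0" // unknown id, same sentinel as above
        case Some(p) =>
          if (p.id == "0" || p.id == p.parentId) p.id // reached the top of the tree
          else if (seen.contains(p.parentId)) p.id    // cycle detected: stop here
          else climb(p.parentId, seen + p.parentId)
      }

    climb(startId, Set(startId))
  }
}
```

Inside a Spark job this could be called as peopleRDD.map(f => UltimateParent.findUltimateParentId(broadcastedPeople.value, f.parentId)), assuming the broadcast map holds records of this shape.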

Recursive method call in Apache Spark
