Does the GraphFrames API support creating bipartite graphs?

Date: 2016-04-13 14:37:26

Tags: apache-spark graphframes

Does the GraphFrames API support creating bipartite graphs in the current version?

Current version: 0.1.0

Spark version: 1.6.1

1 answer:

Answer 0 (score: 4)

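As pointed out in the comments on the question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create one. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex/object types. While that works with RDDs, it does not work with DataFrames. A row in a DataFrame has a fixed schema: it cannot contain a price column in some rows and lack it in others. It can have a price column that is sometimes null, but the column must exist in every row.

Instead, the solution for GraphFrames seems to be to define a DataFrame that is essentially a linear sub-type of both types of objects in the bipartite graph: it has to contain all of the fields of both object types. This is actually pretty easy, because a join with full_outer gives you exactly that. Something like this: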

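// spark-shell already has these in scope; a standalone app needs them
// (sqlContext is assumed to be an existing SQLContext).
import sqlContext.implicits._            // toDF and the $"col" syntax
import org.apache.spark.sql.functions._  // coalesce, struct

val players = Seq(
  (1, "dave", 34),
  (2, "griffin", 44)
).toDF("id", "name", "age")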

val teams = Seq(
  (101,"lions","7-1"),
  (102,"tigers","5-3"),
  (103,"bears","0-9")
).toDF("id","team","record")

Then you can create a superset DataFrame like this:

val teamPlayer = players.withColumnRenamed("id", "l_id").join(
  teams.withColumnRenamed("id", "r_id"),
  $"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
 .drop($"r_id")
 .withColumnRenamed("l_id", "id")

teamPlayer.show

+---+-------+----+------+------+
| id|   name| age|  team|record|
+---+-------+----+------+------+
|101|   null|null| lions|   7-1|
|102|   null|null|tigers|   5-3|
|103|   null|null| bears|   0-9|
|  1|   dave|  34|  null|  null|
|  2|griffin|  44|  null|  null|
+---+-------+----+------+------+

You can do this a little more cleanly with structs:

val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
  teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
  $"l_id" === $"r_id",
  "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
 .drop($"r_id")
 .withColumnRenamed("l_id", "id")

tpStructs.show

+---+------------+------------+
| id|      player|        team|
+---+------------+------------+
|101|        null| [lions,7-1]|
|102|        null|[tigers,5-3]|
|103|        null| [bears,0-9]|
|  1|   [dave,34]|        null|
|  2|[griffin,44]|        null|
+---+------------+------------+
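
Either superset DataFrame can then serve as the vertex set of a GraphFrame. The following is a minimal sketch, continuing the same spark-shell session and assuming the graphframes package is on the classpath. The `playsFor` edges and the `relationship` column are made up for illustration and are not part of the original data. GraphFrames only requires an `id` column on the vertices and `src`/`dst` columns on the edges.

```scala
import org.graphframes.GraphFrame

// Hypothetical edges (assumed for illustration): which player is on which team.
val playsFor = Seq(
  (1, 101, "plays_for"),
  (2, 103, "plays_for")
).toDF("src", "dst", "relationship")

// Vertices need an "id" column; edges need "src" and "dst".
val g = GraphFrame(tpStructs, playsFor)

// The two sides of the bipartite graph are the rows where one struct is non-null.
val playerVertices = g.vertices.filter($"player".isNotNull)
val teamVertices   = g.vertices.filter($"team".isNotNull)

// Sanity check: keep only edges that go from a player vertex to a team vertex.
val validEdges = g.edges
  .join(playerVertices.select($"id" as "src"), "src")
  .join(teamVertices.select($"id" as "dst"), "dst")

// Zero means every edge respects the player -> team structure.
val strayEdges = g.edges.count - validEdges.count
```

Nothing in GraphFrames enforces that the graph is bipartite, so a check like `strayEdges` is the only guard against edges that connect two vertices of the same kind.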

I'll also point out that more or less the same solution works in GraphX with RDDs. You can always create the vertices by joining two RDDs of case classes that don't share any traits:

case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
  (1L, Player("dave", 34)),
  (2L, Player("griffin", 44))
))

case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))

playerRdd.fullOuterJoin(teamRdd).collect foreach println

(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(dave,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))

With all due respect to the previous answer, this seems like a more flexible way to handle it, without having to share a trait between the combined objects.
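
To go from the joined RDD to an actual GraphX graph, the result of `fullOuterJoin` can be passed to `Graph` as the vertex RDD. This is only a sketch, continuing the same session. The edges and the `"plays_for"` label are assumed for illustration and do not come from the original data.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Each vertex attribute is a pair of Options; exactly one side is defined.
val vertices: RDD[(Long, (Option[Player], Option[Team]))] =
  playerRdd.fullOuterJoin(teamRdd)

// Hypothetical edges (assumed for illustration): player id -> team id.
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 101L, "plays_for"),
  Edge(2L, 103L, "plays_for")
))

val graph = Graph(vertices, edges)

// Pattern matching on the Option pair tells the two vertex kinds apart.
val descriptions = graph.triplets.map { t =>
  (t.srcAttr, t.dstAttr) match {
    case ((Some(p), None), (None, Some(tm))) => s"${p.name} plays for ${tm.team}"
    case _ => s"edge ${t.srcId} -> ${t.dstId} does not go from a player to a team"
  }
}

descriptions.collect foreach println
```

The fallback case in the match matters because GraphX, like GraphFrames, does nothing to stop an edge from linking two players or two teams.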