How do I join the DataFrames in my example?

Time: 2018-05-07 21:04:58

Tags: scala apache-spark spark-dataframe

I have two DataFrames:

edges =
   srcId    dstId    timestamp
   1        3        1345534569
   1        4        1346564657
   1        2        1345769687
   2        3        1345769687
   4        3        1345769687


vertices =
   id   name   s_type
   1    abc    A
   2    def    B
   3    rtf    C
   4    wrr    D

I want to get a DataFrame with the following structure (only the first row is shown as an example):

result = 

       srcId    name_src   s_type_src   dstId   name_dst   s_type_dst    timestamp
       1        abc        A            3       rtf        C             1345534569

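In other words, I want to append the suffix _src to the columns brought in by the join on srcId, and the suffix _dst to the columns brought in by the join on dstId.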

Here is my attempt at the task, but I do not know how to give the column names the _src and _dst suffixes:

val result = edges
                .join(vertices, col("srcId")===col("id"),"inner")
                .join(vertices, col("dstId")===col("id"),"inner")

1 answer:

Answer 0 (score: 1)

You can simply alias the two copies of vertices and rename their columns with as() inside a select:

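// Outside spark-shell, bring toDF and the $"..." syntax into scope first
// (assumes a SparkSession named `spark`): import spark.implicits._
val edges = Seq(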
  (1, 3, 1345534569),
  (1, 4, 1346564657),
  (1, 2, 1345769687),
  (2, 3, 1345769687),
  (4, 3, 1345769687)
).toDF("srcId", "dstId", "timestamp")

val vertices = Seq(
  (1, "abc", "A"),
  (2, "def", "B"),
  (3, "rtf", "C"),
  (4, "wrr", "D")
).toDF("id", "name", "s_type")

import org.apache.spark.sql.functions._

val result = edges.
  join(vertices.as("s"), $"srcId" === $"s.id", "inner").
  join(vertices.as("d"), $"dstId" === $"d.id", "inner").
  select(
    $"srcId", $"s.name".as("name_src"), $"s.s_type".as("s_type_src"),
    $"dstId", $"d.name".as("name_dst"), $"d.s_type".as("s_type_dst"),
    $"timestamp"
  )

result.show
// +-----+--------+----------+-----+--------+----------+----------+
// |srcId|name_src|s_type_src|dstId|name_dst|s_type_dst| timestamp|
// +-----+--------+----------+-----+--------+----------+----------+
// |    1|     abc|         A|    3|     rtf|         C|1345534569|
// |    1|     abc|         A|    4|     wrr|         D|1346564657|
// |    1|     abc|         A|    2|     def|         B|1345769687|
// |    2|     def|         B|    3|     rtf|         C|1345769687|
// |    4|     wrr|         D|    3|     rtf|         C|1345769687|
// +-----+--------+----------+-----+--------+----------+----------+
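
If vertices had many attribute columns, writing each alias by hand would become tedious. The following is a minimal sketch of building the same select list programmatically. The helper name `suffixed`, the local SparkSession, and the application name are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session is good enough for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
import spark.implicits._

val edges = Seq(
  (1, 3, 1345534569), (1, 4, 1346564657), (1, 2, 1345769687),
  (2, 3, 1345769687), (4, 3, 1345769687)
).toDF("srcId", "dstId", "timestamp")

val vertices = Seq(
  (1, "abc", "A"), (2, "def", "B"), (3, "rtf", "C"), (4, "wrr", "D")
).toDF("id", "name", "s_type")

// Hypothetical helper: refer to each column through a DataFrame alias
// and rename it by appending the given suffix.
def suffixed(alias: String, suffix: String, names: Seq[String]): Seq[Column] =
  names.map(n => col(s"$alias.$n").as(s"$n$suffix"))

// Every vertices column except the join key.
val attrs = vertices.columns.filterNot(_ == "id").toSeq

val selected: Seq[Column] =
  Seq(col("srcId")) ++ suffixed("s", "_src", attrs) ++
  Seq(col("dstId")) ++ suffixed("d", "_dst", attrs) ++
  Seq(col("timestamp"))

val result = edges
  .join(vertices.as("s"), col("srcId") === col("s.id"), "inner")
  .join(vertices.as("d"), col("dstId") === col("d.id"), "inner")
  .select(selected: _*)

result.show()
```

The aliases "s" and "d" are what make the two copies of vertices distinguishable, which is why the unaliased col("id") in the question's attempt cannot tell them apart.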

Alternatively, you can rename the vertices columns accordingly before joining, like this:

val cols = vertices.columns
val v_src = vertices.toDF(cols.map(_ + "_src"): _*)
val v_dst = vertices.toDF(cols.map(_ + "_dst"): _*)

val result = edges.
  join(v_src, $"srcId" === $"id_src", "inner").
  join(v_dst, $"dstId" === $"id_dst", "inner")
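
Note that the result of this second approach still contains the renamed key columns id_src and id_dst, which duplicate srcId and dstId. As a hedged sketch, a final select can drop them and put the columns in the order the question asks for. The local SparkSession and the names `vSrc`, `vDst`, and the application name are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is good enough for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

val edges = Seq(
  (1, 3, 1345534569), (1, 4, 1346564657), (1, 2, 1345769687),
  (2, 3, 1345769687), (4, 3, 1345769687)
).toDF("srcId", "dstId", "timestamp")

val vertices = Seq(
  (1, "abc", "A"), (2, "def", "B"), (3, "rtf", "C"), (4, "wrr", "D")
).toDF("id", "name", "s_type")

// Rename every vertices column up front, once per side of the edge.
val cols = vertices.columns
val vSrc = vertices.toDF(cols.map(_ + "_src"): _*)
val vDst = vertices.toDF(cols.map(_ + "_dst"): _*)

val result = edges
  .join(vSrc, $"srcId" === $"id_src", "inner")
  .join(vDst, $"dstId" === $"id_dst", "inner")
  // Keep only the requested columns; id_src and id_dst are left out
  // because they repeat srcId and dstId.
  .select("srcId", "name_src", "s_type_src", "dstId", "name_dst", "s_type_dst", "timestamp")

result.show()
```

Because the column names are already unique after the rename, no DataFrame aliases are needed here and plain string column names are enough in the select.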