Scala-Spark: Convert a DataFrame to RDD[Edge]

Time: 2017-09-24 10:15:38

Tags: scala spark-dataframe rdd spark-graphx

I have a DataFrame representing the edges of a graph; this is its schema:

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- relationship: struct (nullable = false)
 |    |-- business_id: string (nullable = true)
 |    |-- normalized_influence: double (nullable = true)

I want to convert it to an RDD[Edge] so that I can use the Pregel API. My difficulty is with the "relationship" attribute. How can I convert it?

1 answer:

Answer 0 (score: 1)

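Edge is a parameterized class. This means that, besides the source and destination IDs, you can store anything you like on each edge. In your case, that would be Edge[Relationship]. You can use case classes to map between the DataFrame and an RDD[Edge[Relationship]]: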

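import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3
import spark.implicits._ // assumes a SparkSession named `spark`; provides the encoder for df.as[MyEdge]

// Case classes mirroring the DataFrame schema
case class Relationship(business_id: String, normalized_influence: Double)
case class MyEdge(src: String, dst: String, relationship: Relationship)

val edges: RDD[Edge[Relationship]] = df.as[MyEdge].rdd.map { edge =>
    Edge(
        // VertexId is a Long, so the string IDs are hashed.
        // stringHash returns a 32-bit Int, so distinct strings can collide.
        MurmurHash3.stringHash(edge.src).toLong,
        MurmurHash3.stringHash(edge.dst).toLong,
        edge.relationship
    )
}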
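
Because the hash above is only 32 bits wide, two different vertex names could in principle end up with the same VertexId. If that matters for your data, one possible alternative is to assign IDs with zipWithUniqueId and join them back onto the edges. The following is a minimal sketch, not part of the original answer: the object name UniqueIdEdges and the helpers vertexIds and buildEdges are hypothetical, and the sample rows in main are made up to match the schema in the question.

```scala
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Top-level case classes so Spark can derive encoders for them.
case class Relationship(business_id: String, normalized_influence: Double)
case class MyEdge(src: String, dst: String, relationship: Relationship)

// Hypothetical helper object, for illustration only.
object UniqueIdEdges {

  // One unique Long per distinct vertex name.
  // Cached so the assignment is computed once and reused by both joins below.
  def vertexIds(typed: RDD[MyEdge]): RDD[(String, Long)] =
    typed.flatMap(e => Seq(e.src, e.dst)).distinct().zipWithUniqueId().cache()

  // Replace the string endpoints of each edge with the assigned Long IDs.
  def buildEdges(typed: RDD[MyEdge], ids: RDD[(String, Long)]): RDD[Edge[Relationship]] =
    typed
      .map(e => (e.src, e))
      .join(ids) // (src, (edge, srcId))
      .map { case (_, (e, srcId)) => (e.dst, (srcId, e.relationship)) }
      .join(ids) // (dst, ((srcId, relationship), dstId))
      .map { case (_, ((srcId, rel), dstId)) => Edge(srcId, dstId, rel) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unique-id-edges")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows shaped like the schema in the question.
    val df = Seq(
      MyEdge("a", "b", Relationship("biz1", 0.5)),
      MyEdge("b", "c", Relationship("biz2", 0.25))
    ).toDF()

    val typed = df.as[MyEdge].rdd
    val ids = vertexIds(typed)
    val edges = buildEdges(typed, ids)

    edges.collect().foreach(println)
    spark.stop()
  }
}
```

Keeping the ids RDD around lets you translate VertexIds back to the original strings after running Pregel. The trade-off is two extra joins compared with hashing. Also, since the schema marks normalized_influence as nullable, declaring that field as Option[Double] is probably the safer choice if nulls can actually occur in your data.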