Spark: create a DataFrame with a schema using values from an RDD

Asked: 2016-10-05 14:15:37

Tags: apache-spark

Suppose I have the following RDD:

val aSeq = Seq(("a",Seq(("aa",1.0),("bb",2.0),("cc",3.0))),
               ("b",Seq(("aa",3.0),("bb",4.0),("cc",5.0))),
               ("c",Seq(("aa",6.0),("bb",7.0),("cc",8.0))),
               ("d",Seq(("aa",9.0),("bb",10.0),("cc",11.0))))

val anRdd = sc.parallelize(aSeq)

How can I create a DataFrame that uses the first value of each inner tuple to name and build the schema? If I just convert it to a DF, I get the following:

val aDF = anRdd.toDF("id","column2")
aDF.printSchema

root
 |-- id: string (nullable = true)
 |-- column2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = false)

To be clearer, what I want is the following:

root
 |-- id: string (nullable = true)
 |-- column2: struct (nullable = true)
 |    |-- aa: double
 |    |-- bb: double
 |    |-- cc: double
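For context, a minimal sketch of the kind of transformation that yields this schema (untested here; it assumes the keys `aa`/`bb`/`cc` appear in every row, and a spark-shell session where the `toDF` implicits and `$` column syntax are in scope):

```scala
import org.apache.spark.sql.functions.struct

// Turn each inner Seq of (name, value) pairs into a Map, then pick the
// named entries out of the map column and pack them into a struct.
val flatDF = anRdd
  .map { case (id, pairs) => (id, pairs.toMap) }
  .toDF("id", "y")
  .select(
    $"id",
    struct(
      $"y".getItem("aa").as("aa"),
      $"y".getItem("bb").as("bb"),
      $"y".getItem("cc").as("cc")
    ).as("column2")
  )
```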

Edit

@eliasah gave a very clear answer that produces the desired output. I tried to apply it to one of my real cases, which is more deeply nested. To illustrate, here is a version of the first example with one more level of nesting:

val aSeqB = Seq(("a",Seq(("aa",(("aaa",1.0),("bbb",Array(2.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,4.0)))),("cc",(("aaa",4.0),("bbb",Array(9.0,3.0)))))),
               ("b",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))),
               ("c",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))),
               ("d",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))))

val anRddB = sc.parallelize(aSeqB)

How can I get a DataFrame with the following schema?

root
 |-- id: string
 |-- column2: struct
 |    |-- aa: struct
 |    |    |-- aaa: double
 |    |    |-- bbb: array
 |    |    |    |-- element: double
 |    |-- bb: struct
 |    |    |-- aaa: double
 |    |    |-- bbb: array
 |    |    |    |-- element: double
 |    |-- cc: struct
 |    |    |-- aaa: double
 |    |    |-- bbb: array
 |    |    |    |-- element: double

How can this be done?

1 Answer:

Answer (score: 2)

If I understand your question correctly, the solution isn't pretty, but here it is. You need to import the struct function:

scala> import org.apache.spark.sql.functions.struct
// import org.apache.spark.sql.functions.struct

scala> val seq = Seq(("a",Seq(("aa",(("aaa",1.0),("bbb",Array(2.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,4.0)))),("cc",(("aaa",4.0),("bbb",Array(9.0,3.0)))))),
           ("b",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))),
           ("c",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))),
           ("d",Seq(("aa",(("aaa",1.0),("bbb",Array(3.0,2.0)))),("bb",(("aaa",8.0),("bbb",Array(3.0,3.0)))),("cc",(("aaa",4.0),("bbb",Array(3.0,9.0)))))))

scala> val anRdd = sc.parallelize(seq)

Convert column2 to a map:

scala> val df = anRdd.map(x => (x._1, x._2.toMap)).toDF("x", "y")
// df: org.apache.spark.sql.DataFrame = [x: string, y: map<string,struct<_1:struct<_1:string,_2:double>,_2:struct<_1:string,_2:array<double>>>>]

Pull out the first level of fields:

scala> val df2 = df.select($"x".as("id"), struct($"y".getItem("aa").as("aa"),$"y".getItem("bb").as("bb"),$"y".getItem("cc").as("cc")).as("column2"))
// df2: org.apache.spark.sql.DataFrame = [id: string, column2: struct<aa:struct<_1:struct<_1:string,_2:double>,_2:struct<_1:string,_2:array<double>>>,bb:struct<_1:struct<_1:string,_2:double>,_2:struct<_1:string,_2:array<double>>>,cc:struct<_1:struct<_1:string,_2:double>,_2:struct<_1:string,_2:array<double>>>>]

scala> df2.printSchema
// root
//  |-- id: string (nullable = true)
//  |-- column2: struct (nullable = false)
//  |    |-- aa: struct (nullable = true)
//  |    |    |-- _1: struct (nullable = true)
//  |    |    |    |-- _1: string (nullable = true)
//  |    |    |    |-- _2: double (nullable = false)
//  |    |    |-- _2: struct (nullable = true)
//  |    |    |    |-- _1: string (nullable = true)
//  |    |    |    |-- _2: array (nullable = true)
//  |    |    |    |    |-- element: double (containsNull = false)
//  |    |-- bb: struct (nullable = true)
//  |    |    |-- _1: struct (nullable = true)
//  |    |    |    |-- _1: string (nullable = true)
//  |    |    |    |-- _2: double (nullable = false)
//  |    |    |-- _2: struct (nullable = true)
//  ... (output truncated; bb and cc mirror the structure of aa)

scala> df2.show(false)
// +---+----------------------------------------------------------------------------------------------------------------------------+
// |id |column2                                                                                                                     |
// +---+----------------------------------------------------------------------------------------------------------------------------+
// |a  |[[[aaa,1.0],[bbb,WrappedArray(2.0, 2.0)]],[[aaa,8.0],[bbb,WrappedArray(3.0, 4.0)]],[[aaa,4.0],[bbb,WrappedArray(9.0, 3.0)]]]|
// |b  |[[[aaa,1.0],[bbb,WrappedArray(3.0, 2.0)]],[[aaa,8.0],[bbb,WrappedArray(3.0, 3.0)]],[[aaa,4.0],[bbb,WrappedArray(3.0, 9.0)]]]|
// |c  |[[[aaa,1.0],[bbb,WrappedArray(3.0, 2.0)]],[[aaa,8.0],[bbb,WrappedArray(3.0, 3.0)]],[[aaa,4.0],[bbb,WrappedArray(3.0, 9.0)]]]|
// |d  |[[[aaa,1.0],[bbb,WrappedArray(3.0, 2.0)]],[[aaa,8.0],[bbb,WrappedArray(3.0, 3.0)]],[[aaa,4.0],[bbb,WrappedArray(3.0, 9.0)]]]|
// +---+----------------------------------------------------------------------------------------------------------------------------+

Update: To follow up on the question's edit, I'll continue from the DataFrame df2 and pull out the nested fields. It's a bit tricky, but here it is:

val df3 = df2.select(
    $"id",
    struct(
        struct($"column2.aa._1".getItem("_2").as("aaa"),$"column2.aa._2".getItem("_2").as("bbb")).as("aa"),
        struct($"column2.bb._1".getItem("_2").as("aaa"),$"column2.bb._2".getItem("_2").as("bbb")).as("bb"),
        struct($"column2.cc._1".getItem("_2").as("aaa"),$"column2.cc._2".getItem("_2").as("bbb")).as("cc")
    ).as("column2")
)
// df3: org.apache.spark.sql.DataFrame = [id: string, column2: struct<aa:struct<aaa:double,bbb:array<double>>,bb:struct<aaa:double,bbb:array<double>>,cc:struct<aaa:double,bbb:array<double>>>]

There's no magic here; you need a good understanding of the struct type and some gymnastics with nested types to put this together and get the expected output:

df3.printSchema
// root
//  |-- id: string (nullable = true)
//  |-- column2: struct (nullable = false)
//  |    |-- aa: struct (nullable = false)
//  |    |    |-- aaa: double (nullable = true)
//  |    |    |-- bbb: array (nullable = true)
//  |    |    |    |-- element: double (containsNull = false)
//  |    |-- bb: struct (nullable = false)
//  |    |    |-- aaa: double (nullable = true)
//  |    |    |-- bbb: array (nullable = true)
//  |    |    |    |-- element: double (containsNull = false)
//  |    |-- cc: struct (nullable = false)
//  |    |    |-- aaa: double (nullable = true)
//  |    |    |-- bbb: array (nullable = true)
//  |    |    |    |-- element: double (containsNull = false)
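An alternative, if you'd rather avoid the column gymnastics: build the target schema explicitly with StructType and map the RDD to Rows yourself. A sketch (untested here; `spark` is the SparkSession available in spark-shell 2.0, and `dfExplicit` is an illustrative name):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The inner struct shared by aa, bb and cc in the target schema.
val inner = StructType(Seq(
  StructField("aaa", DoubleType),
  StructField("bbb", ArrayType(DoubleType))
))

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("column2", StructType(Seq(
    StructField("aa", inner),
    StructField("bb", inner),
    StructField("cc", inner)
  )))
))

// Convert each nested tuple into the Row shape the schema expects.
val rows = anRddB.map { case (id, pairs) =>
  val m = pairs.toMap
  def rowFor(key: String) = m(key) match {
    case ((_, aaa), (_, bbb)) => Row(aaa, bbb)
  }
  Row(id, Row(rowFor("aa"), rowFor("bb"), rowFor("cc")))
}

val dfExplicit = spark.createDataFrame(rows, schema)
```

This trades the select/struct expressions for an up-front schema definition, which can be easier to maintain when the nesting gets deep.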

Note: tested with spark-shell 2.0.