Spark SQL DataFrame

Time: 2018-03-08 17:24:01

Tags: scala apache-spark apache-spark-sql

Data structure:

{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}

I want to load this data into a DataFrame and append Zip to loc. The column name must stay the same (loc). The transformed data should look like this:

{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}

No RDDs. I need a DataFrame operation to achieve this, preferably using the withColumn function. How can I do this?

1 answer:

Answer 0: (score: 0)

Given the data structure

val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""

you can convert it to a DataFrame

val df = spark.read.json(sc.parallelize(jsonString::Nil))

which gives you

+-----------------------------------------------------+
|Emp                                                  |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+

//root
// |-- Emp: struct (nullable = true)
// |    |-- Address: array (nullable = true)
// |    |    |-- element: struct (containsNull = true)
// |    |    |    |-- Zip: string (nullable = true)
// |    |    |    |-- loc: string (nullable = true)
// |    |-- Name: string (nullable = true)
// |    |-- Sal: string (nullable = true)
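The loading step above goes through sc.parallelize, which creates an RDD. Since the question asks to avoid RDDs, here is a minimal sketch of the same read through a Dataset[String] instead. It assumes Spark 2.2 or later (where this overload of json is available) and a local session; the app name is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("emp-json").getOrCreate()
import spark.implicits._ // provides .toDS() on local collections

val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""

// Dataset[String] in, DataFrame out: no explicit RDD is created by the caller.
val df = spark.read.json(Seq(jsonString).toDS())
df.printSchema()
```

The inferred schema should match the one printed above, since only the way the string reaches the reader changes.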

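Now, to get the desired output, you need to expand the struct column Emp into separate columns, apply a udf function to the Address array column, and then pack the columns back into a struct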

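import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Append Zip to loc in each Address element; the case class address (defined below) must be in scope.
def attachZipWithLoc = udf((array: Seq[Row]) => array.map(row => address(row.getAs[String]("loc") + row.getAs[String]("Zip"), row.getAs[String]("Zip"))))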

df.select($"Emp.*")
  .withColumn("Address", attachZipWithLoc($"Address"))
  .select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))

where address, used inside the udf, is a case class

case class address(loc: String, Zip: String)

This should give you

+-----------------------------------------------------------+
|Emp                                                        |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+

//root
// |-- Emp: struct (nullable = false)
// |    |-- Name: string (nullable = true)
// |    |-- Sal: string (nullable = true)
// |    |-- Address: array (nullable = true)
// |    |    |-- element: struct (containsNull = true)
// |    |    |    |-- loc: string (nullable = true)
// |    |    |    |-- Zip: string (nullable = true)
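As an alternative to the udf, the same reshaping can probably be done with built-in functions only. The sketch below is not part of the original answer: it assumes Spark 2.4 or later, where the SQL higher-order function transform is available, and uses it through expr together with withColumn, as the question preferred.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, struct}

val spark = SparkSession.builder().master("local[*]").appName("emp-transform").getOrCreate()
import spark.implicits._

val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
val df = spark.read.json(Seq(jsonString).toDS())

// Rebuild each Address element: loc becomes loc followed by Zip, Zip stays unchanged.
val newAddress = expr(
  "transform(Emp.Address, a -> named_struct('loc', concat(a.loc, a.Zip), 'Zip', a.Zip))")

// Replace the Emp column with a struct that carries the rewritten Address array.
val result = df.withColumn(
  "Emp",
  struct(col("Emp.Name").as("Name"), col("Emp.Sal").as("Sal"), newAddress.as("Address")))

result.show(false)
```

One behavioural difference to keep in mind: concat returns null when either input is null, whereas string concatenation inside the udf would produce text containing the word "null".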

Now, to get the JSON back, you can use .toJSON, and you should get

+-----------------------------------------------------------------------------------------------------------------+
|value                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+
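For reference, a minimal sketch of the .toJSON call itself. To stay self-contained it builds a stand-in for the transformed DataFrame from case classes; AddressRow, EmpRow and Record are hypothetical names introduced only for this illustration, and the script is meant to be pasted into spark-shell or run as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("emp-tojson").getOrCreate()
import spark.implicits._

// Hypothetical case classes mirroring the transformed schema shown earlier.
case class AddressRow(loc: String, Zip: String)
case class EmpRow(Name: String, Sal: String, Address: Seq[AddressRow])
case class Record(Emp: EmpRow)

// Stand-in for the DataFrame produced by the transformation step.
val transformed = Seq(
  Record(EmpRow("John", "2000", Seq(AddressRow("Sanjose222", "222"), AddressRow("dayton333", "333"))))
).toDF()

// toJSON returns a Dataset[String] holding one JSON document per row.
val json = transformed.toJSON
json.show(false)
```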