数据结构:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
现在我想将数据加载到数据框中,并希望将zip附加到loc。 loc列名称应该相同(loc)。转换后的数据应该是这样的:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
没有RDD。我需要一个数据帧操作来实现这一点,最好使用withColumn
函数。我怎么能这样做?
答案 0 :(得分:0)
给定数据结构为
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
您可以将其转换为数据框
val df = spark.read.json(sc.parallelize(jsonString::Nil))
会给你
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
现在要获得所需的输出,你需要将struct Emp列分开以分隔列和使用udf函数中的地址数组列来获得所需的结果
import org.apache.spark.sql.functions._
def attachZipWithLoc = udf((array: Seq[Row])=> array.map(row => address(row.getAs[String]("loc")+row.getAs[String]("Zip"), row.getAs[String]("Zip"))))
df.select($"Emp.*")
.withColumn("Address", attachZipWithLoc($"Address"))
.select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
address
类中的udf
为case class
case class address(loc: String, Zip: String)
应该给你
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
现在要获取 json ,你可以使用.toJSON
,你应该得到
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+