Spark - mapping a flat dataframe to a configurable nested JSON schema

Date: 2019-04-25 12:49:17

Tags: json scala apache-spark case-class

I have a flat dataframe with 5-6 columns. I want to nest them and convert it into a nested dataframe so that I can then write it out in Parquet format.

However, I don't want to use case classes, because I am trying to keep the code as configurable as possible. I am stuck on this part and need some help.

My input:

ID  ID-2  Count(apple)  Count(banana)  Count(potato)  Count(Onion)
 1    23             1              0              2             0
 2    23             0              1              0             1
 2    29             1              0              1             0

My output:

Row 1:

{
  "id": 1,
  "ID-2": 23,
  "fruits": {
    "count of apple": 1,
    "count of banana": 0
  },
  "vegetables": {
    "count of potato": 2,
    "count of onion": 0
  }
} 

I tried using the "map" function on the Spark dataframe, where I would map the values into a case class. However, I would be using the names of these fields, and they might change as well.

I don't want to maintain a case class and map the rows to the SQL column names, because that would involve a code change every time.
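Concretely, the kind of hard-coded mapping I am trying to avoid looks roughly like the sketch below ("df" stands for my input dataframe; the case-class shapes and field names are only illustrative):

import spark.implicits._  // assumes an active SparkSession named `spark`

case class Fruits(countOfApple: Int, countOfBanana: Int)
case class Vegetables(countOfPotato: Int, countOfOnion: Int)
case class NestedRow(id: Int, id2: Int, fruits: Fruits, vegetables: Vegetables)

// Every schema change means editing both the case classes and this row mapping.
val nested = df.map { r =>
  NestedRow(
    r.getAs[Int]("ID"),
    r.getAs[Int]("ID-2"),
    Fruits(r.getAs[Int]("Count(apple)"), r.getAs[Int]("Count(banana)")),
    Vegetables(r.getAs[Int]("Count(potato)"), r.getAs[Int]("Count(Onion)")))
}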

I was thinking of maintaining a HashMap whose keys stay aligned with the dataframe's column names. For example, in the sample above I would map "Count(apple)" to "count of apple". However, I can't think of a simple, clean way to pass the schema in as configuration and then apply it in my code.
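For illustration, the configuration I have in mind would look roughly like the sketch below (the group and field names are only examples; in practice the maps would be loaded from a config file rather than hard-coded):

// Rename mapping: dataframe column -> desired output field name.
val renameConfig: Map[String, String] = Map(
  "Count(apple)"  -> "count of apple",
  "Count(banana)" -> "count of banana",
  "Count(potato)" -> "count of potato",
  "Count(Onion)"  -> "count of onion")

// Grouping: nested struct name -> the flat columns it should contain.
val groupConfig: Map[String, Seq[String]] = Map(
  "fruits"     -> Seq("Count(apple)", "Count(banana)"),
  "vegetables" -> Seq("Count(potato)", "Count(Onion)"))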

3 answers:

Answer 0 (score: 2):

Here is one approach based on the Scala Map type, in which we create the column mapping for the following dataset:

val df = Seq(
  (1, 23, 1, 0, 2, 0),
  (2, 23, 0, 1, 0, 1),
  (2, 29, 1, 0, 1, 0)).toDF("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")

First we declare the mapping with a scala.collection.immutable.Map collection, together with the function responsible for applying the mapping:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}

val colMapping = Map(
  "count(banana)" -> "no of banana",
  "count(apple)" -> "no of apples",
  "count(potato)" -> "no of potatos",
  "count(onion)" -> "no of onions")

def mapColumns(colsMapping: Map[String, String], df: DataFrame): DataFrame = {
  val mapping = df
    .columns
    .map { c => if (colsMapping.contains(c)) df(c).alias(colsMapping(c)) else df(c) }
    .toList

  df.select(mapping: _*)
}

The function iterates through the columns of the given dataframe and identifies the columns that have a key in common with the mapping. It then returns the columns renamed (via an alias) according to the applied mapping.

Output of mapColumns(colMapping, df).show(false):

+---+----+------------+------------+-------------+------------+
|ID |ID-2|no of apples|no of banana|no of potatos|no of onions|
+---+----+------------+------------+-------------+------------+
|1  |23  |1           |0           |2            |0           |
|2  |23  |0           |1           |0            |1           |
|2  |29  |1           |0           |1            |0           |
+---+----+------------+------------+-------------+------------+

Finally, we generate fruits and vegetables via the struct type:

val df1 = mapColumns(colMapping, df)

df1.withColumn("fruits", struct(col(colMapping("count(banana)")), col(colMapping("count(apple)"))))
  .withColumn("vegetables", struct(col(colMapping("count(potato)")), col(colMapping("count(onion)"))))
  .drop(colMapping.values.toList:_*)
  .toJSON
  .show(false)

Note that we drop all the columns listed in the colMapping collection after the transformations are complete.

Output (one JSON document per row):

{"ID":1,"ID-2":23,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":2,"no of onions":0}}
{"ID":2,"ID-2":23,"fruits":{"no of banana":1,"no of apples":0},"vegetables":{"no of potatos":0,"no of onions":1}}
{"ID":2,"ID-2":29,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":1,"no of onions":0}}
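Since the original goal is to write Parquet rather than JSON, the same nested dataframe can also be written out directly instead of calling toJSON; a minimal sketch, assuming a placeholder output path:

// Keep the struct columns and write nested Parquet directly
// ("/tmp/nested-output" is a hypothetical path).
val nestedDf = df1
  .withColumn("fruits", struct(col(colMapping("count(banana)")), col(colMapping("count(apple)"))))
  .withColumn("vegetables", struct(col(colMapping("count(potato)")), col(colMapping("count(onion)"))))
  .drop(colMapping.values.toList: _*)

nestedDf.write.mode("overwrite").parquet("/tmp/nested-output")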

Answer 1 (score: 0):

:: (double colon) in Scala acts as "cons" on a Scala List. It is how you create a Scala List, or prepend an element to an existing list (producing a new list, since List is immutable).

scala> val aList = 24 :: 34 :: 56 :: Nil
aList: List[Int] = List(24, 34, 56)

scala> 99 :: aList
res3: List[Int] = List(99, 24, 34, 56)

In the first example, Nil is the empty list and acts as the terminator of the right-most cons operation.

However:

scala> val anotherList = 23 :: 34
<console>:12: error: value :: is not a member of Int
       val anotherList = 23 :: 34

This throws an error because there is no existing list to prepend to; the right-hand operand of :: must itself be a List.
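For comparison, here are a couple of forms that do compile, terminating the chain with Nil or prepending to an existing List:

scala> val fixedList = 23 :: 34 :: Nil
fixedList: List[Int] = List(23, 34)

scala> val fixedList2 = 23 :: List(34)
fixedList2: List[Int] = List(23, 34)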

Answer 2 (score: 0):

import org.apache.spark.sql.functions.{collect_list, struct}

val df = spark.sqlContext.read.option("header", "true").csv("/sampleinput.txt")

val df1 = df
  .withColumn("fruits", struct("Count(apple)", "Count(banana)"))
  .withColumn("vegetables", struct("Count(potato)", "Count(Onion)"))
  .groupBy("ID", "ID-2")
  .agg(collect_list("fruits") as "fruits", collect_list("vegetables") as "vegetables")
  .toJSON

df1.take(1)

Output:

{"ID":"2","ID-2":"23","fruits":[{"Count(apple)":"0","Count(banana)":"1"}],"vegetables":[{"Count(potato)":"0","Count(Onion)":"1"}]}