Updating a nested column in a DataFrame

Date: 2018-05-01 21:00:47

Tags: scala apache-spark dataframe

I have a DataFrame with two levels of nested fields:

 root
 |-- request: struct (nullable = true)
 |    |-- dummyID: string (nullable = true)
 |    |-- data: struct (nullable = true)
 |    |    |-- fooID: string (nullable = true)
 |    |    |-- barID: string (nullable = true)

I want to update the value of the fooID column here. Using this question as a reference, How to add a nested column to a DataFrame, I was able to update the value of a first-level column such as dummyID.
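A first-level update along those lines might look like this (a sketch, assuming the DataFrame is named df and using a placeholder replacement value):

import org.apache.spark.sql.functions.{col, lit, struct}

// Rebuild the top-level struct: replace dummyID, keep `data` as-is
val updated = df.withColumn("request",
  struct(
    lit("new_test_id").alias("dummyID"),
    col("request.data").alias("data")))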

Input data:

{
    "request": {
        "dummyID": "test_id",
        "data": {
            "fooID": "abc",
            "barID": "1485351"
        }
    }
}

Output data:

{
    "request": {
        "dummyID": "test_id",
        "data": {
            "fooID": "def",
            "barID": "1485351"
        }
    }
}

How can I do this using Scala?

2 answers:

Answer 0 (score: 1):

Here is a generic solution to this problem that makes it possible to update any number of nested values, at any level, based on an arbitrary function applied in a recursive traversal:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

def mutate(df: DataFrame, fn: Column => Column): DataFrame = {
  // Get a projection with fields mutated by `fn` and select it
  // out of the original frame with the schema reassigned to the original
  // frame (explained later)
  df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, df.schema)
}

def traverse(schema: StructType, fn: Column => Column, path: String = ""): Array[Column] = {
  schema.fields.map(f => {
    f.dataType match {
      // Recurse into nested structs, extending the dot-separated path
      case s: StructType => struct(traverse(s, fn, path + f.name + "."): _*)
      // Apply the caller's function to each leaf column
      case _ => fn(col(path + f.name))
    }
  })
}

This is effectively equivalent to the usual "just redefine the whole struct as a projection" solutions, but it automates re-nesting the fields with the original structure AND preserves nullability/metadata (which are lost when you redefine the structs manually). Annoyingly, preserving those properties isn't possible while creating the projection, so the code above redefines the schema manually.
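For instance, a naive projection of the same columns (a sketch, using the df from the example below) is not guaranteed to keep those properties; the assertResult check at the end of the example verifies that mutate does:

// Without the schema reassignment done inside `mutate`, this
// projection's schema may differ from df.schema in nullability/metadata
val naive = df.select(traverse(df.schema, c => c): _*)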

One limitation of the mutate approach is that it cannot add new fields, only update existing ones (although the map can be changed to a flatMap, with the function allowed to return an Array[Column], if you don't care about preserving nullability/metadata); a sketch of such a variant follows.
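One way that variant might look, assuming losing nullability/metadata is acceptable (the name traverseFlat is hypothetical, not from the original answer):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// flatMap variant: `fn` may return several columns per leaf, so new
// sibling fields can be emitted; there is no schema reassignment here,
// so struct fields are re-aliased explicitly instead
def traverseFlat(schema: StructType, fn: Column => Array[Column], path: String = ""): Array[Column] = {
  schema.fields.flatMap(f => f.dataType match {
    case s: StructType => Array(struct(traverseFlat(s, fn, path + f.name + "."): _*).alias(f.name))
    case _ => fn(col(path + f.name))
  })
}

// e.g. an identity traversal: df.select(traverseFlat(df.schema, c => Array(c)): _*)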

An example application:

// Assumes a SparkSession in scope, e.g. `val spark: SparkSession`
import spark.implicits._

case class Organ(name: String, count: Int)
case class Disease(id: Int, name: String, organ: Organ)
case class Drug(id: Int, name: String, alt: Array[String])

val df = Seq(
  (1, Drug(1, "drug1", Array("x", "y")), Disease(1, "disease1", Organ("heart", 2))),
  (2, Drug(2, "drug2", Array("a")), Disease(2, "disease2", Organ("eye", 3)))
).toDF("id", "drug", "disease")

df.show(false)

+---+------------------+-------------------------+
|id |drug              |disease                  |
+---+------------------+-------------------------+
|1  |[1, drug1, [x, y]]|[1, disease1, [heart, 2]]|
|2  |[2, drug2, [a]]   |[2, disease2, [eye, 3]]  |
+---+------------------+-------------------------+

// Update the integer field ("count") at the lowest level:
val df2 = mutate(df, c => if (c.toString == "disease.organ.count") c - 1 else c)
df2.show(false)

+---+------------------+-------------------------+
|id |drug              |disease                  |
+---+------------------+-------------------------+
|1  |[1, drug1, [x, y]]|[1, disease1, [heart, 1]]|
|2  |[2, drug2, [a]]   |[2, disease2, [eye, 2]]  |
+---+------------------+-------------------------+

// This will NOT necessarily be equal unless the metadata and nullability
// of all fields is preserved (as the code above does)
assertResult(df.schema.toString)(df2.schema.toString)
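Applied to the schema in the question, the same helper could set fooID directly; a sketch, assuming the question's data is loaded into a DataFrame named df:

import org.apache.spark.sql.functions.lit

// Replace the nested string, leaving every other field untouched
val updated = mutate(df, c => if (c.toString == "request.data.fooID") lit("def") else c)

Additionally, this generalizes to a typed Dataset[T]; a minimal sketch (the helper name mutateDS is an assumption, round-tripping through the untyped API):

import org.apache.spark.sql.{Column, Dataset, Encoder}

// Drop to the untyped DataFrame API, mutate, then restore the encoder
def mutateDS[T: Encoder](ds: Dataset[T], fn: Column => Column): Dataset[T] =
  mutate(ds.toDF, fn).as[T]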

Answer 1 (score: 0):

One approach, albeit cumbersome, is to completely unpack and recreate the column, explicitly referencing each element of the original struct:

import org.apache.spark.sql.functions.{col, lit, struct}

// Rebuild the `person` struct field by field, substituting the new value
dataFrame.withColumn("person",
    struct(
        col("person.age").alias("age"),
        struct(
            col("person.name.first").alias("first"),
            lit("some new value").alias("last")).alias("name")))