Question

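I'm trying to drop a column that sits under an array under an array in a DataFrame schema. @spektom's answer to "Dropping a nested column from Spark DataFrame" looks like a great starting point. However, my attempts turn non-array fields into arrays, and break down on arrays of arrays. Any help would be much appreciated; I've been hacking at this for a long time.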

val jsRdd = sc.parallelize(
  """{"type":"president",
   |"vals":{
   |"parents":[
   |{"name":"John Adams","salary":25000, "id":2, "children":[{"name":"John Q Adams", "id":6, "salary":25000}]},
   |{"name":"George Bush", "id":41,"salary": 200000, "children":[{"name":"George W Bush", "id":43, "salary":40000},{"name":"Jeb Bush", "id":-1}]}]
   |,"metadata":{"country":"US"}}
   |}""".stripMargin:: Nil)
val jsDf = sqlContext.read.json(jsRdd)

That is the sample input. Running this:

val newJsdf = dropColumn(jsDf, "vals.parents.name")
newJsdf.printSchema

produces the following schema:

root
 |-- type: string (nullable = true)
 |-- vals: struct (nullable = false)
 |    |-- metadata: struct (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- medium: string (nullable = true)
 |    |-- parents: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- children: array (nullable = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |-- salary: long (nullable = true)
 |    |    |    |-- id: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- salary: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)

Note that in the schema above, vals.parents.id is now an array.

And val newJsdf = dropColumn(jsDf, "vals.parents.child.name") blows up.
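For comparison, here is a minimal sketch of what the two drops look like when written with built-in column functions instead of a custom dropColumn. It assumes Spark 3.1 or later (for Column.withField and Column.dropFields, plus functions.transform from 3.0), which is newer than the sqlContext-era API used above, and it targets children.name, since children is the field name that appears in the schema. The session setup and the names noParentName and noChildName are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform}

// Assumption: Spark 3.1+; a local session is created only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").appName("drop-nested").getOrCreate()
import spark.implicits._

// A trimmed-down version of the sample document from the question.
val json = Seq(
  """{"type":"president","vals":{"parents":[{"name":"John Adams","id":2,"salary":25000,"children":[{"name":"John Q Adams","id":6,"salary":25000}]}],"metadata":{"country":"US"}}}""")
val jsDf = spark.read.json(json.toDS())

// Drop vals.parents.name: rewrite each element of the parents array in place,
// so the sibling fields (id, salary, children) keep their original types.
val noParentName = jsDf.withColumn("vals",
  col("vals").withField("parents",
    transform(col("vals.parents"), p => p.dropFields("name"))))

// Drop vals.parents.children.name: one transform per array level.
val noChildName = jsDf.withColumn("vals",
  col("vals").withField("parents",
    transform(col("vals.parents"), p =>
      p.withField("children",
        transform(p.getField("children"), c => c.dropFields("name"))))))

noParentName.printSchema()
noChildName.printSchema()
```

Because transform maps over the array elements rather than extracting each field from the array, sibling fields such as id should stay scalar inside the element struct instead of becoming arrays.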

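FYI - my situation: we have several months of data stored in Parquet tables, but one of the nested fields recently changed from Int to Double. So I'd like to rewrite the older partitions so that all of the data can be read with a single Parquet schema (Parquet chokes on Int if the table's metadata says the column is Double). I'm fine with either dropping this column from the old partitions or casting it to Double; dropping seemed easier.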
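Since the goal is to rewrite old partitions without one nested field, another option is to prune the schema itself and re-read the data with the pruned schema. The sketch below is an assumption-laden illustration, not a tested fix for this table: dropField and dropPath are hypothetical helpers, the hand-built schema only mirrors the shape of the sample, and whether reading Parquet with the pruned schema behaves as hoped would need checking on the actual partitions.

```scala
import org.apache.spark.sql.types._

// Remove the field named by `path` from `dt`. Structs consume one path segment;
// arrays are looked through without consuming a segment.
def dropField(dt: DataType, path: List[String]): DataType = (dt, path) match {
  case (s: StructType, name :: Nil) =>
    StructType(s.fields.filterNot(_.name == name))
  case (s: StructType, name :: rest) =>
    StructType(s.fields.map { f =>
      if (f.name == name) f.copy(dataType = dropField(f.dataType, rest)) else f
    })
  case (a: ArrayType, _ :: _) =>
    a.copy(elementType = dropField(a.elementType, path))
  case _ => dt // primitive type reached, or empty path: nothing to drop
}

// Hypothetical convenience wrapper for dotted paths (field names containing dots are not handled).
def dropPath(schema: StructType, dotted: String): StructType =
  dropField(schema, dotted.split('.').toList).asInstanceOf[StructType]

// Hand-built schema with the same nesting as the sample: vals.parents[].children[].name
val person = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val parent = person.add("children", ArrayType(person))
val schema = StructType(Seq(
  StructField("type", StringType),
  StructField("vals", StructType(Seq(StructField("parents", ArrayType(parent)))))))

val pruned = dropPath(schema, "vals.parents.children.name")
println(pruned.treeString)

// Hypothetical use on an old partition (oldPartitionPath is a placeholder):
// spark.read.schema(pruned).parquet(oldPartitionPath)
```

Working on the schema rather than on column expressions keeps every remaining field's type and nesting unchanged, which is the property the dropColumn attempt above loses.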

Drop a nested array column from a Spark DataFrame

0 answers: