Question

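I read JSON files into a DataFrame. The JSON can have a struct field, messages, whose contents are specific to name, as in the two examples below (Json1 and Json2).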

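When I read the data from these JSON files into a DataFrame, I get a single merged schema (the root tree shown after the two examples).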

Json1
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"foo",
   "messages":[
      {
         "a":1810,
         "b":"hello",
         "c":390
      }
   ]
}

Json2
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"bar",
   "messages":[
      {
         "b":"my",
         "d":"world"
      }
   ]
}

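That is fine. Now, when I save to Parquet files partitioned by name, how can I have different schemas in the foo and bar partitions? For reference, the merged schema is: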

root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)

It is fine if I get a schema with all the fields of foo and bar when I read from the root path. But when I read from path/name=foo, I expect only the foo schema.

Answer 1

1. Partitioning & Storing as Parquet file:

If you save in Parquet format, then when reading path/name=foo, specify a schema that includes all the required fields (a, b, c). Spark will then load only those fields.

If we don't specify a schema, all fields (a, b, c, d) will be included in the DataFrame.

Example:

# define a StructType containing only the fields needed for foo (a, b, c)
schema = ...
spark.read.schema(schema).parquet("path/name=foo").printSchema()
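
As a rough sketch of what such a schema could look like, the snippet below builds it as a DDL-formatted string, which `spark.read.schema(...)` accepts in place of a StructType. The helper name `partition_schema_ddl` is made up for illustration, and the column types are taken from the merged schema in the question (long shown as BIGINT). The Spark call itself is left as a comment because it needs a live SparkSession, assumed here to be named `spark`.

```python
def partition_schema_ddl(message_fields):
    """Build a DDL schema string that keeps only the given `messages` struct fields.

    `message_fields` maps field name -> Spark SQL type name (hypothetical helper).
    """
    struct_body = ", ".join(
        f"{name}: {dtype}" for name, dtype in message_fields.items()
    )
    return f"ts STRING, messages ARRAY<STRUCT<{struct_body}>>"


# Only the fields that foo actually uses inside `messages`.
foo_schema = partition_schema_ddl({"a": "BIGINT", "b": "STRING", "c": "BIGINT"})

# With a SparkSession available as `spark` (assumption), it would be used like:
# spark.read.schema(foo_schema).parquet("path/name=foo").printSchema()
```

The name column is not listed because it is encoded in the directory name rather than stored in the files under path/name=foo.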

2. Partitioning & Storing as JSON/CSV file:

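Then Spark does not write the d field into the path/name=foo files (foo only has a, b, c), so when we read only the name=foo directory, the resulting data does not contain d.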

Example:

spark.read.json("path/name=foo").printSchema()
spark.read.csv("path/name=foo").printSchema()
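
To see why this works for the JSON case, here is a plain-Python illustration (not Spark code) of what a foo record looks like once null fields are left out when it is written. The helper `drop_nulls` is hypothetical and only mimics that behaviour. A reader that infers the schema from such lines never sees d.

```python
import json


def drop_nulls(value):
    """Recursively remove dict keys whose value is None (illustrative helper)."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value]
    return value


# A foo row as it appears under the merged schema: d exists but is null.
foo_row = {
    "ts": "2020-05-17T00:00:03Z",
    "messages": [{"a": 1810, "b": "hello", "c": 390, "d": None}],
}

# The line that would end up in the name=foo directory has no d key at all.
foo_line = json.dumps(drop_nulls(foo_row))
print(foo_line)
```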

Answer 2

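You can change the schema before saving the DataFrame to a partition. To do this, you have to filter the records for each partition and then save them into the corresponding folder.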

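from pyspark.sql import functions as f

# Keep only the foo rows, then keep only the columns that have at least one
# non-null value among them (intended to drop d for foo, and a, c for bar).
foo_df = df.filter(f.col('name') == 'foo')
foo_df = foo_df.select(*[c for c in foo_df.columns if foo_df.filter(f.col(c).isNotNull()).count() > 0])

# Then save the DataFrame into the matching partition folder.
foo_df.write.json('path/name=foo')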

Now each partition will have a different schema.
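
One caveat: in the question's data the differing fields (a, c, d) sit inside the messages struct, not at the top level, so a per-column null check may need to look inside the struct. As a minimal plain-Python sketch (not Spark code; the function name `message_fields_by_name` is made up), this is one way to work out which message fields each name actually uses, which could then drive the per-partition schema:

```python
from collections import defaultdict


def message_fields_by_name(records):
    """Return, per name, the sorted message fields that hold a non-null value."""
    fields = defaultdict(set)
    for rec in records:
        for msg in rec.get("messages") or []:
            fields[rec["name"]].update(k for k, v in msg.items() if v is not None)
    return {name: sorted(keys) for name, keys in fields.items()}


# The two sample documents from the question.
records = [
    {"ts": "2020-05-17T00:00:03Z", "name": "foo",
     "messages": [{"a": 1810, "b": "hello", "c": 390}]},
    {"ts": "2020-05-17T00:00:03Z", "name": "bar",
     "messages": [{"b": "my", "d": "world"}]},
]

used_fields = message_fields_by_name(records)
print(used_fields)
```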

How to have different schemas in Parquet partitions
