Question

I used df.printSchema() in PySpark, and it printed the schema as a tree. Now I need to save that output in a variable or a text file.

I tried the following ways of saving it, but they did not work.

v = str(df.printSchema())  
print(v) 
#and
df.printSchema().saveAsTextFile(<path>)

I need the schema saved in the following format:

 |-- COVERSHEET: struct (nullable = true)
 |    |-- ADDRESSES: struct (nullable = true)
 |    |    |-- ADDRESS: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _city: string (nullable = true)
 |    |    |    |-- _primary: long (nullable = true)
 |    |    |    |-- _state: string (nullable = true)
 |    |    |    |-- _street: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- _zip: long (nullable = true)
 |    |-- CONTACTS: struct (nullable = true)
 |    |    |-- CONTACT: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)

Answer 1

You need treeString (which, for some reason, I could not find in the Python API):

# v will be a string
v = df._jdf.schema().treeString()

You can convert it to an RDD and use saveAsTextFile, or use a Python-specific API to write the string to a file.
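The attempts in the question fail because printSchema() prints the tree and returns None, so str(df.printSchema()) is just "None". As a minimal sketch of the "Python-specific API" route, the helpers below capture the printed text and write it to a local file. The names capture_print_schema and save_schema_tree are hypothetical, and the sketch assumes that printSchema() writes through Python's standard output (as PySpark's Python-side implementation appears to do); if the text were printed by the JVM directly, the capture would come back empty.

```python
import contextlib
import io


def capture_print_schema(df):
    """Return the text that df.printSchema() would print."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.printSchema()  # returns None; the tree goes to stdout
    return buffer.getvalue()


def save_schema_tree(df, path):
    """Write the schema tree of df to a local text file and return it."""
    tree = capture_print_schema(df)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(tree)
    return tree


# Usage, assuming `df` is an existing PySpark DataFrame:
# save_schema_tree(df, "schema.txt")
```

This avoids the private _jdf attribute, at the cost of depending on how printSchema() emits its output.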

Answer 2

You can also use the following:

# Assumes schema = df.schema (the DataFrame's StructType)
temp_rdd = sc.parallelize(schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
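Storing the schema as data rather than as printed text does not rule out getting the tree layout back later. The sketch below is one way to do that under stated assumptions: it takes the JSON text that StructType.json() produces (for example, df.schema.json() saved to a file) and rebuilds a tree in the style shown in the question. The function names are hypothetical, only struct, array, and map nesting is handled, and the output approximates Spark's own rendering rather than being guaranteed identical to it.

```python
import json


def _type_name(data_type):
    """Simple types are plain strings; complex types are dicts with a "type" key."""
    return data_type if isinstance(data_type, str) else data_type["type"]


def _flag(value):
    """Render a JSON boolean the way the schema tree does (true/false)."""
    return str(value).lower()


def _render(data_type, prefix, lines):
    """Append one tree line per child of a complex type, recursing into nesting."""
    if isinstance(data_type, str):
        return  # simple type: nothing nested below it
    kind = data_type["type"]
    child_prefix = prefix + "    |"
    if kind == "struct":
        for field in data_type["fields"]:
            lines.append(
                f"{prefix}-- {field['name']}: {_type_name(field['type'])} "
                f"(nullable = {_flag(field['nullable'])})"
            )
            _render(field["type"], child_prefix, lines)
    elif kind == "array":
        element = data_type["elementType"]
        lines.append(
            f"{prefix}-- element: {_type_name(element)} "
            f"(containsNull = {_flag(data_type['containsNull'])})"
        )
        _render(element, child_prefix, lines)
    elif kind == "map":
        lines.append(f"{prefix}-- key: {_type_name(data_type['keyType'])}")
        _render(data_type["keyType"], child_prefix, lines)
        lines.append(
            f"{prefix}-- value: {_type_name(data_type['valueType'])} "
            f"(valueContainsNull = {_flag(data_type['valueContainsNull'])})"
        )
        _render(data_type["valueType"], child_prefix, lines)


def tree_string(schema_json):
    """Build a printSchema-style tree from schema JSON text."""
    lines = ["root"]
    _render(json.loads(schema_json), " |", lines)
    return "\n".join(lines) + "\n"


# Usage, assuming `df` is an existing PySpark DataFrame:
# text = tree_string(df.schema.json())
```

Because the input is plain JSON text, this part runs without a Spark session, which can be convenient when the schema file is inspected on a machine that has no cluster access.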

How to save the result of printSchema to a file in PySpark
