Question

I have created a DataFrame on top of a Parquet file and can see its schema. Now I want to create a DataFrame from the printSchema output.

df = spark.read.parquet("s3/location")
df.printSchema()

The output looks like [(cola, string), (colb, string)]. Now I want to create a DataFrame from the output of printSchema. What is the best way to do that?

Adding more detail on what we have done so far:

df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()

We got the following result:

root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)

We want to create a DataFrame with two columns: 1) colname, 2) datatype

But if we run the code below:

schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)

schema_df.show()

we get the output below, with all the column names and data types packed into a single row:

+--------------------+------+
|              fields|  type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+

We are looking for output like this:

Atp| string 
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string

Answer 1

Not sure which language you are using, but in PySpark I would do it like this:

schemaRDD = spark.sparkContext.parallelize([df.schema.json()])
schema_df = spark.read.json(schemaRDD)
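
Note that this yields a single row, because the schema JSON is one object whose `fields` key holds an array of column descriptions. Below is a minimal sketch of one way to get one (colname, datatype) pair per column instead, by parsing that JSON in plain Python. The helper name `schema_rows` and the hand-written `sample` string are illustrative and not from the original post; the sample only assumes the usual shape of what `df.schema.json()` returns.

```python
import json


def schema_rows(schema_json):
    """Return a list of (colname, datatype) pairs from a StructType JSON string."""
    fields = json.loads(schema_json)["fields"]
    rows = []
    for field in fields:
        dtype = field["type"]
        # Simple types are plain strings ("string", "long", ...).
        # Nested types (struct, array, map) are JSON objects; keep only their type tag.
        if not isinstance(dtype, str):
            dtype = dtype["type"]
        rows.append((field["name"], dtype))
    return rows


# Hand-written sample standing in for df1.schema.json() (two columns only).
sample = (
    '{"type":"struct","fields":['
    '{"name":"Atp","type":"string","nullable":true,"metadata":{}},'
    '{"name":"Ccetp","type":"string","nullable":true,"metadata":{}}]}'
)

rows = schema_rows(sample)
print(rows)  # [('Atp', 'string'), ('Ccetp', 'string')]
```

With a live session, the pairs could then be passed to `spark.createDataFrame(schema_rows(df1.schema.json()), ["colname", "datatype"])`. A shorter route that may be enough here is `spark.createDataFrame(df1.dtypes, ["colname", "datatype"])`, since `DataFrame.dtypes` already returns a list of (name, type) tuples.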
