Question

I have a DataFrame with the following schema:

root
 |-- Id: integer (nullable = true)
 |-- Id_FK: integer (nullable = true)
 |-- Foo: integer (nullable = true)
 |-- Bar: string (nullable = true)
 |-- XPTO: string (nullable = true)

From this DataFrame, I want to create a JSON file that maps each column name to its type, like this:

{
 "Id": "integer",
 "Id_FK": "integer",
 "Foo": "integer",
 "Bar": "string",
 "XPTO": "string"
}

I am trying to do this with pyspark, but I can't find any way to do it. Can anyone help me?

Answer 1

Here is a solution that first builds a dictionary by iterating over the fields of the schema, then uses json.dumps to convert the dictionary to a JSON string:

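from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import json

# sample schema matching the question (for a real DataFrame, use df.schema)
schema = StructType(
    [
        StructField("Id", IntegerType()),
        StructField("Id_FK", IntegerType()),
        StructField("Foo", IntegerType()),
        StructField("Bar", StringType()),
        StructField("XPTO", StringType()),
    ])

# build a dictionary where each item is a pair of col_name : col_type
# typeName() returns the lowercase name ("integer", "string"), whereas
# str(c.dataType) gives a class-style name such as "IntegerType"
col_types = {}
for c in schema:
    col_types[c.name] = c.dataType.typeName()

# convert to a JSON string
data = json.dumps(col_types)

# save to file; the with-block closes the file even if the write fails
with open("output.txt", "w") as text_file:
    text_file.write(data)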
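
As an alternative sketch, the same mapping can be derived from the JSON string that df.schema.json() returns, which lists every field with its name and type. The helper name column_types_from_schema_json and the hand-written sample string below are assumptions for illustration; with a real DataFrame you would pass df.schema.json() instead. For complex columns (struct, array, map) this sketch keeps only the outer type name.

```python
import json


def column_types_from_schema_json(schema_json):
    """Reduce Spark's schema JSON to a flat {column name: type name} dict."""
    fields = json.loads(schema_json)["fields"]
    result = {}
    for field in fields:
        field_type = field["type"]
        # Simple types are plain strings ("integer", "string"); complex types
        # (struct, array, map) are nested objects whose "type" key names them.
        if isinstance(field_type, dict):
            field_type = field_type["type"]
        result[field["name"]] = field_type
    return result


# Hand-written sample in the shape produced by df.schema.json()
# (hypothetical stand-in; replace with df.schema.json() in practice).
sample_schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "Id", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "Id_FK", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "Foo", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "Bar", "type": "string", "nullable": True, "metadata": {}},
        {"name": "XPTO", "type": "string", "nullable": True, "metadata": {}},
    ],
})

column_types = column_types_from_schema_json(sample_schema_json)

# indent=1 writes one column per line, similar to the layout in the question
print(json.dumps(column_types, indent=1))
```

This version needs only the standard json module once the schema string is available, so the reduction step can be checked without a running Spark session.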
