Question

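I am trying to read a JSON file from an S3 location, and I need to use that JSON as the schema for an input DataFrame in PySpark. I can read a similar JSON file from the local filesystem and apply the schema to the DataFrame. Here is the code: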

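import json
from pyspark.sql.types import StructType

# assumes an existing SparkSession `spark` and an input path `input_file`
inputDf = spark.read.option("delimiter", "|").csv(input_file)
with open('input.json', 'r') as S:  # path to my schema file
    saved_schema = json.load(S)

targetDf = spark.createDataFrame(inputDf.rdd, StructType.fromJson(saved_schema))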

The code above works fine.

Now my schema file is on S3, and I can read it with the following code:

import boto3

s3 = boto3.resource('s3')
content_object = s3.Object('bucket-location', 'config/input.json')

# read the object body and decode it into a str
file_content = content_object.get()['Body'].read().decode('utf-8')
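One difference from the local-file version: `json.load` takes a file object, while the S3 read above produces a string, for which `json.loads` is the matching call. The following is a minimal sketch under that assumption, not a confirmed fix. The helper names `schema_dict_from_text` and `struct_type_from_text` are hypothetical, and `sample_content` is a stand-in for `file_content` so the parsing step can be run without S3 or Spark.

```python
import json


def schema_dict_from_text(text):
    """Parse schema JSON held in a str (e.g. a decoded S3 body) into a dict."""
    # json.load() expects a file object; json.loads() expects a str.
    return json.loads(text)


def struct_type_from_text(text):
    """Build a StructType from schema JSON text; requires PySpark."""
    # Imported here so the parsing helper above runs without PySpark installed.
    from pyspark.sql.types import StructType
    return StructType.fromJson(schema_dict_from_text(text))


# Stand-in for `file_content`; in the question it comes from S3.
sample_content = """
{"type": "struct", "fields": [
  {"name": "name", "type": "string", "nullable": true, "metadata": {}},
  {"name": "id", "type": "integer", "nullable": true, "metadata": {}}
]}
"""

schema_dict = schema_dict_from_text(sample_content)
field_names = [field["name"] for field in schema_dict["fields"]]
print(field_names)
```

With a live Spark session, `spark.createDataFrame(inputDf.rdd, struct_type_from_text(file_content))` would then mirror the local-file version.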

Here is the input JSON:

{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : "string",
    "nullable" : true,
    "metadata" : {}
  }, {
    "name" : "id",
    "type" : "integer",
    "nullable" : true,
    "metadata" : {}
  }]
}
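Before handing schema text to `StructType.fromJson`, it can help to confirm that the text parses and has the expected shape. The sketch below is a plain-Python sanity check, not part of PySpark. The function name `check_schema_text` is hypothetical, and the set of required keys is an assumption taken from the keys that appear in the JSON above.

```python
import json

# Assumption: every field carries the keys seen in the schema JSON above.
REQUIRED_FIELD_KEYS = {"name", "type", "nullable", "metadata"}


def check_schema_text(text):
    """Return a list of problems found in schema JSON text (empty if none)."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(schema, dict):
        return ["top level is not a JSON object"]
    problems = []
    if schema.get("type") != "struct":
        problems.append('top-level "type" is not "struct"')
    for i, field in enumerate(schema.get("fields", [])):
        if not isinstance(field, dict):
            problems.append(f"field {i} is not a JSON object")
            continue
        missing = REQUIRED_FIELD_KEYS - field.keys()
        if missing:
            problems.append(f"field {i} is missing {sorted(missing)}")
    return problems


complete = (
    '{"type": "struct", "fields": ['
    '{"name": "id", "type": "integer", "nullable": true, "metadata": {}}]}'
)
truncated = complete[:-1]  # same text with the final closing brace cut off

complete_problems = check_schema_text(complete)
truncated_problems = check_schema_text(truncated)
```

A document that fails to parse is reported as a single "not valid JSON" problem, which separates a damaged file from a schema that parses but lacks expected keys.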

I followed the link below and tried the JSON load and dump approach, but had no luck: PySpark, importing schema through JSON file

Any help is appreciated.

PySpark schema implied from a JSON file containing the schema details

0 answers: