I have JSON data that looks like this (one object per row):
{
  "id": "c428c2e2-c30c-4864-8c12-458ead4b17f5",
  "weight": 73,
  "topics": {
    "type": 1,
    "values": [1, 2, 3]
  }
}
When I read in the data without a specified schema, Spark infers topics.values to be an ArrayType, but I need it to be a VectorUDT for doing ML tasks. So I am trying to read in the data set using a schema as follows:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.ml.linalg import VectorUDT  # pyspark.mllib.linalg for the older RDD-based API

schema = StructType([
    StructField("id", StringType()),
    StructField("weight", IntegerType()),
    StructField("topics", StructType([
        StructField("type", IntegerType()),
        StructField("values", VectorUDT())
    ]))
])
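For reference, the read itself looks like this (a minimal sketch, assuming an existing SparkSession named spark and a hypothetical file name topics.json):

df = spark.read.json("topics.json", schema=schema)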
When I do this I see the types (using dtypes) of the data frame as follows:
[('id', 'string'), ('weight', 'int'), ('topics', 'struct<type:int,values:vector>')]
But there seems to be no actual data in the data frame, as shown by calling first():
Row(id=None, weight=None, topics=None)
And when I write the data frame to disk, I just see empty braces on each line. Seems odd! What am I doing wrong?
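One clue: the JSON reader's default PERMISSIVE mode silently turns records it cannot parse into all-null rows. A sketch that surfaces the underlying parse error instead (same hypothetical file name as above):

df = spark.read.json("topics.json", schema=schema, mode="FAILFAST")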
Answer 0 (score: 1)
Well, I figured it out:
Just needed to change the schema a bit:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml.linalg import VectorUDT

schema = StructType([
    StructField("id", StringType()),
    StructField("weight", DoubleType()),
    StructField("topics", VectorUDT())
])
Now it makes sense: VectorUDT serializes a vector as a struct of a type tag plus the values, which is exactly the shape of the whole topics object in the JSON, so Spark can deserialize topics directly into a dense vector. With the original schema, the plain array under values could not be parsed as a vector, so every record came back null.
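For completeness, a quick check with the fixed schema (a sketch; the file name and the exact printed output are assumptions):

df = spark.read.json("topics.json", schema=schema)
df.first()
# roughly: Row(id='c428c2e2-c30c-4864-8c12-458ead4b17f5', weight=73.0, topics=DenseVector([1.0, 2.0, 3.0]))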