I have JSON data that looks like this (one object per row):
{
  "id": "c428c2e2-c30c-4864-8c12-458ead4b17f5",
  "weight": 73,
  "topics": {
    "type": 1,
    "values": [1, 2, 3]
  }
}
When I read in the data without a specified schema, Spark infers topics.values to be an ArrayType, but I need it to be a VectorUDT for doing ML tasks. So I am trying to read in the data set using a schema as follows:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.ml.linalg import VectorUDT  # pyspark.mllib.linalg for the older RDD-based API

schema = StructType([
    StructField("id", StringType()),
    StructField("weight", IntegerType()),
    StructField("topics", StructType([
        StructField("type", IntegerType()),
        StructField("values", VectorUDT())
    ]))
])
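For reference, the read itself looks like this (a minimal sketch, assuming an existing SparkSession named spark and a hypothetical file name topics.json):

df = spark.read.json("topics.json", schema=schema)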
When I do this I see the types (using dtypes) of the data frame as follows:
[('id', 'string'), ('weight', 'int'), ('topics', 'struct<type:int,values:vector>')]
But there seems to be no actual data in the data frame, as shown by calling first():
Row(id=None, weight=None, topics=None)
And when I write the data frame to disk, I just see empty braces on each line. Seems odd! What am I doing wrong?
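One clue: the JSON reader's default PERMISSIVE mode silently turns records it cannot parse into all-null rows. A sketch that surfaces the underlying parse error instead (same hypothetical file name as above):

df = spark.read.json("topics.json", schema=schema, mode="FAILFAST")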
Answer 0 (score: 1)
Well, I figured it out:
Just needed to change the schema a bit:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml.linalg import VectorUDT

schema = StructType([
    StructField("id", StringType()),
    StructField("weight", DoubleType()),
    StructField("topics", VectorUDT())
])
Now it makes sense: VectorUDT serializes a vector as a struct of a type tag plus the values, which is exactly the shape of the whole topics object in the JSON, so Spark can deserialize topics directly into a dense vector. With the original schema, the plain array under values could not be parsed as a vector, so every record came back null.
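For completeness, a quick check with the fixed schema (a sketch; the file name and the exact printed output are assumptions):

df = spark.read.json("topics.json", schema=schema)
df.first()
# roughly: Row(id='c428c2e2-c30c-4864-8c12-458ead4b17f5', weight=73.0, topics=DenseVector([1.0, 2.0, 3.0]))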