Question

I am trying to create a manual schema for a DataFrame. The data I am passing in is an RDD created from JSON. Here is my initial data:

json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])

And this is how I specify the schema:

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True
                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])

When I call sqlContext.createDataFrame(json2, schema) and then try to show() the resulting DataFrame, I get the following error:

ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType

Answer 1

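First of all, json2 is just an RDD[String]. Spark has no special knowledge of the serialization format used to encode the data. Moreover, createDataFrame expects an RDD of Row or some other product type, which is clearly not the case here.

In Scala you could use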

sqlContext.read.schema(schema).json(rdd)

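with an RDD[String], but there are two problems:

- This method is not directly accessible in PySpark.
- Even if it were, the schema you have created is simply invalid:
  - pandas is a struct, not an array
  - pandas.happy is a string, not a boolean
  - pandas.attributes is a string, not an array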
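
The mismatches listed above can be confirmed without Spark by decoding the record from the question with Python's standard json module and looking at the types it produces. This is only a quick diagnostic sketch; the variable names are illustrative.

```python
import json

# The record from the question, exactly as it was passed to sc.parallelize
raw = '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'

record = json.loads(raw)
pandas_value = record["pandas"]

# Map each problem field to the Python type the JSON decoder produced for it
observed = {
    "pandas": type(pandas_value).__name__,
    "pandas.happy": type(pandas_value["happy"]).__name__,
    "pandas.attributes": type(pandas_value["attributes"]).__name__,
}
print(observed)
# {'pandas': 'dict', 'pandas.happy': 'str', 'pandas.attributes': 'str'}
```

A dict corresponds to a struct, and the two str values are plain strings, which is why ArrayType, BooleanType and the nested ArrayType in the question's schema do not describe this data.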

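A schema is used only to avoid type inference, not for type casting or any other transformation. If you want to transform the data, you have to parse it first: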

def parse(s: str) -> Row:
    return ...

rdd.map(parse).toDF(schema)
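
As an illustration of what such a parsing step might look like for the record in the question, here is a minimal sketch. The name parse_record is hypothetical, and it returns plain nested tuples rather than Row objects so that it runs without Spark; PySpark accepts tuples for struct values when an explicit schema is supplied. The tuple order (attributes, happy, id, pt, zip) is an assumption chosen to match the corrected schema given later in this answer.

```python
import json

def parse_record(s: str) -> tuple:
    """Decode one JSON line and convert its string-encoded values to real types."""
    record = json.loads(s)
    p = record["pandas"]
    # "attributes" arrives as the string "[0.4, 0.5]", so decode it a second time
    attributes = [float(x) for x in json.loads(p["attributes"])]
    # "happy" arrives as the string "True"; compare the text instead of calling
    # bool(), because bool("False") is also True
    happy = p["happy"].strip().lower() == "true"
    # Nested tuple ordered as (attributes, happy, id, pt, zip)
    return (record["name"], (attributes, happy, p["id"], p["pt"], p["zip"]))

raw = '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'
parsed = parse_record(raw)
print(parsed)
# ('mission', ([0.4, 0.5], True, '1', 'giant', '94110'))
```

With a live Spark context, something like sqlContext.createDataFrame(json2.map(parse_record), schema) would then apply the schema to already-typed values instead of raw strings.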

Assuming you have JSON like this (with the types fixed):

{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}}

the correct schema would look like this:

from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, StringType, StructField, StructType)

StructType([
    StructField("name", StringType(), True),
    # pandas is a nested struct, not an array of structs
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)]),
    True)])
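
For completeness, here is a hedged sketch of applying such a schema to the corrected record from Python. It assumes a PySpark version whose DataFrameReader.json accepts an RDD of strings (the 2.x and 3.x readers document this), and it needs a live SparkSession to actually run. The function name read_fixed_json and the variable fixed are hypothetical.

```python
# The corrected record, with a real JSON array and a real JSON boolean
fixed = '{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}}'

def read_fixed_json(spark, schema):
    """Read the corrected record with an explicit schema.

    `spark` is an existing SparkSession and `schema` is the StructType
    for the record; both are supplied by the caller.
    """
    rdd = spark.sparkContext.parallelize([fixed])
    # Assumption: this reader accepts an RDD of JSON strings as its input
    return spark.read.json(rdd, schema=schema)
```

Calling read_fixed_json(spark, schema).show() should then display a single row. Because the schema is supplied, Spark skips type inference, but it still does not convert string-encoded values, which is why the input has to carry real array and boolean values.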
