Question

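When trying to create a DataFrame with Spark SQL by passing it a list of Rows, like so:

from pyspark.sql import Row

some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
             {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
spark.createDataFrame([Row(**d) for d in some_data]).printSchema()

the resulting DataFrame schema types the elements of some-column as maps rather than structs.

I would expect the schema to be an array of the appropriate structs, inferred through some Python reflection on the value types. Why is this not the case? Is there anything I can do here besides providing the schema explicitly?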

Answer 1

This happens because the structure you pass does not encode what you mean. As stated in the SQL guide, a Python dict is mapped to MapType.

To get structs, you should use nested Rows (namedtuples are preferred in general, but they require field names that are valid identifiers, which some-column is not):

from pyspark.sql import Row

Outer = Row("some-column")
Inner = Row("timestamp", "strVal")

spark.createDataFrame([
    Outer([Inner(1353534535353, 'some-string')]),
    Outer([Inner(1353534535354, 'another-string')])
]).printSchema()

root
 |-- some-column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- strVal: string (nullable = true)
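As a side note on the "valid name identifiers" remark, here is a minimal sketch using only the standard library (no Spark involved) of why a namedtuple fits the inner record but not the outer one. The variable names inner and outer_error are just for illustration.

```python
from collections import namedtuple

# "timestamp" and "strVal" are valid Python identifiers, so a namedtuple works.
Inner = namedtuple("Inner", ["timestamp", "strVal"])
inner = Inner(1353534535353, "some-string")

# "some-column" contains a hyphen, so namedtuple rejects it as a field name.
try:
    namedtuple("Outer", ["some-column"])
    outer_error = None
except ValueError as exc:
    outer_error = str(exc)

print(inner)
print(outer_error)
```

This is why the example above falls back to Row("some-column") for the outer level: Row accepts arbitrary strings as field names.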

With the current structure, you can get the desired schema by going through intermediate JSON:

import json

# sc is the SparkContext (spark.sparkContext outside the interactive shell)
spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root
 |-- some-column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- strVal: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
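To see what the JSON reader is actually given, here is a small sketch with the standard json module alone (Spark is not needed for this part). Each dict becomes a one-line JSON document, and, as the schema above shows, the nested JSON objects come back as structs. The name json_lines is only for illustration.

```python
import json

some_data = [
    {'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
    {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]},
]

# One JSON document per record, the same strings that map(json.dumps) produces.
json_lines = [json.dumps(d) for d in some_data]

for line in json_lines:
    print(line)
```

Note that in the printed schema above the struct fields appear as strVal, timestamp rather than in the original insertion order, so don't rely on field position after this round trip.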

Or with an explicit schema:

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)

schema = StructType([StructField(
    "some-column", ArrayType(StructType([
        StructField("timestamp", LongType()),
        StructField("strVal", StringType())])
))])

spark.createDataFrame(some_data, schema)

Note, however, that this last approach may not be entirely robust.

Spark SQL - createDataFrame wrong struct schema
