Question

I am trying to create a schema to validate GeoJSON files as they are loaded:

validSchema = StructType([
StructField("type", StringType()),
StructField("geometry", StructType([
  StructField("coordinates", ArrayType(DoubleType())), # POINT
  StructField("coordinates", ArrayType(ArrayType(ArrayType(DoubleType())))),  # POLYGON
  StructField("coordinates", ArrayType(ArrayType(DoubleType()))), # LINESTRING
  StructField("type", StringType(), False)
]), False),
StructField("properties", MapType(StringType(), StringType()))
])

df = spark.read.option("multiline","true").json(src_data,mode="PERMISSIVE",schema=validSchema)

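The problem is that I need three kinds of "coordinates" to cover the valid GeoJSON types. However, only the last rule is applied; I assume it takes precedence over the previous two because of the order.

Is there a way to specify a schema saying that one of the coordinates schemas must match?

The only way I can see right now is to create three schemas and run three imports, which means scanning all of the data three times (I have 5TB of data, so that seems crazy).

Sample GeoJSON data: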

{
  "type": "Feature",
  "properties": {},
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[ -0.144195556640625,52.019120643633386],
        [-0.127716064453125,52.00052411347729],
        [-0.10848999023437499,52.01193653675363],
        [-0.12359619140625,52.02883848153626],
        [-0.144195556640625,52.019120643633386]]]
  }
},
{
  "type": "Feature",
  "properties": {},
  "geometry": {
    "type": "LineString",
    "coordinates": [[-0.196380615234375,52.11283076186275],
      [-0.1263427734375,52.07739600418385]]
      }
},
{
  "type": "Feature",
  "properties": {},
  "geometry": {
    "type": "Point",
    "coordinates": [-0.1641082763671875, 52.06051241654061]
  }
}

Thanks

Answer 1

Is there a way to specify a schema saying that one of the coordinates schemas must match?

UserDefinedTypes are no longer supported. Even so, all values in a Column must have the same shape, so you cannot have array<array<array<double>>>, array<array<double>> and array<double> at the same time.

You can skip parsing the coordinates completely:

validSchema = StructType([
StructField("type", StringType()),
StructField("geometry", StructType([
  StructField("coordinates", StringType()),
  StructField("type", StringType(), False)
]), False),
StructField("properties", MapType(StringType(), StringType()))
])

and then use a udf to parse coordinates into three separate columns.
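The answer does not show the udf itself, so the following is only a minimal sketch of what it might look like, assuming the data was read with the schema where coordinates is a StringType (Spark then keeps the raw JSON text of the array). The names parse_coordinates and add_coordinate_columns, the output column name coords, and its fields point, linestring and polygon are all hypothetical and not part of the original answer.

```python
import json

# Order of the output slots returned by parse_coordinates (hypothetical layout).
SLOTS = ("Point", "LineString", "Polygon")


def _position(value):
    """Convert a JSON [x, y, ...] list into a list of floats, or raise ValueError."""
    if not isinstance(value, list) or len(value) < 2:
        raise ValueError("expected a list of at least two numbers")
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in value):
        raise ValueError("expected only numbers")
    # Cast to float: a Python int returned for a DoubleType field becomes null in Spark.
    return [float(x) for x in value]


_PARSERS = {
    "Point": _position,
    "LineString": lambda v: [_position(p) for p in v],
    "Polygon": lambda v: [[_position(p) for p in ring] for ring in v],
}


def parse_coordinates(geometry_type, raw):
    """Return (point, linestring, polygon); at most one slot is filled."""
    result = [None, None, None]
    parser = _PARSERS.get(geometry_type)
    if parser is None or raw is None:
        return tuple(result)
    try:
        result[SLOTS.index(geometry_type)] = parser(json.loads(raw))
    except (ValueError, TypeError):
        pass  # malformed or mismatched coordinates: leave every slot empty
    return tuple(result)


def add_coordinate_columns(df):
    """Add a struct column 'coords' with point / linestring / polygon fields."""
    # Imported here so the parsing helper above stays usable without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

    position = ArrayType(DoubleType())
    parsed_type = StructType([
        StructField("point", position),
        StructField("linestring", ArrayType(position)),
        StructField("polygon", ArrayType(ArrayType(position))),
    ])
    parse = udf(parse_coordinates, parsed_type)
    return df.withColumn(
        "coords", parse(col("geometry.type"), col("geometry.coordinates"))
    )
```

With this sketch, add_coordinate_columns(df) needs only a single pass over the data, and df.select("*", "coords.*") would expand the struct into three top-level columns. Rows whose three fields are all null are the ones whose coordinates did not match the declared geometry type.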

Loading geoJSON in pyspark with schema validation
