Question

I have been referring to the following post:

Spark cast column to sql type stored in string

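I am looking for the equivalent code in PySpark.

The problem is that the answer in that post uses classOf[DataTypes], but PySpark has no DataTypes class.

What I want to do is create the schema dynamically. I have a list like this: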

>>> sourceToHiveTypeList
['TimestampType', 'TimestampType', 'StringType', 'StringType', 'IntegerType', 'DoubleType']

I defined a function:

def TableASchema(columnName, columnType): 
   return StructType([
       StructField(columnName[0], getattr(pyspark.sql.types,columnType[0]), nullable = True),
       StructField(columnName[1], getattr(pyspark.sql.types,columnType[1]), nullable = True),
       StructField(columnName[2], getattr(pyspark.sql.types,columnType[2]), nullable = True),
       StructField(columnName[3], getattr(pyspark.sql.types,columnType[3]), nullable = True),
       StructField(columnName[4], getattr(pyspark.sql.types,columnType[4]), nullable = True),
       StructField(columnName[5], getattr(pyspark.sql.types,columnType[5]), nullable = True)
      ])

When I call the function above, I get an error:

>>> schema = TableASchema(headerColumns, sourceToHiveTypeList)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in TableASchema
AttributeError: 'module' object has no attribute 'TimestampType()'

Answer 1

If you are looking for a solution that works only for atomic types (the same as the one in the linked question):

import pyspark.sql.types

def type_for_name(s):
    return getattr(pyspark.sql.types, s)()

type_for_name("StringType")
# StringType

Complex types can be parsed with eval, but given the security implications I would be very careful:

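def type_for_name_(s):
    # Names from pyspark.sql.types made available to the evaluated expression
    types = {
        t: getattr(pyspark.sql.types, t)
        for t in dir(pyspark.sql.types) if t.endswith("Type")}
    t = eval(s, types, {})
    # A bare name such as "StringType" evaluates to the class, so instantiate it
    return t if isinstance(t, pyspark.sql.types.DataType) else t()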

type_for_name_("DecimalType(10, 2)")
# DecimalType(10,2)

In general, I would recommend using the short type strings (i.e. string, double, struct<x:integer,y:integer>), which can be used directly:

col("foo").cast("integer")

If you need a more complex representation, use JSON.

Answer 2

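import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.types.DataType

// Assumption: runtimeMirror was not defined in the original snippet
val runtimeMirror = ru.runtimeMirror(getClass.getClassLoader)

def toDataType(dataType: String): DataType = {
  val module = runtimeMirror.staticModule("org.apache.spark.sql.types." + dataType)
  runtimeMirror.reflectModule(module).instance.asInstanceOf[DataType]
}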
