我对pyspark很陌生,这个问题令我感到困惑。基本上,我正在寻找一种通过structType或ArrayType进行类型转换的可缩放方法。
我的数据模式示例:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: integer (nullable = true)
| |-- rate_2: integer (nullable = true)
| |-- rate_3: integer (nullable = true)
| |-- card_fee: integer (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: integer (nullable = true)
| | |-- rate_2: integer (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
如您在此处看到的,card_rates
是结构体,而online_rates
是结构体数组。我正在寻找遍历以上所有字段并有条件地进行类型转换的方法。理想情况下,如果应该是数字,则应将其转换为双精度型;如果应该是字符串,则应将其转换为字符串型。我需要循环,因为这些rate_*
字段可能会随着时间增长。
但是现在,我对能够循环播放它们并将它们全部转换为字符串表示满意,因为我对pyspark非常陌生,并且仍在尝试中。
我想要的输出模式:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: double (nullable = true)
| |-- rate_2: double (nullable = true)
| |-- rate_3: double (nullable = true)
| |-- card_fee: double (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: double (nullable = true)
| | |-- rate_2: double (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
我正在提出如何执行此操作的想法。
我从这里获得了参考:PySpark convert struct field inside array to string
但是此解决方案对字段进行硬编码,并且不会真正在字段上循环。
请帮助。
答案 0 :(得分:2)
这是借助StructType.simpleString
和_parse_datatype_string
内置函数的一种解决方案:
from pyspark.sql.types import *
df_schema = StructType([
StructField("_id", StringType(), True),
StructField("created", TimestampType(), True),
StructField("card_rates", StructType([
StructField("rate_1", IntegerType(), True),
StructField("rate_2", IntegerType(), True),
StructField("rate_3", IntegerType(), True),
StructField("card_fee", IntegerType(), True),
StructField("card_fee", IntegerType(), True)])),
StructField("online_rates", ArrayType(
StructType(
[
StructField("rate_1", IntegerType(),True),
StructField("rate_2", IntegerType(),True),
StructField("online_fee", DoubleType(),True)
]),True),True),
StructField("updated", TimestampType(), True)])
schema_str = df_schema.simpleString() # this gives -> struct<_id:string,created:timestamp,card_rates:struct<rate_1:int,rate_2:int,rate_3:int,card_fee:int, card_fee:int>,online_rates:array<struct<rate_1:int,rate_2:int,online_fee:double>>,updated:timestamp>
double_schema = schema_str.replace(':int', ':double')
# convert back to StructType
final_schema = _parse_datatype_string(double_schema)
final_schema
schema.simpleString
将您的模式转换为简单的字符串:int
替换为:double
_parse_datatype_string
将修改后的字符串模式转换为StructType 更新:
为了避免@jxc指出的反引号问题,更好的解决方案是对元素进行递归扫描,如下所示:
def transform_schema(schema):
if schema == None:
return StructType()
updated = []
for f in schema.fields:
if isinstance(f.dataType, IntegerType):
# if IntegerType convert to DoubleType
updated.append(StructField(f.name, DoubleType(), f.nullable))
elif isinstance(f.dataType, ArrayType):
# if ArrayType unpack the array type(elementType), do recursion then wrap results with ArrayType
updated.append(StructField(f.name, ArrayType(transform_schema(f.dataType.elementType))))
elif isinstance(f.dataType, StructType):
# if StructType do recursion
updated.append(StructField(f.name, transform_schema(f.dataType)))
else:
# else handle all the other cases i.e TimestampType, StringType etc
updated.append(StructField(f.name, f.dataType, f.nullable))
return StructType(updated)
# call the function with your schema
transform_schema(df_schema)
说明:该函数遍历模式中的每个项目(StructType),并尝试将int字段(StructField)转换为双精度型。最后,将转换后的架构(StructType)传递到上一层(父StructType)。
输出:
StructType(List(
StructField(_id,StringType,true),
StructField(created,TimestampType,true),
StructField(card_rates,
StructType(List(StructField(rate_1,DoubleType,true),
StructField(rate_2,DoubleType,true),
StructField(rate_3,DoubleType,true),
StructField(card_fee,DoubleType,true),
StructField(card_fee,DoubleType,true))),true),
StructField(online_rates,ArrayType(
StructType(List(
StructField(rate_1,DoubleType,true),
StructField(rate_2,DoubleType,true),
StructField(online_fee,DoubleType,true))),true),true),
StructField(updated,TimestampType,true)))