I am trying to explode an array in PySpark. Here is the code:
from pyspark.sql.functions import col, explode, arrays_zip
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).printSchema()
Output:
root
|-- _name: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
Selecting the "_name" column works:
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).select(col("_name")).show(50,False)
But trying to access the "0" or "1" column in the same way does not work. Error:
File "/usr/local/spark/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1614.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_696#696
Is it possible to rename the columns "0" and "1", or to extract them by selecting from the DataFrame?
Answer 0 (score: 1)
Try casting the col column to struct<cola:string,colb:string>. You can choose your own field names in the struct; for example, I have used cola and colb. See the code below.
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col").cast("struct<cola:string,colb:string>")).select(col("_name"),col("col.cola"),col("col.colb")).printSchema()
root
|-- _name: string (nullable = true)
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
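When more than two arrays are zipped, writing the struct type string by hand gets tedious. The sketch below builds it from a list of names. The helper struct_cast_type is hypothetical (it is not part of PySpark), and it assumes every field has the same type and that the names are plain identifiers that need no quoting.

```python
def struct_cast_type(field_names, field_type="string"):
    """Build a type string such as 'struct<cola:string,colb:string>'.

    Hypothetical helper, not a PySpark API. Assumes all fields share one
    type and that the names are simple identifiers.
    """
    # Join "name:type" pairs in the order given.
    fields = ",".join(f"{name}:{field_type}" for name in field_names)
    return f"struct<{fields}>"


cast_type = struct_cast_type(["cola", "colb"])
```

The resulting string can be passed to the cast shown above, for example col("col").cast(cast_type). The names should be listed in the same order as the arrays given to arrays_zip, since the cast is expected to map fields by position.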
You can also use withColumnRenamed:
# Outer parentheses let the method chain span several lines.
(df_map_transformation
    .select(col("_name"), explode(
        arrays_zip(
            col("instances.Instance._name"),
            col("instances.Instance._id"))
    ))
    .select(col("_name"), col("col.*"))
    .withColumnRenamed("0", "cola")
    .withColumnRenamed("1", "colb"))
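If there are many positional columns, the repeated withColumnRenamed calls can be folded into a loop. This is only a sketch: rename_positional_columns is a hypothetical helper name, and it assumes the columns to rename are called "0", "1", "2", and so on, as in the schema from the question.

```python
def rename_positional_columns(df, new_names):
    """Rename columns "0", "1", ... to the given names, in order.

    Hypothetical helper; `df` is assumed to be any object exposing
    withColumnRenamed(old, new), such as a PySpark DataFrame.
    """
    for index, new_name in enumerate(new_names):
        # Each call returns a new DataFrame, so reassign on every step.
        df = df.withColumnRenamed(str(index), new_name)
    return df
```

For the question's data this would be called as rename_positional_columns(exploded_df, ["cola", "colb"]), where exploded_df stands for the result of the second select above.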
Answer 1 (score: 0)
Use explode('col_name').alias('new_name')