I want to break an array of structs out into columns (one column per struct field). For example, this schema:
root
 |-- news_style_super: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- sbox_ctr: double (nullable = true)
 |    |    |    |-- wise_ctr: double (nullable = true)
should be converted into:
|-- name: string (nullable = true)
|-- sbox_ctr: double (nullable = true)
|-- wise_ctr: double (nullable = true)
How can I do this?
Answer 0 (score: 0)
When I do the following, it does not work:
dff = df1.select("context.content_feature.news_style_super")
print dff.printSchema()
df2 = dff.select(explode("name").alias("tmp")).select("tmp.*")
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1808705980431719035.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1808705980431719035.py", line 355, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 141, in <module>
  File "/home/work/lxc/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/dataframe.py", in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/work/lxc/interpreter/spark/pyspark/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/work/lxc/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve 'name' given input columns: [news_style_super];;\n'Project [explode('name) AS tmp#1559]\n+- Project [context#1411.content_feature.news_style_super AS news_style_super#1556]\n+- GlobalLimit 1\n+- LocalLimit 1\n+- Relation[context#1411,gr_context#1412,request_feature#1413,sequence_feature#1414,session_feature#1415,sv_session_feature#1416,user_feature#1417,user_recommend_feature#1418,vertical_user_feature#1419] json\n"
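The AnalysisException occurs because, after the first select, the only top-level column is news_style_super; name only exists inside the nested structs, so explode("name") has nothing to resolve. A minimal sketch of one way to flatten this particular schema (assuming the same df1 and the array-of-array-of-struct layout shown in the question) is to explode the outer array, then the inner array, and only then expand the struct fields:

from pyspark.sql.functions import explode

dff = df1.select("context.content_feature.news_style_super")
df2 = (dff
       .select(explode("news_style_super").alias("inner"))  # outer array -> array of structs
       .select(explode("inner").alias("tmp"))               # inner array -> one struct per row
       .select("tmp.*"))                                    # struct fields -> name, sbox_ctr, wise_ctr
df2.printSchema()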
Answer 1 (score: 0)
from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, StructType, StringType, LongType, DoubleType

def get_final_dataframe(pathname, df):
    cur_names = pathname.split(".")
    if len(cur_names) > 1:
        root_name = cur_names[0]
        delimiter = "."
        new_path_name = delimiter.join(cur_names[1:len(cur_names)])
        for field in df.schema.fields:
            if field.name == root_name:
                if type(field.dataType) == ArrayType:
                    # Explode the array and retry with the same path; the exploded
                    # column keeps the root name, so nested arrays unwrap level by level.
                    return get_final_dataframe(pathname, df.select(explode(root_name).alias(root_name)))
                elif type(field.dataType) == StructType:
                    if hasColumn(df, delimiter.join(cur_names[0:2])):
                        # Step one level down into the struct and drop the root from the path.
                        return get_final_dataframe(new_path_name, df.select(delimiter.join(cur_names[0:2])))
                    else:
                        return -1, -1
                else:
                    return -1, -1
    else:
        # Only the leaf field is left: return the DataFrame with a coarse type label.
        root_name = cur_names[0]
        for field in df.schema.fields:
            if field.name == root_name:
                if type(field.dataType) == StringType:
                    return df, "string"
                elif type(field.dataType) == LongType:
                    return df, "numeric"
                elif type(field.dataType) == DoubleType:
                    return df, "numeric"
                else:
                    return df, -1
    return -1, -1
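The hasColumn helper used above is not shown in the answer. One possible (hypothetical) implementation is simply to try selecting the dotted path and catch the analysis error:

from pyspark.sql.utils import AnalysisException

def hasColumn(df, path):
    # Hypothetical helper: True if the dotted path can be selected from df.
    try:
        df.select(path)
        return True
    except AnalysisException:
        return False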
Then you can do:
key = "a.b.c.name"
# key = "context.content_feature.tag.name"
df2, field_type = get_final_dataframe(key, df1)
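The function returns the flattened DataFrame together with a coarse type label ("string" or "numeric"), and (-1, -1) when the dotted path cannot be resolved, so it is worth checking for failure before using the result. For the schema in the question, a call might look like this (hypothetical usage, assuming the df1 from above):

df2, field_type = get_final_dataframe("context.content_feature.news_style_super.name", df1)
if field_type == -1:
    print("path could not be resolved or has an unsupported leaf type")
else:
    df2.printSchema()  # expected: a single name column of type string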