Exploding an array of structs into columns in PySpark

Date: 2019-10-12 07:23:16

Tags: pyspark explode

I want to explode an array of structs into columns (defined by the struct fields). For example:

    root
     |-- news_style_super: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- name: string (nullable = true)
     |    |    |    |-- sbox_ctr: double (nullable = true)
     |    |    |    |-- wise_ctr: double (nullable = true)

should be converted into

    |-- name: string (nullable = true)
    |-- sbox_ctr: double (nullable = true)
    |-- wise_ctr: double (nullable = true)

How can I do this?
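For illustration, a minimal sketch of how a DataFrame with this shape could be constructed for experimenting (the session setup and sample values are assumptions, not part of the question):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (ArrayType, DoubleType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    element = StructType([
        StructField("name", StringType()),
        StructField("sbox_ctr", DoubleType()),
        StructField("wise_ctr", DoubleType()),
    ])
    schema = StructType([
        StructField("news_style_super", ArrayType(ArrayType(element))),
    ])

    # One row whose single column holds an array of arrays of structs.
    df = spark.createDataFrame(
        [([[("a", 0.1, 0.2), ("b", 0.3, 0.4)]],)],
        schema,
    )
    df.printSchema()  # matches the root schema shown above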

2 answers:

Answer 0 (score: 0)

When I do the following, it does not work:

    dff = df1.select("context.content_feature.news_style_super")
    dff.printSchema()
    df2 = dff.select(explode("name").alias("tmp")).select("tmp.*")

    Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-1808705980431719035.py", line 367, in <module>
        raise Exception(traceback.format_exc())
    Exception: Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-1808705980431719035.py", line 355, in <module>
        exec(code, _zcUserQueryNameSpace)
      File "<stdin>", line 141, in <module>
      File "/home/work/lxc/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/dataframe.py", in select
        jdf = self._jdf.select(self._jcols(*cols))
      File "/home/work/lxc/interpreter/spark/pyspark/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/home/work/lxc/interpreter/spark/pyspark/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
        raise AnalysisException(s.split(': ', 1)[1], stackTrace)
    AnalysisException: u"cannot resolve 'name' given input columns: [news_style_super];;
    'Project [explode('name) AS tmp#1559]
    +- Project [context#1411.content_feature.news_style_super AS news_style_super#1556]
    +- GlobalLimit 1
    +- LocalLimit 1
    +- Relation[context#1411,gr_context#1412,request_feature#1413,sequence_feature#1414,session_feature#1415,sv_session_feature#1416,user_feature#1417,user_recommend_feature#1418,vertical_user_feature#1419] json
    "
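The error is expected: after the first select, the only top-level column is news_style_super, and name is still nested two array levels down, so explode("name") cannot be resolved. A minimal sketch of a fix, assuming the schema from the question (the alias names inner and tmp are illustrative):

    from pyspark.sql.functions import explode

    dff = df1.select("context.content_feature.news_style_super")

    # news_style_super is array<array<struct>>: explode the outer array,
    # then the inner array, and only then expand the struct fields.
    df2 = (dff
           .select(explode("news_style_super").alias("inner"))  # -> array<struct>
           .select(explode("inner").alias("tmp"))               # -> struct
           .select("tmp.*"))                                    # -> name, sbox_ctr, wise_ctr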

Answer 1 (score: 0)

    from pyspark.sql.functions import explode
    from pyspark.sql.types import (ArrayType, DoubleType, LongType,
                                   StringType, StructType)
    from pyspark.sql.utils import AnalysisException

    def hasColumn(df, col):
        # hasColumn is used below but was not defined in the answer;
        # this is one common implementation: True if the (possibly
        # nested) column path can be resolved against df.
        try:
            df[col]
            return True
        except AnalysisException:
            return False

    def get_final_dataframe(pathname, df):
        # Walk a dotted path, exploding array levels and descending into
        # structs, until only the leaf field remains.
        cur_names = pathname.split(".")
        if len(cur_names) > 1:
            root_name = cur_names[0]
            delimiter = "."
            new_path_name = delimiter.join(cur_names[1:])

            for field in df.schema.fields:
                if field.name == root_name:
                    if type(field.dataType) == ArrayType:
                        # Unwrap one array level, then retry the same path.
                        return get_final_dataframe(
                            pathname,
                            df.select(explode(root_name).alias(root_name)))
                    elif type(field.dataType) == StructType:
                        if hasColumn(df, delimiter.join(cur_names[0:2])):
                            # Descend one struct level.
                            return get_final_dataframe(
                                new_path_name,
                                df.select(delimiter.join(cur_names[0:2])))
                        else:
                            return -1, -1
                    else:
                        return -1, -1
        else:
            root_name = cur_names[0]
            for field in df.schema.fields:
                if field.name == root_name:
                    if type(field.dataType) == StringType:
                        return df, "string"
                    elif type(field.dataType) in (LongType, DoubleType):
                        return df, "numeric"
                    else:
                        return df, -1
        return -1, -1

Then you can call it like this:

    key = "a.b.c.name"
    # key = "context.content_feature.tag.name"
    df2, field_type = get_final_dataframe(key, df1)
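With the schema from the question, a key like "context.content_feature.news_style_super.name" (assuming df1 carries that full nesting) would descend through the context and content_feature structs, explode both array levels of news_style_super, and return a single-column DataFrame of name values together with the type tag "string". A path that cannot be resolved returns -1, -1.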