Recursively flatten children and return the full schema (PySpark)

Date: 2020-08-22 15:35:30

Tags: pyspark nested

I have a JSON file that contains nested collections with the same name under the attribute "Tag". The number of these nested levels varies. For example:

{ 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}

I am trying to recursively flatten the nested children so that they follow this schema, and to get the final output in the form below.

root
 |-- Id: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Tag: struct (nullable = true)
 |    |-- Tag.Id: string (nullable = true)
 |    |-- Tag.Location: struct (nullable = true)
 |    |    |-- Location.LocName: string (nullable = true)
 |    |    |-- Location.LocCode: string (nullable = true)
 |    |-- Tag.displayName: string (nullable = true)
 |-- version: string (nullable = true)


+---+----+------+--------------------+--------------------+---------------+-------+
| Id|Type|Tag_Id|Tag_Location_LocName|Tag_Location_LocCode|Tag_displayName|version|
+---+----+------+--------------------+--------------------+---------------+-------+
|001|Work|  a123|               Astro|                 AST|             Al|      2|
|001|Work|  e789|              Cosmos|                 COS|            Tom|      2|
+---+----+------+--------------------+--------------------+---------------+-------+

So far, we have managed to flatten the first level of nesting using explode, and we are stuck on the recursive part (and on joining the flattened children back with the remaining attributes as new rows). Could someone share an approach to accomplish this?
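
For context, a minimal sketch of that first-level explode (assuming the JSON above was read into a DataFrame named df; the path and column choices here are illustrative, not the asker's actual code):

from pyspark.sql.functions import col, explode_outer

# Hypothetical read; adjust the path (multiLine handles pretty-printed JSON)
df = spark.read.json("/path/to/input.json", multiLine=True)

# First level only: one output row per element of the Tag array
first_level = df.select(
    "Id", "Type", "version",
    explode_outer("Tag").alias("Tag")
).select(
    "Id", "Type", "version",
    col("Tag.Id").alias("Tag_Id"),
    col("Tag.displayName").alias("Tag_displayName"),
    col("Tag.Location").alias("Tag_Location")  # still an array, still nested
)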

2 Answers:

Answer 0 (score: 1):

There is currently no Spark built-in function that does this. However, below is an approach I created for this situation. One assumption in this code: the dictionary being processed is small enough to be read into memory in one pass.

from pyspark.sql import Row

inputs = { 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}

# Need to get all possible column names beforehand
# This is so we can avoid schema conflicts
def get_column_map(input_dict, columns=None, key_stack=None):
  # Avoid mutable default arguments, which persist across calls
  columns = [] if columns is None else columns
  key_stack = [] if key_stack is None else key_stack
  for k, v in input_dict.items():
    if type(v) is list:
      key_stack.append(k)
      for list_item in v:
        get_column_map(list_item, columns, key_stack)
      key_stack.pop()
    elif type(v) is dict:
      key_stack.append(k)
      get_column_map(v, columns, key_stack)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      columns.append(column_name)
  # Deduplicate and initialise every column to None
  mapper = {}
  for item in set(columns):
    mapper[item] = None
  return mapper
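
# Sketch: for the sample `inputs` above, the mapper should come back with
# one key per flattened column, each initialised to None (key order may
# vary, since the names pass through a set):
#
#   get_column_map(inputs)
#   => {'Id': None, 'Type': None, 'Tag_Id': None,
#       'Tag_Location_LocName': None, 'Tag_Location_LocCode': None,
#       'Tag_displayName': None, 'version': None}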

# After knowing the column names, I can populate them
# One trick: process all non-dict/list items first,
# so you can easily append when you are at the last child in the nest
def process_map(input_dict, column_dict, key_stack=None, rows=None):
  # Avoid mutable default arguments, which persist across calls
  key_stack = [] if key_stack is None else key_stack
  rows = [] if rows is None else rows

  # Scalars sort first (1), lists/dicts last (0)
  def order_dict(x):
    if type(x[1]) is not list and type(x[1]) is not dict:
      return 1
    else:
      return 0

  input_dict = sorted(input_dict.items(), key=order_dict, reverse=True)

  last_child = True
  for k, v in input_dict:
    if type(v) is list:
      last_child = False
      key_stack.append(k)
      for list_item in v:
        process_map(list_item, column_dict, key_stack, rows)
      key_stack.pop()
    elif type(v) is dict:
      last_child = False
      key_stack.append(k)
      process_map(v, column_dict, key_stack, rows)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      column_dict[column_name] = v
  # Only a leaf-level dict (no nested containers) emits a row
  if last_child:
    rows.append(Row(**column_dict))
  return rows


# Can put this in a main() or just leave it at the bottom in a functional style
mapper = get_column_map(inputs)
rows = process_map(inputs, mapper)
final_df = spark.createDataFrame(rows)

By running this code in my environment, I got the table/result shown in the question.
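
One caveat worth noting: process_map handles a single top-level dict. If the source file holds many such records, say one JSON object per line, a sketch of applying the same functions per record (the path and line-delimited layout are assumptions on my part):

import json

all_rows = []
with open("/path/to/input.jsonl") as f:  # hypothetical line-delimited file
    for line in f:
        record = json.loads(line)
        mapper = get_column_map(record)  # fresh column map per record
        all_rows.extend(process_map(record, mapper))

# Assumes every record yields the same column set; otherwise the
# schemas inferred from the Rows will conflict.
final_df = spark.createDataFrame(all_rows)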

Answer 1 (score: 0):

Given that you have already declared a Spark DataFrame, we can use it to flatten your schema. You can do it in two steps:

  1. Explode the arrays
  2. Flatten the structs

from pyspark.sql.types import StructType, StructField, ArrayType
from pyspark.sql.functions import explode_outer


def flatten(df):
    """
    Create a new dataframe with a flat schema by repeatedly
    exploding the arrays and then flattening the structs.
    """
    f_df = df
    select_expr = _explodeArrays(element=f_df.schema)
    # While there is at least one Array, explode.
    while "ArrayType(" in f"{f_df.schema}":
        f_df = f_df.selectExpr(select_expr)
        select_expr = _explodeArrays(element=f_df.schema)
    # Flatten the structure
    select_expr = flattenExpr(f_df.schema)
    f_df = f_df.selectExpr(select_expr)
    return f_df
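
# Why the while loop: each pass removes one level of arrays. Traced against
# the sample data (schemas shown further below), the loop runs twice:
#   Pass 1: explode_outer(Tag)          -> Tag becomes a struct column
#   Pass 2: explode_outer(Tag.Location) -> Tag_Location becomes a struct column
# It exits once no "ArrayType(" remains in the schema string, and a final
# flattenExpr select renames Tag_Location.* to Tag_Location_*.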


def _explodeArrays(element, root=None):
    """
    Build the select expressions that explode arrays into new rows;
    it only explodes one level of arrays per pass.
    Note that Spark allows only one generator (explode) per select
    clause, so this assumes at most one array per nesting level,
    as in the sample data.
    """
    el_type = type(element)
    expr = []
    try:
        _path = f"{root+'.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            res = _explodeArrays(t, root)
            expr.extend(res)
    elif el_type == StructField and type(element.dataType) == ArrayType:
        expr.append(f"explode_outer({_path}) as {_path.replace('.','_')}")
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(_explodeArrays(element.dataType, _path))
    else:
        expr.append(f"{_path}  as {_path.replace('.','_')}")
    return expr
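
# Sketch: traced against the question's schema, the first pass should
# generate these expressions (spark.read.json orders fields alphabetically):
#   ['Id as Id', 'explode_outer(Tag) as Tag', 'Type as Type', 'version as version']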


def flattenExpr(element, root=None):
    """
    Build the select expressions that flatten the structs of a
    dataframe (joining level names with '_').
    It doesn't work well with arrays, so ensure there are
    no arrays left in the input schema.
    """
    expr = []
    el_type = type(element)
    try:
        _path = f"{root+'.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            expr.extend(flattenExpr(t, root))
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(flattenExpr(element.dataType, _path))
    elif el_type == StructField and type(element.dataType) == ArrayType:
        # You should use flattenArrays to be sure this will not happen
        expr.extend(flattenExpr(element.dataType.elementType, f"{_path}[0]"))
    else:
        expr.append(f"{_path} as {_path.replace('.','_')}")
    return expr
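
Once the arrays are gone, flattenExpr produces the final aliases. Traced against the sample (the intermediate frame after the explode passes, where Tag_Location is a plain struct), it should return something like:

['Id as Id', 'Tag_Id as Tag_Id',
 'Tag_Location.LocCode as Tag_Location_LocCode',
 'Tag_Location.LocName as Tag_Location_LocName',
 'Tag_displayName as Tag_displayName', 'Type as Type', 'version as version']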

So we can do something like this:

json_test = spark.read.json(sc.parallelize(["""{ "Id" : "001", "Type" : "Work", "Tag" : [ { "Id" : "a123", "Location" : [ { "LocName" : "Astro", "LocCode" : "AST" } ], "displayName" : "Al" }, { "Id" : "e789", "Location" : [ { "LocName" : "Cosmos", "LocCode" : "COS" } ], "displayName" : "Tom" } ], "version" : 2 }"""]))
json_test.printSchema()
f_df = flatten(json_test)
f_df.printSchema()
f_df.show()

So you get the original schema:

root
 |-- Id: string (nullable = true)
 |-- Tag: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Location: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- LocCode: string (nullable = true)
 |    |    |    |    |-- LocName: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)

The new schema:

root
 |-- Id: string (nullable = true)
 |-- Tag_Id: string (nullable = true)
 |-- Tag_Location_LocCode: string (nullable = true)
 |-- Tag_Location_LocName: string (nullable = true)
 |-- Tag_displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)

And the DataFrame:

+---+------+--------------------+--------------------+---------------+----+-------+
| Id|Tag_Id|Tag_Location_LocCode|Tag_Location_LocName|Tag_displayName|Type|version|
+---+------+--------------------+--------------------+---------------+----+-------+
|001|  a123|                 AST|               Astro|             Al|Work|      2|
|001|  e789|                 COS|              Cosmos|            Tom|Work|      2|
+---+------+--------------------+--------------------+---------------+----+-------+

I hope this helps you think through a solution.