Question

Given a schema like:

root
|-- first_name: string
|-- last_name: string
|-- degrees: array
|    |-- element: struct
|    |    |-- school: string
|    |    |-- advisors: struct
|    |    |    |-- advisor1: string
|    |    |    |-- advisor2: string

How can I get a schema like:

root
|-- first_name: string
|-- last_name: string
|-- degrees: array
|    |-- element: struct
|    |    |-- school: string
|    |    |-- advisor1: string
|    |    |-- advisor2: string

Answer 1

You can use a udf to change the data type of nested columns in a DataFrame. Suppose you have already read the DataFrame as df1:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

def foo(data):
    # Map each array element to a flat (school, advisor1, advisor2) tuple.
    if data is None:
        return None
    return [(x["school"], x["advisors"]["advisor1"], x["advisors"]["advisor2"])
            for x in data]

# Return type of the udf: an array of flat structs.
struct = ArrayType(StructType([StructField("school", StringType()),
                               StructField("advisor1", StringType()),
                               StructField("advisor2", StringType())]))
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()

Output:

root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
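
To see what the function passed to udf does to a single row, here is a minimal sketch in plain Python that needs no Spark session. The name flatten_degrees and the sample record are hypothetical; plain dicts stand in for Spark Row objects, which support the same key lookup.

```python
# Plain-Python sketch of the per-row mapping; dicts stand in for Spark Rows.
def flatten_degrees(degrees):
    """Lift advisor1/advisor2 out of the nested 'advisors' struct of each element."""
    if degrees is None:
        return None
    return [
        (d["school"], d["advisors"]["advisor1"], d["advisors"]["advisor2"])
        for d in degrees
    ]

# Hypothetical value of the "degrees" column for one row.
sample_degrees = [
    {"school": "School A", "advisors": {"advisor1": "Alice", "advisor2": "Bob"}},
    {"school": "School B", "advisors": {"advisor1": "Carol", "advisor2": None}},
]

flattened = flatten_degrees(sample_degrees)
print(flattened)
```

Each tuple lines up positionally with the school, advisor1 and advisor2 fields of the StructType given as the udf return type.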

Answer 2

Here is a more generic solution that can flatten multiple nested struct layers:

from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    # Expands top-level struct columns `layers` times.
    # Array columns (including arrays of structs) are left untouched.
    flat_df = nested_df
    for _ in range(layers):
        # dtypes yields (name, type_string) pairs; struct types start with 'struct'.
        flat_cols = [name for name, dtype in flat_df.dtypes
                     if not dtype.startswith('struct')]
        nested_cols = [name for name, dtype in flat_df.dtypes
                       if dtype.startswith('struct')]
        # Promote each struct field to a top-level column named parent_child.
        flat_df = flat_df.select(flat_cols +
                                 [col(nc + '.' + c).alias(nc + '_' + c)
                                  for nc in nested_cols
                                  for c in flat_df.select(nc + '.*').columns])
    return flat_df

Call it with:

my_flattened_df = flatten_df(my_df_having_structs, 3)

(The second parameter is the number of layers to flatten; in my case it is 3.)
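
The effect of the layers argument may be easier to see on a plain Python dict than on a DataFrame. The sketch below is only an analogue of flatten_df, not Spark code: flatten_layer, flatten_record and the sample record are hypothetical names, nested dicts stand in for struct columns, and column ordering is not modelled.

```python
# Plain-Python analogue of flatten_df; a nested dict stands in for one row.
def flatten_layer(record):
    """Expand each top-level dict value into parent_child keys (one layer only)."""
    flat = {}
    for name, value in record.items():
        if isinstance(value, dict):
            for child, child_value in value.items():
                flat[name + "_" + child] = child_value
        else:
            flat[name] = value
    return flat

def flatten_record(record, layers):
    """Apply flatten_layer `layers` times, one pass per nesting level."""
    for _ in range(layers):
        record = flatten_layer(record)
    return record

# Hypothetical record with two levels of nesting.
row = {"id": 1, "name": {"first": "Ada", "meta": {"source": "web"}}}

one_pass = flatten_record(row, 1)
two_passes = flatten_record(row, 2)
print(sorted(one_pass))
print(sorted(two_passes))
```

After one pass the inner meta value is still nested under name_meta; the second pass is what produces name_meta_source, which is why the layer count has to match the nesting depth.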

Flatten Nested Struct in PySpark Array
