How do I get the following schema:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisors: struct
| | | |-- advisor1: string
| | | |-- advisor2: string
Currently, I do this by selecting

root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisor1: string
| | |-- advisor2: string

and then exploding the array, flattening the struct (advisor.*), grouping by first_name, last_name, and rebuilding the array. I am hoping there is a cleaner/shorter way to do this; right now there is a lot of painful renaming of fields and other things I would rather not get into. Thanks!
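For reference, the explode / group-by round trip described above might look roughly like this (assuming the starting DataFrame is named df and has the flat layout shown; the exact functions used here are illustrative, not the asker's code):

from pyspark.sql import functions as F

# Explode degrees so each degree becomes its own row with flat columns
exploded = (df
            .withColumn("degree", F.explode("degrees"))
            .select("first_name", "last_name",
                    F.col("degree.school").alias("school"),
                    F.col("degree.advisor1").alias("advisor1"),
                    F.col("degree.advisor2").alias("advisor2")))

# Regroup per person and rebuild the array, nesting the advisor columns into a struct
nested = (exploded
          .groupBy("first_name", "last_name")
          .agg(F.collect_list(
                   F.struct(F.col("school"),
                            F.struct("advisor1", "advisor2").alias("advisors"))
               ).alias("degrees")))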
Answer 0 (score: 1)
You can use a udf to change the datatype of a nested column in a DataFrame. Assuming you have already read the DataFrame as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def foo(data):
    # For each element of the degrees array, pull advisor1/advisor2 out of the
    # nested advisors struct so the element becomes a flat (school, advisor1, advisor2) tuple
    return list(map(lambda x: (x["school"],
                               x["advisors"]["advisor1"],
                               x["advisors"]["advisor2"]), data))

# Return type of the udf: an array of flat structs
struct = ArrayType(StructType([StructField("school", StringType()),
                               StructField("advisor1", StringType()),
                               StructField("advisor2", StringType())]))

udf_foo = udf(foo, struct)
df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:
root
|-- degrees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- school: string (nullable = true)
| | |-- advisor1: string (nullable = true)
| | |-- advisor2: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
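For a quick end-to-end test, df1 could be built from a small hand-made dataset matching the nested input schema (the sample values and session setup below are only illustrative):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# One person with a single degree, using the nested advisors struct
df1 = spark.createDataFrame([
    Row(first_name="Jane", last_name="Doe",
        degrees=[Row(school="MIT",
                     advisors=Row(advisor1="Alice", advisor2="Bob"))])
])
df1.printSchema()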
Answer 1 (score: 0)
Here is a more generic solution that can flatten multiple layers of nested structs:
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    # First pass: separate plain columns from struct columns and expand the structs
    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )

    # Each additional layer expands the structs produced by the previous pass
    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )

    return flat_df[-1]
Call it like this:
my_flattened_df = flatten_df(my_df_having_structs, 3)
(The second argument is the number of layers to flatten; in my case it is 3.)
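As a quick illustration of the layers argument, here is a hypothetical DataFrame with a struct nested two levels deep (the column and field names are invented for the example):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# "name" is a struct that itself contains a "meta" struct
nested = spark.createDataFrame([
    Row(id=1, name=Row(first="Jane", meta=Row(initial="J", city="Boston")))
])

flat = flatten_df(nested, 2)
flat.printSchema()
# id is left alone; the structs are expanded into name_first,
# name_meta_initial and name_meta_city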