Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the work I've seen is written for a specific schema, and I'd like to be able to generically flatten a DataFrame with different nested types (e.g. StructType, ArrayType, MapType, etc.).
Say I have a schema like:
StructType(List(
    StructField(field1, ...),
    StructField(field2, ...),
    StructField(nested_array, ArrayType(StructType(List(
        StructField(nested_field1, ...),
        StructField(nested_field2, ...)))), ...)))
And I'm looking to flatten this into a table with a structure like:
field1
field2
nested_array.nested_field1
nested_array.nested_field2
FYI, looking for suggestions in PySpark, but other flavors of Spark are also appreciated.
Answer 0 (score: 10)
This question might be a bit old, but for anyone still looking for a solution, you can inline complex data types using select *:
First let's create the nested DataFrame:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
nested_df = hc.read.json(sc.parallelize(["""
{
  "field1": 1,
  "field2": 2,
  "nested_array": {
    "nested_field1": 3,
    "nested_field2": 4
  }
}
"""]))
Now to flatten it:
flat_df = nested_df.select("field1", "field2", "nested_array.*")
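On the sample data above, flat_df.show() should print something like:
+------+------+-------------+-------------+
|field1|field2|nested_field1|nested_field2|
+------+------+-------------+-------------+
|     1|     2|            3|            4|
+------+------+-------------+-------------+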
You can find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html
If you have many nested structs, you can discover them dynamically:
flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']    # non-struct columns
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']  # struct columns
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
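Note that this only flattens a single level of nesting. A minimal sketch of my own (not part of the original answer) that repeats the same select until no struct columns remain:

def flatten_all_structs(df):
    # Repeat the one-level flatten until no struct columns are left.
    # Beware: column names may collide if sibling structs share field names.
    nested_cols = [c[0] for c in df.dtypes if c[1][:6] == 'struct']
    while nested_cols:
        flat_cols = [c[0] for c in df.dtypes if c[1][:6] != 'struct']
        df = df.select(*flat_cols, *[c + ".*" for c in nested_cols])
        nested_cols = [c[0] for c in df.dtypes if c[1][:6] == 'struct']
    return df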
Answer 1 (score: 2)
I've developed a recursive approach to flatten any nested DataFrame. The implementation lives in the AWS Data Wrangler project:
import awswrangler
session = awswrangler.Session(spark_session=spark)
dfs = session.spark.flatten(dataframe=df_nested)
for name, df_flat in dfs.items():
    print(name)
    df_flat.show()
Or check the sources to see the original implementation.
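Judging from the loop above, flatten appears to return a dictionary mapping a generated table name to a flattened DataFrame, with nested arrays split out into their own child tables rather than exploded in place; check the sources for the exact behavior.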
Answer 2 (score: 1)
Here's my final approach:
1) Map the rows in the DataFrame to an RDD of dicts. Find suitable Python code online for flattening a dict (one possible sketch is given below).
flat_rdd = nested_df.rdd.map(lambda x: flatten(x))
where
def flatten(x):
    x_dict = x.asDict()
    ...some flattening code...
    return x_dict
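For illustration, here is one possible flattening helper (a sketch of my own, not the original answer's code), assuming the nesting consists of structs only; it joins nested keys with dots to match the naming in the question:

def flatten(x):
    # Convert the Row (and any nested Rows) to plain dicts.
    x_dict = x.asDict(recursive=True)

    def _flatten(d, prefix=""):
        out = {}
        for k, v in d.items():
            key = prefix + "." + k if prefix else k
            if isinstance(v, dict):
                out.update(_flatten(v, key))  # recurse into nested dicts
            else:
                out[key] = v
        return out

    return _flatten(x_dict)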
2) Convert the RDD[dict] back to a dataframe
flat_df = sqlContext.createDataFrame(flat_rdd)
Answer 3 (score: 0)
This will flatten a nested df with both struct types and array types. It usually helps when reading data in via JSON. It is an improvement on https://stackoverflow.com/a/56533459/7131019.
from pyspark.sql.types import *
from pyspark.sql import functions as f
def flatten_structs(nested_df):
    # Walk the schema with an explicit stack of (parent path, projected df).
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        # Non-struct columns are selected directly, aliased with their full path.
        flat_cols = [
            f.col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]
        # Struct columns are projected and pushed back onto the stack.
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]
        columns.extend(flat_cols)
        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))
    return nested_df.select(columns)
def flatten_array_struct_df(df):
    # Explode arrays and flatten structs until no array columns remain.
    array_cols = [
        c[0]
        for c in df.dtypes
        if c[1][:5] == "array"
    ]
    while len(array_cols) > 0:
        for array_col in array_cols:
            # explode produces one output row per array element
            df = df.withColumn(array_col, f.explode(f.col(array_col)))
        df = flatten_structs(df)
        array_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:5] == "array"
        ]
    return df

flat_df = flatten_array_struct_df(df)
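Be aware that f.explode produces one output row per array element, so each round of the while loop can multiply the row count, and rows whose arrays are null or empty are dropped by explode (explode_outer keeps them).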
Answer 4 (score: -1)
The following gist will flatten the structure of a nested JSON:
import typing as T
import cytoolz.curried as tz
import pyspark

def schema_to_columns(schema: pyspark.sql.types.StructType) -> T.List[T.List[str]]:
    """
    Produce a flat list of column specs from a possibly nested DataFrame schema
    """
    columns = list()

    def helper(schm: pyspark.sql.types.StructType, prefix: list = None):
        if prefix is None:
            prefix = list()
        for item in schm.fields:
            if isinstance(item.dataType, pyspark.sql.types.StructType):
                helper(item.dataType, prefix + [item.name])
            else:
                columns.append(prefix + [item.name])

    helper(schema)
    return columns

def flatten_frame(frame: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    aliased_columns = list()
    for col_spec in schema_to_columns(frame.schema):
        c = tz.get_in(col_spec, frame)
        if len(col_spec) == 1:
            aliased_columns.append(c)
        else:
            aliased_columns.append(c.alias(':'.join(col_spec)))
    return frame.select(aliased_columns)
Then you can flatten the nested data with:
flatten_data = flatten_frame(nested_df)
This will give you a flattened DataFrame.
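Note that schema_to_columns only recurses into StructType fields, so ArrayType and MapType columns are left in place rather than exploded.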
The gist is taken from https://gist.github.com/DGrady/b7e7ff3a80d7ee16b168eb84603f5599