I have a DataFrame with the following structure:
|-- data: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
| |-- details: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
How can I flatten the struct and create a new DataFrame with this structure:
|-- id: long (nullable = true)
|-- keyNote: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- note: string (nullable = true)
|-- details: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there something like explode, but for structs?
Answer 0 (score: 39)
This should work in Spark 1.6 or later:
df.select(df.col("data.*"))
or
df.select(df.col("data.id"), df.col("data.keyNote"), df.col("data.details"))
Answer 1 (score: 4)
Here is a function that does what you want, and that can handle multiple nested columns containing columns with the same name:
from pyspark.sql import functions as F

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']

    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flat_df
Before:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo: struct (nullable = true)
| |-- a: float (nullable = true)
| |-- b: float (nullable = true)
| |-- c: integer (nullable = true)
|-- bar: struct (nullable = true)
| |-- a: float (nullable = true)
| |-- b: float (nullable = true)
| |-- c: integer (nullable = true)
After:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo_a: float (nullable = true)
|-- foo_b: float (nullable = true)
|-- foo_c: integer (nullable = true)
|-- bar_a: float (nullable = true)
|-- bar_b: float (nullable = true)
|-- bar_c: integer (nullable = true)
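A quick usage sketch (the sample row and its values are invented; the schema matches the "before" tree above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("x1", "y1", (1.0, 2.0, 3), (4.0, 5.0, 6))],
    "x string, y string, foo struct<a: float, b: float, c: int>, "
    "bar struct<a: float, b: float, c: int>",
)

flatten_df(df).printSchema()  # prints the flattened "after" schema above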
Answer 2 (score: 2)
An easy way is to use SQL: you can build a SQL query string that aliases the nested columns as flat ones.
1. Retrieve the DataFrame schema (df.schema())
2. Transform the schema to SQL (iterate over schema().fields() and build an aliased select expression for each nested field)
3. Run the generated query:
val newDF = sqlContext.sql("SELECT " + sqlGenerated + " FROM source")
(I prefer the SQL way, since you can easily test it in the spark-shell and it is cross-language.)
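A rough PySpark sketch of those steps (one level of struct nesting only; the view name source and the helper build_flat_select are my own inventions, not part of this answer):

from pyspark.sql.types import StructType

def build_flat_select(schema):
    # Turn each struct field into an aliased path; keep flat fields as-is.
    parts = []
    for field in schema.fields:
        if isinstance(field.dataType, StructType):
            for child in field.dataType.fields:
                parts.append("{0}.{1} AS {0}_{1}".format(field.name, child.name))
        else:
            parts.append(field.name)
    return ", ".join(parts)

df.createOrReplaceTempView("source")
new_df = spark.sql("SELECT " + build_flat_select(df.schema) + " FROM source")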
Answer 3 (score: 2)
I generalized stecos' solution a bit further, so the flattening can be done over more than two struct layers:
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )
    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )
    return flat_df[-1]
Just call it with:
my_flattened_df = flatten_df(my_df_having_nested_structs, 3)
(the second parameter is the number of layers to be flattened; in my case it's 3)
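If you'd rather not count the layers by hand, a small helper can derive the struct depth from the schema (this helper is my own sketch, not part of the answer):

from pyspark.sql.types import StructType

def struct_depth(schema, depth=0):
    # Length of the deepest chain of nested StructType fields below this schema.
    child_depths = [
        struct_depth(f.dataType, depth + 1)
        for f in schema.fields
        if isinstance(f.dataType, StructType)
    ]
    return max(child_depths, default=depth)

layers = max(struct_depth(my_df_having_nested_structs.schema), 1)
my_flattened_df = flatten_df(my_df_having_nested_structs, layers)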
Answer 4 (score: 2)
This version of flatten_df flattens the DataFrame at every layer level, using a stack to avoid recursive calls:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    stack = [((), nested_df)]  # (parent path, projected DataFrame)
    columns = []

    while len(stack) > 0:
        parents, df = stack.pop()

        # Leaf columns: alias with the full dotted path, joined by underscores.
        flat_cols = [
            col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]

        # Struct columns: project them and push onto the stack.
        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]

        columns.extend(flat_cols)

        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))

    return nested_df.select(columns)
Example:
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField("some", StringType()),
    StructField("nested", StructType([
        StructField("nestedchild1", StringType()),
        StructField("nestedchild2", StringType())
    ])),
    StructField("renested", StructType([
        StructField("nested", StructType([
            StructField("nestedchild1", StringType()),
            StructField("nestedchild2", StringType())
        ]))
    ]))
])

data = [
    {
        "some": "value1",
        "nested": {
            "nestedchild1": "value2",
            "nestedchild2": "value3",
        },
        "renested": {
            "nested": {
                "nestedchild1": "value4",
                "nestedchild2": "value5",
            }
        }
    }
]

df = spark.createDataFrame(data, schema)
flat_df = flatten_df(df)
print(flat_df.collect())
It prints:
[Row(some=u'value1', renested_nested_nestedchild1=u'value4', renested_nested_nestedchild2=u'value5', nested_nestedchild1=u'value2', nested_nestedchild2=u'value3')]
Answer 5 (score: 1)
The following worked for me in Spark SQL:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object StackOverFlowQuestion {
  def main(args: Array[String]): Unit = {

    val logger = Logger.getLogger("FlattenTest")
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val spark = SparkSession.builder()
      .appName("FlattenTest")
      .config("spark.sql.warehouse.dir", "C:\\Temp\\hive")
      .master("local[2]")
      //.enableHiveSupport()
      .getOrCreate()

    val stringTest =
      """{
        "total_count": 123,
        "page_size": 20,
        "another_id": "gdbfdbfdbd",
        "sen": [{
            "id": 123,
            "ses_id": 12424343,
            "columns": {
                "blah": "blah",
                "count": 1234
            },
            "class": {},
            "class_timestamps": {},
            "sentence": "spark is good"
        }]
      }
      """
    val result = List(stringTest)
    val githubRdd = spark.sparkContext.makeRDD(result)
    val gitHubDF = spark.read.json(githubRdd)
    gitHubDF.show()
    gitHubDF.printSchema()

    gitHubDF.registerTempTable("JsonTable")

    spark.sql("with cte as" +
      "(" +
      "select explode(sen) as senArray from JsonTable" +
      "), cte_2 as" +
      "(" +
      "select senArray.ses_id,senArray.ses_id,senArray.columns.* from cte" +
      ")" +
      "select * from cte_2"
    ).show()

    spark.stop()
  }
}
Output:
+----------+---------+--------------------+-----------+
|another_id|page_size| sen|total_count|
+----------+---------+--------------------+-----------+
|gdbfdbfdbd| 20|[[[blah, 1234], 1...| 123|
+----------+---------+--------------------+-----------+
root
|-- another_id: string (nullable = true)
|-- page_size: long (nullable = true)
|-- sen: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: struct (nullable = true)
| | | |-- blah: string (nullable = true)
| | | |-- count: long (nullable = true)
| | |-- id: long (nullable = true)
| | |-- sentence: string (nullable = true)
| | |-- ses_id: long (nullable = true)
|-- total_count: long (nullable = true)
+--------+--------+----+-----+
| ses_id| ses_id|blah|count|
+--------+--------+----+-----+
|12424343|12424343|blah| 1234|
+--------+--------+----+-----+
Answer 6 (score: 1)
You can use this approach if you only need to convert struct types. I would not suggest converting arrays, as it could lead to duplicate records.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_schema(schema, prefix=""):
    return_schema = []
    for field in schema.fields:
        if isinstance(field.dataType, StructType):
            if prefix:
                return_schema = return_schema + flatten_schema(field.dataType, "{}.{}".format(prefix, field.name))
            else:
                return_schema = return_schema + flatten_schema(field.dataType, field.name)
        else:
            if prefix:
                field_path = "{}.{}".format(prefix, field.name)
                return_schema.append(col(field_path).alias(field_path.replace(".", "_")))
            else:
                return_schema.append(field.name)
    return return_schema
You can use it as:
new_schema = flatten_schema(df.schema)
df1 = df.select(new_schema)
df1.show()
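To see why exploding arrays duplicates records, as the caveat above warns: each array element multiplies its surrounding row. A made-up illustration:

from pyspark.sql.functions import explode

df2 = spark.createDataFrame([(1, ["a", "b", "c"])], "id int, tags array<string>")
df2.withColumn("tag", explode("tags")).show()
# id=1 now appears three times, once per element of tags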
Answer 7 (score: 1)
A more compact and efficient implementation: there is no need to create lists and iterate over them. Instead, you "act" on each field depending on its type (whether or not it is a struct).
from pyspark.sql.functions import col

def flatten_df(nested_df):
    stack = [((), nested_df)]
    columns = []

    while len(stack) > 0:
        parents, df = stack.pop()
        for column_name, column_type in df.dtypes:
            if column_type[:6] == "struct":
                projected_df = df.select(column_name + ".*")
                stack.append((parents + (column_name,), projected_df))
            else:
                columns.append(col(".".join(parents + (column_name,))).alias("_".join(parents + (column_name,))))

    return nested_df.select(columns)
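Usage is the same as for the stack-based version in Answer 4; with the df built there:

flat_df = flatten_df(df)
print(flat_df.collect())

Because each struct is pushed onto the stack together with its parent path, the aliases keep the full lineage (e.g. renested_nested_nestedchild1), just like the list-based version.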
Answer 8 (score: 1)
For Spark 2.4.5,

df.select(df.col("data.*"))

will give you an org.apache.spark.sql.AnalysisException: No such struct field * in ... exception, while this will work:

df.select($"data.*")
Answer 9 (score: 0)
This is for Scala Spark.
val totalMainArrayBuffer = collection.mutable.ArrayBuffer[String]()

def flatten_df_Struct(dfTemp: org.apache.spark.sql.DataFrame, dfTotalOuter: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  //dfTemp.printSchema
  val totalStructCols = dfTemp.dtypes.map(x => x.toString.substring(1, x.toString.size - 1)).filter(_.split(",", 2)(1).contains("Struct")) // in case the column names come with the word Struct embedded in them
  val mainArrayBuffer = collection.mutable.ArrayBuffer[String]()
  for (totalStructCol <- totalStructCols) {
    val tempArrayBuffer = collection.mutable.ArrayBuffer[String]()
    tempArrayBuffer += s"${totalStructCol.split(",")(0)}.*"
    //tempArrayBuffer.toSeq.toDF.show(false)
    val columnsInside = dfTemp.selectExpr(tempArrayBuffer: _*).columns
    for (column <- columnsInside)
      mainArrayBuffer += s"${totalStructCol.split(",")(0)}.${column} as ${totalStructCol.split(",")(0)}_${column}"
    //mainArrayBuffer.toSeq.toDF.show(false)
  }
  //dfTemp.selectExpr(mainArrayBuffer:_*).printSchema
  val nonStructCols = dfTemp.selectExpr(mainArrayBuffer: _*).dtypes.map(x => x.toString.substring(1, x.toString.size - 1)).filter(!_.split(",", 2)(1).contains("Struct"))
  for (nonStructCol <- nonStructCols)
    totalMainArrayBuffer += s"${nonStructCol.split(",")(0).replace("_", ".")} as ${nonStructCol.split(",")(0)}" // replace _ with . in the original select clause if it's an already-nested column
  dfTemp.selectExpr(mainArrayBuffer: _*).dtypes.map(x => x.toString.substring(1, x.toString.size - 1)).filter(_.split(",", 2)(1).contains("Struct")).size match {
    case value if value == 0 => dfTotalOuter.selectExpr(totalMainArrayBuffer: _*) // no structs left: select the flattened schema
    case _ => flatten_df_Struct(dfTemp.selectExpr(mainArrayBuffer: _*), dfTotalOuter) // recurse one level deeper
  }
}

def flatten_df(dfTemp: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  var totalArrayBuffer = collection.mutable.ArrayBuffer[String]()
  val totalNonStructCols = dfTemp.dtypes.map(x => x.toString.substring(1, x.toString.size - 1)).filter(!_.split(",", 2)(1).contains("Struct"))
  for (totalNonStructCol <- totalNonStructCols)
    totalArrayBuffer += s"${totalNonStructCol.split(",")(0)}"
  totalMainArrayBuffer.clear
  flatten_df_Struct(dfTemp, dfTemp) // flattened schema is now in totalMainArrayBuffer
  totalArrayBuffer = totalArrayBuffer ++ totalMainArrayBuffer
  dfTemp.selectExpr(totalArrayBuffer: _*)
}
flatten_df(dfTotal.withColumn("tempStruct",lit(5))).printSchema
Input file:
{"num1":1,"num2":2,"bool1":true,"bool2":false,"double1":4.5,"double2":5.6,"str1":"a","str2":"b","arr1":[3,4,5],"map1":{"cool":1,"okay":2,"normal":3},"carInfo":{"Engine":{"Make":"sa","Power":{"IC":"900","battery":"165"},"Redline":"11500"} ,"Tyres":{"Make":"Pirelli","Compound":"c1","Life":"120"}}}
{"num1":3,"num2":4,"bool1":false,"bool2":false,"double1":4.2,"double2":5.5,"str1":"u","str2":"n","arr1":[6,7,9],"map1":{"fast":1,"medium":2,"agressive":3},"carInfo":{"Engine":{"Make":"na","Power":{"IC":"800","battery":"150"},"Redline":"10000"} ,"Tyres":{"Make":"Pirelli","Compound":"c2","Life":"100"}}}
{"num1":8,"num2":4,"bool1":true,"bool2":true,"double1":5.7,"double2":7.5,"str1":"t","str2":"k","arr1":[11,12,23],"map1":{"preserve":1,"medium":2,"fast":3},"carInfo":{"Engine":{"Make":"ta","Power":{"IC":"950","battery":"170"},"Redline":"12500"} ,"Tyres":{"Make":"Pirelli","Compound":"c3","Life":"80"}}}
{"num1":7,"num2":9,"bool1":false,"bool2":true,"double1":33.2,"double2":7.5,"str1":"b","str2":"u","arr1":[12,14,5],"map1":{"normal":1,"preserve":2,"agressive":3},"carInfo":{"Engine":{"Make":"pa","Power":{"IC":"920","battery":"160"},"Redline":"11800"} ,"Tyres":{"Make":"Pirelli","Compound":"c4","Life":"70"}}}
Before:
root
|-- arr1: array (nullable = true)
| |-- element: long (containsNull = true)
|-- bool1: boolean (nullable = true)
|-- bool2: boolean (nullable = true)
|-- carInfo: struct (nullable = true)
| |-- Engine: struct (nullable = true)
| | |-- Make: string (nullable = true)
| | |-- Power: struct (nullable = true)
| | | |-- IC: string (nullable = true)
| | | |-- battery: string (nullable = true)
| | |-- Redline: string (nullable = true)
| |-- Tyres: struct (nullable = true)
| | |-- Compound: string (nullable = true)
| | |-- Life: string (nullable = true)
| | |-- Make: string (nullable = true)
|-- double1: double (nullable = true)
|-- double2: double (nullable = true)
|-- map1: struct (nullable = true)
| |-- agressive: long (nullable = true)
| |-- cool: long (nullable = true)
| |-- fast: long (nullable = true)
| |-- medium: long (nullable = true)
| |-- normal: long (nullable = true)
| |-- okay: long (nullable = true)
| |-- preserve: long (nullable = true)
|-- num1: long (nullable = true)
|-- num2: long (nullable = true)
|-- str1: string (nullable = true)
|-- str2: string (nullable = true)
After:
root
|-- arr1: array (nullable = true)
| |-- element: long (containsNull = true)
|-- bool1: boolean (nullable = true)
|-- bool2: boolean (nullable = true)
|-- double1: double (nullable = true)
|-- double2: double (nullable = true)
|-- num1: long (nullable = true)
|-- num2: long (nullable = true)
|-- str1: string (nullable = true)
|-- str2: string (nullable = true)
|-- map1_agressive: long (nullable = true)
|-- map1_cool: long (nullable = true)
|-- map1_fast: long (nullable = true)
|-- map1_medium: long (nullable = true)
|-- map1_normal: long (nullable = true)
|-- map1_okay: long (nullable = true)
|-- map1_preserve: long (nullable = true)
|-- carInfo_Engine_Make: string (nullable = true)
|-- carInfo_Engine_Redline: string (nullable = true)
|-- carInfo_Tyres_Compound: string (nullable = true)
|-- carInfo_Tyres_Life: string (nullable = true)
|-- carInfo_Tyres_Make: string (nullable = true)
|-- carInfo_Engine_Power_IC: string (nullable = true)
|-- carInfo_Engine_Power_battery: string (nullable = true)
Tried with 2 levels, and it worked.
Answer 10 (score: 0)
A PySpark solution to flatten a nested df with both struct and array types, at any level of depth. It improves on this: https://stackoverflow.com/a/56533459/7131019
from pyspark.sql.types import *
from pyspark.sql import functions as f

def flatten_structs(nested_df):
    stack = [((), nested_df)]
    columns = []

    while len(stack) > 0:
        parents, df = stack.pop()

        # Non-struct columns (including arrays) are kept, aliased by their path.
        flat_cols = [
            f.col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
            for c in df.dtypes
            if c[1][:6] != "struct"
        ]

        nested_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:6] == "struct"
        ]

        columns.extend(flat_cols)

        for nested_col in nested_cols:
            projected_df = df.select(nested_col + ".*")
            stack.append((parents + (nested_col,), projected_df))

    return nested_df.select(columns)

def flatten_array_struct_df(df):
    # Keep exploding until no array columns remain, flattening the
    # structs that each explode exposes.
    array_cols = [
        c[0]
        for c in df.dtypes
        if c[1][:5] == "array"
    ]

    while len(array_cols) > 0:
        for array_col in array_cols:
            df = df.withColumn(array_col, f.explode(f.col(array_col)))

        df = flatten_structs(df)

        array_cols = [
            c[0]
            for c in df.dtypes
            if c[1][:5] == "array"
        ]
    return df

flat_df = flatten_array_struct_df(df)
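One caveat (my note, not the answer's): f.explode drops rows whose array is null or empty. If those rows must survive the flattening, a variant using f.explode_outer keeps them with nulls:

# Hypothetical swap inside flatten_array_struct_df's loop:
df = df.withColumn(array_col, f.explode_outer(f.col(array_col)))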