I have a DataFrame with a nested array field (events).
|-- id: long (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- timestamp: long (nullable = true)
| | |-- value: string (nullable = true)
I want to flatten the data and get a DataFrame with a schema like this:
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- value: string (nullable = true)
Sample input:
+-----+-------------------------------------------------------+
|id | events |
+-----+-------------------------------------------------------+
| 1 | [[john , 1547758879, 1], [bob, 1547759154, 1]] |
| 2 | [[samantha , 1547758879, 1], [eric, 1547759154, 1]] |
+-----+-------------------------------------------------------+
Sample output:
+-----+---------+----------+-----+
|id |key |timestamp |value|
+-----+---------+----------+-----+
| 1 |john |1547758879| 1|
| 1 |bob |1547759154| 1|
| 2 |samantha |1547758879| 1|
| 2 |eric |1547759154| 1|
+-----+---------+----------+-----+
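Answer 0 (score: 1)
You can use explode to split each element of the array into its own row, and then select to pull the individual fields of the struct out into separate columns.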
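import org.apache.spark.sql.functions.explode
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

case class Event(key: String, timestamp: Long, value: String)

val df = List(
  (1, Seq(Event("john", 1547758879, "1"), Event("bob", 1547759154, "1"))),
  (2, Seq(Event("samantha", 1547758879, "1"), Event("eric", 1547759154, "1")))
).toDF("id", "events")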
df.show(false)
/*--+--------------------------------------------------+
|id |events |
+---+--------------------------------------------------+
|1 |[[john, 1547758879, 1], [bob, 1547759154, 1]] |
|2 |[[samantha, 1547758879, 1], [eric, 1547759154, 1]]|
+---+-------------------------------------------------*/
val exploded = df.withColumn("events", explode($"events"))
exploded.show(false)
/*--+-------------------------+
|id |events |
+---+-------------------------+
|1 |[john, 1547758879, 1] |
|1 |[bob, 1547759154, 1] |
|2 |[samantha, 1547758879, 1]|
|2 |[eric, 1547759154, 1] |
+---+------------------------*/
val unstructured = exploded.select($"id", $"events.key", $"events.timestamp", $"events.value")
unstructured.show
/*--+--------+----------+-----+
| id| key| timestamp|value|
+---+--------+----------+-----+
| 1| john|1547758879| 1|
| 1| bob|1547759154| 1|
| 2|samantha|1547758879| 1|
| 2| eric|1547759154| 1|
+---+--------+----------+----*/
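To make the row-level effect of explode followed by select more concrete, here is a minimal plain-Python sketch of the same transformation. It does not use Spark; the function name explode_events, the keep_empty flag, and the extra row with id 3 are hypothetical and only for illustration. One point it tries to show: explode drops rows whose array is null or empty, whereas Spark's explode_outer keeps such rows with nulls in the exploded column.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def explode_events(rows: List[Row], keep_empty: bool = False) -> Iterator[Row]:
    """Yield one flat row per element of each row's "events" list.

    With keep_empty=False, rows whose list is None or empty are dropped,
    mirroring explode. With keep_empty=True, such rows are kept with None
    in the struct fields, mirroring explode_outer.
    """
    for row in rows:
        events = row.get("events") or []
        if not events and keep_empty:
            yield {"id": row["id"], "key": None, "timestamp": None, "value": None}
        for event in events:
            # The "select" step: lift the struct fields to top-level columns.
            yield {"id": row["id"], **event}


rows = [
    {"id": 1, "events": [
        {"key": "john", "timestamp": 1547758879, "value": "1"},
        {"key": "bob", "timestamp": 1547759154, "value": "1"},
    ]},
    {"id": 2, "events": [
        {"key": "samantha", "timestamp": 1547758879, "value": "1"},
        {"key": "eric", "timestamp": 1547759154, "value": "1"},
    ]},
    {"id": 3, "events": None},  # hypothetical extra row, not in the question
]

flat = list(explode_events(rows))
for r in flat:
    print(r)
```

If rows like id 3 need to survive the flattening in Spark itself, explode_outer in place of explode is probably the function to reach for.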
Answer 1 (score: 0)
You can try the following approach with pandas.
First, count the number of elements in each row of events:
import pandas as pd

## recreate the dataframe sample
df = pd.DataFrame(
[
[1, [['john' , 1547758879, 1], ['bob', 1547759154, 1]]],
[2, [['samantha' , 1547758879, 1], ['eric', 1547759154, 1]]]
], columns = ['id','events']
)
df['elements'] = df['events'].apply(lambda x: len(x))
Out[36]:
id events elements
0 1 [[john, 1547758879, 1], [bob, 1547759154, 1]] 2
1 2 [[samantha, 1547758879, 1], [eric, 1547759154,1]] 2
values = df['events'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]
>> flat_results
Out[38]:
[['john', 1547758879, 1],
['bob', 1547759154, 1],
['samantha', 1547758879, 1],
['eric', 1547759154, 1]]
new_df = pd.DataFrame(flat_results, columns=['key','timestamp','value'])
new_df['id'] = df['id'].repeat(df['elements'].values).values
>> new_df
Out[40]:
key timestamp value id
0 john 1547758879 1 1
1 bob 1547759154 1 1
2 samantha 1547758879 1 2
3 eric 1547759154 1 2
Answer 2 (score: 0)
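from pyspark.sql import functions as fn

# explode yields one row per array element; getItem pulls out each struct field
df.select("id", fn.explode(df.events).alias("events")) \
    .select("id",
            fn.col("events").getItem("key").alias("key"),
            fn.col("events").getItem("value").alias("value"),
            fn.col("events").getItem("timestamp").alias("timestamp"))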