Flatten a nested array into rows

Time: 2019-07-24 02:02:51

Tags: python apache-spark pyspark

I have a DataFrame with a nested array field (events).

 |-- id: long (nullable = true)
 |-- events: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to flatten the data and get a DataFrame with a schema like this:

-- id: long (nullable = true)
-- key: string (nullable = true)
-- timestamp: long (nullable = true)
-- value: string (nullable = true)

Sample input:

+-----+-------------------------------------------------------+
|id   |             events                                    |
+-----+-------------------------------------------------------+
|  1  | [[john , 1547758879, 1], [bob, 1547759154, 1]]        |
|  2  | [[samantha , 1547758879, 1], [eric, 1547759154, 1]]   |
+-----+-------------------------------------------------------+

Sample output:

+-----+---------+----------+-----+
|id   |key      |timestamp |value|
+-----+---------+----------+-----+
|  1  |john     |1547758879|    1|
|  1  |bob      |1547759154|    1|
|  2  |samantha |1547758879|    1|
|  2  |eric     |1547759154|    1|
+-----+---------+----------+-----+

3 answers:

Answer 0 (score: 1)

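You can use explode to split each element of the array into its own row, then use select to pull the individual fields of the struct out into separate columns.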

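// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables toDF and the $"col" syntax

case class Event(key: String, timestamp: Long, value: String)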
val df = List((1, Seq(Event("john", 1547758879, "1"), 
                      Event("bob", 1547759154, "1"))), 
              (2, Seq(Event("samantha", 1547758879, "1"), 
                      Event("eric", 1547759154, "1")))
             ).toDF("id","events")

df.show(false)
/*--+--------------------------------------------------+
|id |events                                            |
+---+--------------------------------------------------+
|1  |[[john, 1547758879, 1], [bob, 1547759154, 1]]     |
|2  |[[samantha, 1547758879, 1], [eric, 1547759154, 1]]|
+---+-------------------------------------------------*/

val exploded = df.withColumn("events", explode($"events"))
exploded.show(false)
/*--+-------------------------+
|id |events                   |
+---+-------------------------+
|1  |[john, 1547758879, 1]    |
|1  |[bob, 1547759154, 1]     |
|2  |[samantha, 1547758879, 1]|
|2  |[eric, 1547759154, 1]    |
+---+------------------------*/

val unstructured = exploded.select($"id", $"events.key", $"events.timestamp", $"events.value")
unstructured.show
/*--+--------+----------+-----+
| id|     key| timestamp|value|
+---+--------+----------+-----+
|  1|    john|1547758879|    1|
|  1|     bob|1547759154|    1|
|  2|samantha|1547758879|    1|
|  2|    eric|1547759154|    1|
+---+--------+----------+----*/
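
To make the two steps concrete, here is a plain-Python sketch of what explode followed by select does to the rows. It is a model of the semantics only, not Spark code: rows are ordinary dicts, and the helper names `explode_rows` and `flatten_events`, as well as the extra row with id 3, are made up for illustration. Like Spark's explode, the sketch emits nothing for a row whose array is null or empty.

```python
# Plain-Python model of explode + select; this is NOT Spark code.
# Each row is a dict; `events` holds a list of dicts (structs) or None.

def explode_rows(rows, column):
    """Yield one output row per element of row[column].

    Rows whose array is None or empty produce no output.
    """
    for row in rows:
        for element in row[column] or []:
            # Copy the row, replacing the array with a single element.
            yield {**row, column: element}


def flatten_events(rows):
    """Explode `events`, then promote the struct fields to top-level columns."""
    return [
        {
            "id": row["id"],
            "key": row["events"]["key"],
            "timestamp": row["events"]["timestamp"],
            "value": row["events"]["value"],
        }
        for row in explode_rows(rows, "events")
    ]


rows = [
    {"id": 1, "events": [
        {"key": "john", "timestamp": 1547758879, "value": "1"},
        {"key": "bob", "timestamp": 1547759154, "value": "1"},
    ]},
    {"id": 2, "events": [
        {"key": "samantha", "timestamp": 1547758879, "value": "1"},
        {"key": "eric", "timestamp": 1547759154, "value": "1"},
    ]},
    {"id": 3, "events": None},  # hypothetical extra row: yields no output
]

flat = flatten_events(rows)
for row in flat:
    print(row)
```

If rows with a null or empty array need to survive the flattening, Spark also provides explode_outer, which keeps such rows and fills the exploded column with null.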

Answer 1 (score: 0)

You can try the following approach (note that it uses pandas rather than Spark):

  1. Count the number of elements in each row of events:
import pandas as pd

## recreate the dataframe sample
df = pd.DataFrame(
    [
        [1, [['john' , 1547758879, 1], ['bob', 1547759154, 1]]],
        [2, [['samantha' , 1547758879, 1], ['eric', 1547759154, 1]]]
    ], columns = ['id','events']
)

df['elements'] = df['events'].apply(len)

Out[36]: 
   id                                             events  elements
0   1      [[john, 1547758879, 1], [bob, 1547759154, 1]]         2
1   2  [[samantha, 1547758879, 1], [eric, 1547759154,1]]         2
  2. Flatten the nested results into a single list of lists:
values = df['events'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

>> flat_results
Out[38]: 
[['john', 1547758879, 1],
 ['bob', 1547759154, 1],
 ['samantha', 1547758879, 1],
 ['eric', 1547759154, 1]]
  3. Create a new DataFrame from the flattened list:
new_df = pd.DataFrame(flat_results, columns=['key','timestamp','value'])
  4. Use the element counts to repeat the ids from the original DataFrame:
new_df['id'] = df['id'].repeat(df['elements'].values).values

>> new_df
Out[40]: 
        key   timestamp  value  id
0      john  1547758879      1   1
1       bob  1547759154      1   1
2  samantha  1547758879      1   2
3      eric  1547759154      1   2

Answer 2 (score: 0)

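from pyspark.sql import functions as fn

# Explode the array into one row per element, then pull out the struct fields.
df.select("id", fn.explode(df.events).alias('events')). \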
    select("id", fn.col("events").getItem("key").alias("key"),
           fn.col("events").getItem("value").alias("value"),
           fn.col("events").getItem("timestamp").alias("timestamp"))