Question

I have a DataFrame with a nested array field (events).

 |-- id: long (nullable = true)
 |-- events: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to flatten the data and get a DataFrame that looks like this:

 |-- id: long (nullable = true)
 |-- key: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- value: string (nullable = true)

Sample input:

+-----+-------------------------------------------------------+
|id   |             events                                    |
+-----+-------------------------------------------------------+
|  1  | [[john , 1547758879, 1], [bob, 1547759154, 1]]        |
|  2  | [[samantha , 1547758879, 1], [eric, 1547759154, 1]]   |
+-----+-------------------------------------------------------+

Sample output:

+-----+---------+----------+-----+
|id   |key      |timestamp |value|
+-----+---------+----------+-----+
|  1  |john     |1547758879|    1|
|  1  |bob      |1547759154|    1|
|  2  |samantha |1547758879|    1|
|  2  |eric     |1547759154|    1|
+-----+---------+----------+-----+

Answer 1

You can use explode to split each element of the array into its own row, then use select to pull the struct's fields out into separate columns.

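import org.apache.spark.sql.functions.explode
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

case class Event(key: String, timestamp: Long, value: String)
val df = List((1, Seq(Event("john", 1547758879, "1"),
                      Event("bob", 1547759154, "1"))),
              (2, Seq(Event("samantha", 1547758879, "1"),
                      Event("eric", 1547759154, "1")))
             ).toDF("id","events")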

df.show(false)
/*--+--------------------------------------------------+
|id |events                                            |
+---+--------------------------------------------------+
|1  |[[john, 1547758879, 1], [bob, 1547759154, 1]]     |
|2  |[[samantha, 1547758879, 1], [eric, 1547759154, 1]]|
+---+-------------------------------------------------*/

val exploded = df.withColumn("events", explode($"events"))
exploded.show(false)
/*--+-------------------------+
|id |events                   |
+---+-------------------------+
|1  |[john, 1547758879, 1]    |
|1  |[bob, 1547759154, 1]     |
|2  |[samantha, 1547758879, 1]|
|2  |[eric, 1547759154, 1]    |
+---+------------------------*/

val unstructured = exploded.select($"id", $"events.key", $"events.timestamp", $"events.value")
unstructured.show
/*--+--------+----------+-----+
| id|     key| timestamp|value|
+---+--------+----------+-----+
|  1|    john|1547758879|    1|
|  1|     bob|1547759154|    1|
|  2|samantha|1547758879|    1|
|  2|    eric|1547759154|    1|
+---+--------+----------+----*/
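
One thing to watch for with this approach: Spark's explode drops rows whose array is null or empty, while explode_outer keeps them and fills the struct fields with null. The plain-Python sketch below only models that difference on a list of dicts; it is not how Spark implements it, and the names explode_rows and select_fields are made up for illustration.

```python
def explode_rows(rows, column, outer=False):
    """Yield one output row per element of row[column].

    With outer=False, rows whose array is None or empty are dropped
    (modelled on explode). With outer=True they are kept with the
    column set to None (modelled on explode_outer).
    """
    for row in rows:
        items = row.get(column)
        if not items:
            if outer:
                yield {**row, column: None}
            continue
        for item in items:
            yield {**row, column: item}


def select_fields(rows, column, fields):
    """Replace the struct in row[column] with one top-level key per field."""
    for row in rows:
        struct = row[column] or {}
        flat_row = {k: v for k, v in row.items() if k != column}
        for field in fields:
            flat_row[field] = struct.get(field)
        yield flat_row


rows = [
    {"id": 1, "events": [
        {"key": "john", "timestamp": 1547758879, "value": "1"},
        {"key": "bob", "timestamp": 1547759154, "value": "1"},
    ]},
    {"id": 2, "events": []},  # hypothetical row with no events
]

fields = ["key", "timestamp", "value"]
flat = list(select_fields(explode_rows(rows, "events"), "events", fields))
flat_outer = list(
    select_fields(explode_rows(rows, "events", outer=True), "events", fields)
)
# flat has 2 rows (id 2 is dropped); flat_outer has 3 (id 2 kept, fields None)
```

If ids without events have to survive the flattening, swapping explode for explode_outer in the Spark code above is probably what you want.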

Answer 2

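You can try the following approach (note that it uses a pandas DataFrame rather than a Spark one):

Count the number of elements in each row of events: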

import pandas as pd

# recreate the sample DataFrame
df = pd.DataFrame(
    [
        [1, [['john', 1547758879, 1], ['bob', 1547759154, 1]]],
        [2, [['samantha', 1547758879, 1], ['eric', 1547759154, 1]]]
    ], columns=['id', 'events']
)

# number of events in each row, used later to repeat the ids
df['elements'] = df['events'].apply(len)

Out[36]: 
   id                                              events  elements
0   1       [[john, 1547758879, 1], [bob, 1547759154, 1]]         2
1   2  [[samantha, 1547758879, 1], [eric, 1547759154, 1]]         2

Flatten the nested results into a single list of lists:

values = df['events'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

>> flat_results
Out[38]: 
[['john', 1547758879, 1],
 ['bob', 1547759154, 1],
 ['samantha', 1547758879, 1],
 ['eric', 1547759154, 1]]

Create a new DataFrame from the flattened list:

new_df = pd.DataFrame(flat_results, columns=['key','timestamp','value'])

Use the element counts to repeat the ids from the original DataFrame:

new_df['id'] = df['id'].repeat(df['elements'].values).values

>> new_df
Out[40]: 
        key   timestamp  value  id
0      john  1547758879      1   1
1       bob  1547759154      1   1
2  samantha  1547758879      1   2
3      eric  1547759154      1   2
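
As a shorter variant of the steps above, and assuming pandas 0.25 or newer (where DataFrame.explode is available), the counting and repeating can be left to pandas. This is only a sketch on the same sample data; the names exploded, fields and flat are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, [["john", 1547758879, 1], ["bob", 1547759154, 1]]],
        [2, [["samantha", 1547758879, 1], ["eric", 1547759154, 1]]],
    ],
    columns=["id", "events"],
)

# One row per inner [key, timestamp, value] list; id is repeated for us.
exploded = df.explode("events").reset_index(drop=True)

# Split each inner list into its own columns.
fields = pd.DataFrame(
    exploded["events"].tolist(), columns=["key", "timestamp", "value"]
)

# Put id back next to the extracted fields.
flat = pd.concat([exploded[["id"]], fields], axis=1)
```

Here flat has the columns id, key, timestamp and value, in that order, with one row per event.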

Answer 3

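from pyspark.sql import functions as fn

# explode to one row per event, then pull each struct field into its own column
df.select("id", fn.explode(df.events).alias("events")). \
    select("id", fn.col("events").getItem("key").alias("key"),
           fn.col("events").getItem("timestamp").alias("timestamp"),
           fn.col("events").getItem("value").alias("value"))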
