Question

This is my first time asking a question like this. I have a DataFrame like this:

+----------+----------------------------------+
|        id|                              data|
+----------+----------------------------------+
|     '001'|     '[{"index":1}, {"index": 2}]'|
|     '002'|     '[{"index":3}, {"index": 4}]'|
+----------+----------------------------------+

I need to convert it into a new DataFrame:

+----------+---------+
|        id|    index|
+----------+---------+
|     '001'|        1|
|     '001'|        2|
|     '002'|        3|
|     '002'|        4|
+----------+---------+

Is there a way to do this? Thanks.

Answer 1

Try this:

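from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('001', '[{"index": 1}, {"index": 2}]'),
     ('002', '[{"index": 3}, {"index": 4}]')],
    ("id", "data"))

# Parse the JSON string into an array of structs with one integer field
schema = ArrayType(StructType([StructField("index", IntegerType())]))
df = df.withColumn("json", from_json("data", schema))

df.show(100)
# Produce one output row per element of the parsed array
df = df.select(col("id"), explode("json").alias("index"))
df.show(100)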


+---+--------------------+----------+
| id|                data|      json|
+---+--------------------+----------+
|001|[{"index": 1}, {"...|[[1], [2]]|
|002|[{"index": 3}, {"...|[[3], [4]]|
+---+--------------------+----------+

+---+-----+
| id|index|
+---+-----+
|001|  [1]|
|001|  [2]|
|002|  [3]|
|002|  [4]|
+---+-----+
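
Note that the output above still shows each value wrapped in a struct (`[1]`) rather than the bare integer the question asks for. As a rough illustration of the target shape, here is a minimal plain-Python sketch (no Spark involved) that parses each JSON string and emits one (id, index) pair per element. The helper name `flatten` and the `rows` list are made up for this example.

```python
import json

# Sample rows copied from the question: (id, JSON string)
rows = [
    ("001", '[{"index": 1}, {"index": 2}]'),
    ("002", '[{"index": 3}, {"index": 4}]'),
]

def flatten(rows):
    """Parse each JSON array and emit one (id, index) pair per element."""
    out = []
    for row_id, data in rows:
        # json.loads plays the role of from_json: string -> list of objects
        for item in json.loads(data):
            # One output row per element (explode), keeping only the "index" field
            out.append((row_id, item["index"]))
    return out

result = flatten(rows)
print(result)
```

In PySpark, the analogous last step would presumably be to select the nested field from the exploded struct (for example `col("index.index")`) instead of the struct itself; treat that as a suggestion to try rather than something verified against the answer's Spark version.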

Answer 2

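Here is another way I solved it. It takes several statements, but they can all be combined into a single statement that produces the desired output.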

After creating the initial DataFrame named "df":

df.show(5,False)
+---+----------------------------+
|id |data                        |
+---+----------------------------+
|001|[{"index": 1}, {"index": 2}]|
|002|[{"index": 3}, {"index": 4}]|
+---+----------------------------+

from pyspark.sql.functions import col, explode, regexp_extract, split
df2 = df.select(col('id'), split(df.data, ',').alias('list'))

This creates a DataFrame named "df2" in which the second column has been split into an array type.

df2.show(5,False)
+---+-------------------------------+
|id |list                           |
+---+-------------------------------+
|001|[[{"index": 1},  {"index": 2}]]|
|002|[[{"index": 3},  {"index": 4}]]|
+---+-------------------------------+

Then apply the explode function:

df3 = df2.select(col('id'), explode(df2.list))

df3.show(5,False)
+---+--------------+
|id |col           |
+---+--------------+
|001|[{"index": 1} |
|001| {"index": 2}]|
|002|[{"index": 3} |
|002| {"index": 4}]|
+---+--------------+

After that, extract the digits from the exploded column with a regular expression:

df4 = df3.select(col('id'), regexp_extract('col', r'(\d+)', 1).alias('no_only'))

df4.show(5,False)
+---+-------+
|id |no_only|
+---+-------+
|001|1      |
|001|2      |
|002|3      |
|002|4      |
+---+-------+
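
The split, explode, and regex steps above can be mimicked in plain Python to see what they do to the raw string, and where they might break. This is only a sketch under assumptions: the helper name `extract_numbers` and the extra row with a `size` field are invented for illustration and do not appear in the question.

```python
import re

# Sample rows copied from the question: (id, JSON string)
rows = [
    ("001", '[{"index": 1}, {"index": 2}]'),
    ("002", '[{"index": 3}, {"index": 4}]'),
]

def extract_numbers(rows):
    """Split each string on commas, then keep the first run of digits per piece."""
    out = []
    for row_id, data in rows:
        # Splitting on "," and looping mirrors split(...) followed by explode(...)
        for piece in data.split(","):
            # Mirrors regexp_extract(col, '(\d+)', 1); empty string when nothing matches
            match = re.search(r"(\d+)", piece)
            out.append((row_id, match.group(1) if match else ""))
    return out

result = extract_numbers(rows)
print(result)

# Hypothetical row with a second numeric field: the comma split cuts the
# object in two, so one record yields two output rows.
fragile = extract_numbers([("003", '[{"index": 5, "size": 9}]')])
print(fragile)
```

As in the answer's output, the extracted values are strings, not integers. The second call suggests the string-based approach is only safe when every object holds a single numeric field; parsing the JSON properly avoids that limitation.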

Creating a new DataFrame from a DataFrame whose column contains JSON data
