PySpark: nested columns in a string

Asked: 2018-07-12 10:53:27

Tags: apache-spark pyspark apache-spark-sql

I'm working with PySpark. I have a DataFrame loaded from a CSV with the following schema:

root
 |-- id: string (nullable = true)
 |-- date: date (nullable = true)
 |-- users: string (nullable = true)

If I show the first two rows, it looks like this:

+---+----------+---------------------------------------------------+
| id|      date|users                                              |
+---+----------+---------------------------------------------------+
|  1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
|  2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}       |
+---+----------+---------------------------------------------------+

I want to create a new DataFrame with the 'users' string broken down into each of its elements. I want something like:

id  user_id     user_product
1   1           xxx
1   1           yyy
1   1           zzz
1   2           aaa
1   2           bbb
1   3           <null>
2   1           uuu

etc...

I have tried many approaches but can't seem to get it working. The closest I have got is to define a schema like the one below and use from_json to create a new df applying that schema:

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])

user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))

This returns the correct schema:

root
 |-- id: string (nullable = true)
 |-- test: struct (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- product_list: struct (nullable = true)
 |    |    |-- product: string (nullable = true)

But when I show any part of the 'test' struct, it returns nulls instead of values, e.g.

user_df.select('test.user_id').show()

returns test.user_id:

+-------+
|user_id|
+-------+
|   null|
|   null|
+-------+

Perhaps I shouldn't be using from_json, since the users string is not pure JSON. Any pointers on a direction I could take?

2 Answers:

Answer 0 (score: 1)

The schema should match the shape of the data. Unfortunately, from_json supports only StructType(...) and ArrayType(StructType(...)), which won't help here unless you can guarantee that all records have the same set of keys. Your userSchema expects top-level fields literally named user_id and product_list, while the actual keys are "1", "2" and "3", which is why from_json gives you nulls.

Instead, you can use a UserDefinedFunction:

import json
from pyspark.sql.functions import explode, udf

df = spark.createDataFrame([
    (1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
    (2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
    ("id", "date", "users")
)


@udf("map<string, array<string>>")
def parse(s):
    # Return the parsed map, or null if the input is missing or malformed
    # (json.JSONDecodeError is a subclass of ValueError)
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        return None

(df
     .select("id", "date", 
             explode(parse("users")).alias("user_id", "user_product"))
     .withColumn("user_product", explode("user_product"))
     .show())
# +---+----------+-------+------------+
# | id|      date|user_id|user_product|
# +---+----------+-------+------------+
# |  1|2017-12-03|      1|         xxx|
# |  1|2017-12-03|      1|         yyy|
# |  1|2017-12-03|      1|         zzz|
# |  1|2017-12-03|      2|         aaa|
# |  1|2017-12-03|      2|         bbb|
# |  2|2017-12-04|      1|         uuu|
# |  2|2017-12-04|      1|         yyy|
# |  2|2017-12-04|      1|         zzz|
# |  2|2017-12-04|      2|         aaa|
# +---+----------+-------+------------+
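On a recent enough Spark this may also work without the UDF: the from_json documentation for later releases lists MapType among the supported schemas, in which case you can parse the string directly. A minimal sketch, assuming a Spark version where from_json accepts a MapType schema (the 2.4 docs mention it):

from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, MapType, StringType

# Parse users straight into map<string, array<string>>, then explode as above
user_schema = MapType(StringType(), ArrayType(StringType()))

(df
    .select("id", "date",
            explode(from_json("users", user_schema)).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())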

Answer 1 (score: 0)

You don't need to use from_json. You have to explode twice: once to split the users map into (user_id, list) pairs, and once to unpack each list:

import pyspark.sql.functions as F

df = spark.createDataFrame([
        (1,'2017-12-03',{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} ),
        (2,'2017-12-04',{"1":["uuu","yyy","zzz"],"2":["aaa"],      "3":[]} )],
        ['id','date','users']
    )

df = df.select('id','date',F.explode('users').alias('user_id','users'))\
       .select('id','date','user_id',F.explode('users').alias('users'))

df.show()

+---+----------+-------+-----+
| id|      date|user_id|users|
+---+----------+-------+-----+
|  1|2017-12-03|      1|  xxx|
|  1|2017-12-03|      1|  yyy|
|  1|2017-12-03|      1|  zzz|
|  1|2017-12-03|      2|  aaa|
|  1|2017-12-03|      2|  bbb|
|  2|2017-12-04|      1|  uuu|
|  2|2017-12-04|      1|  yyy|
|  2|2017-12-04|      1|  zzz|
|  2|2017-12-04|      2|  aaa|
+---+----------+-------+-----+
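One caveat that applies to both answers: explode drops entries whose list is empty, so the "3" key never appears in the output even though the desired result has a 3 / <null> row. If you need those rows, explode_outer (available since Spark 2.2) keeps them as nulls. A sketch of the same pipeline with that swap on the second step:

import pyspark.sql.functions as F

# explode_outer emits a null row for an empty list instead of dropping it
df = df.select('id','date',F.explode('users').alias('user_id','users'))\
       .select('id','date','user_id',F.explode_outer('users').alias('users'))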