I have the following dataframe:
>>> df.printSchema()
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
|-- D: string (nullable = true)
|-- T: string (nullable = true)
|-- S: string (nullable = true)
|-- P: string (nullable = true)
The F column is in dictionary format:
{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04",...}
I need to parse the F column as follows and create two new columns, P and N:
P1 => "1:0.01"
P2 => "3:0.03,4:0.04"
and so on
+----+-----+---------------+----+----+----+----+
| I  | P   | N             | D  | T  | S  | P  |
+----+-----+---------------+----+----+----+----+
| i1 | p1  | 1:0.01        | d1 | t1 | s1 | p1 |
| i1 | p2  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i1 | p3  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i2 | ... | ...           | d2 | t2 | s2 | p2 |
+----+-----+---------------+----+----+----+----+
Any suggestions on how to do this in PySpark?
Answer 0 (score: 1)
Try this:
from pyspark.sql import functions as F
df = spark.createDataFrame([('id01', '{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}')], ['I', 'F'])
df.printSchema()
df.show(truncate=False)
As you can see, the schema and data are the same as in your post.
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
+----+---------------------------------------------------------+
|I |F |
+----+---------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|
+----+---------------------------------------------------------+
# remove '{' and '}'
df = df.withColumn('array', F.regexp_replace('F', r'\{', ''))
df = df.withColumn('array', F.regexp_replace('array', r'\}', ''))
# replace the comma with '#' between each sub-dict so we can split on them
df = df.withColumn('array', F.regexp_replace('array', '","', '"#"' ))
df = df.withColumn('array', F.split('array', '#'))
df.show(truncate=False)
Here is the intermediate result:
+----+---------------------------------------------------------+-----------------------------------------------------------+
|I |F |array |
+----+---------------------------------------------------------+-----------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|["P1":"1:0.01", "P2":"3:0.03,4:0.04", "P3":"3:0.03,4:0.04"]|
+----+---------------------------------------------------------+-----------------------------------------------------------+
# generate one row for each element in the array
df = df.withColumn('exploded', F.explode(df['array']))
# Need to distinguish ':' in the dict and in the value
df = df.withColumn('exploded', F.regexp_replace('exploded', '":"', '"#"' ))
df = df.withColumn('exploded', F.split('exploded', '#'))
# extract the name and value
df = df.withColumn('P', F.col('exploded')[0])
df = df.withColumn('N', F.col('exploded')[1])
df.select('I', 'exploded', 'P', 'N').show(truncate=False)
Final output:
+----+-----------------------+----+---------------+
|I |exploded |P |N |
+----+-----------------------+----+---------------+
|id01|["P1", "1:0.01"] |"P1"|"1:0.01" |
|id01|["P2", "3:0.03,4:0.04"]|"P2"|"3:0.03,4:0.04"|
|id01|["P3", "3:0.03,4:0.04"]|"P3"|"3:0.03,4:0.04"|
+----+-----------------------+----+---------------+
Answer 1 (score: 0)
This is how I finally solved it:
from pyspark.sql.functions import UserDefinedFunction, col, split, explode
from pyspark.sql.types import ArrayType, StringType

# This method replaces '","' with '";"' to distinguish the commas that
# separate entries from the commas inside values, so we can split on ';'
def _comma_replacement(val):
    if val:
        val = val.replace('","', '";"').replace('{', '').replace('}', '')
    return val

replacing = UserDefinedFunction(lambda x: _comma_replacement(x))
new_df = df.withColumn("F", replacing(col("F")))
new_df = new_df.withColumn("F", split(col("F"), ";").cast(ArrayType(StringType())))
exploded_df = new_df.withColumn("F", explode("F"))
df_sep = exploded_df.withColumn("F", split(col("F"), '":"').cast(ArrayType(StringType())))
dff = df_sep.withColumn("P", df_sep["F"].getItem(0))
dff_new = dff.withColumn("N", dff["F"].getItem(1))
dff_new = dff_new.drop('F')
Using another UDF, I removed the extra characters left over from the string manipulation.
The answer above uses the same approach. The key idea is to distinguish the commas between components from the commas inside them; I do this in the _comma_replacement(val) method called from a UDF, whereas the answer above uses regexp_replace, which can be better optimized.
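Since F is valid JSON, the comma disambiguation and the quote clean-up can also both be avoided in plain Python: `json.loads` already knows which commas separate entries and strips the surrounding quotes for free. A minimal sketch of that parsing step (the `parse_f` helper is hypothetical, shown only to illustrate the idea; it could be wrapped in a UDF the same way as `_comma_replacement`):

```python
import json

def parse_f(val):
    """Parse the F column string into (key, value) pairs.

    json.loads handles the commas inside values correctly and
    removes the quotes, so no clean-up pass is needed afterwards.
    """
    if not val:
        return []
    return list(json.loads(val).items())

pairs = parse_f('{"P1":"1:0.01","P2":"3:0.03,4:0.04"}')
# → [('P1', '1:0.01'), ('P2', '3:0.03,4:0.04')]
```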