I have the following dataframe:
>>> df.printSchema()
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
|-- D: string (nullable = true)
|-- T: string (nullable = true)
|-- S: string (nullable = true)
|-- P: string (nullable = true)
The F column is in dictionary format:
{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04",...}
I need to parse the F column as follows and create two new columns, P and N:
P1 => "1:0.01"
P2 => "3:0.03,4:0.04"
and so on
+----+-----+---------------+----+----+----+----+
| I  | P   | N             | D  | T  | S  | P  |
+----+-----+---------------+----+----+----+----+
| i1 | p1  | 1:0.01        | d1 | t1 | s1 | p1 |
| i1 | p2  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i1 | p3  | 3:0.03,4:0.04 | d1 | t1 | s1 | p1 |
| i2 | ... | ...           | d2 | t2 | s2 | p2 |
+----+-----+---------------+----+----+----+----+
Any suggestions on how to do this in PySpark?
Answer 0 (score: 1)
Try this:
from pyspark.sql import functions as F
df = spark.createDataFrame([('id01', '{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}')], ['I', 'F'])
df.printSchema()
df.show(truncate=False)
As you can see, the schema and data are the same as in your post.
root
|-- I: string (nullable = true)
|-- F: string (nullable = true)
+----+---------------------------------------------------------+
|I |F |
+----+---------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|
+----+---------------------------------------------------------+
# remove '{' and '}'
df = df.withColumn('array', F.regexp_replace('F', r'\{', ''))
df = df.withColumn('array', F.regexp_replace('array', r'\}', ''))
# replace the comma with '#' between each sub-dict so we can split on them
df = df.withColumn('array', F.regexp_replace('array', '","', '"#"' ))
df = df.withColumn('array', F.split('array', '#'))
df.show(truncate=False)
Here is the intermediate result:
+----+---------------------------------------------------------+-----------------------------------------------------------+
|I |F |array |
+----+---------------------------------------------------------+-----------------------------------------------------------+
|id01|{"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04"}|["P1":"1:0.01", "P2":"3:0.03,4:0.04", "P3":"3:0.03,4:0.04"]|
+----+---------------------------------------------------------+-----------------------------------------------------------+
# generate one row for each element in the array
df = df.withColumn('exploded', F.explode(df['array']))
# Need to distinguish ':' in the dict and in the value
df = df.withColumn('exploded', F.regexp_replace('exploded', '":"', '"#"' ))
df = df.withColumn('exploded', F.split('exploded', '#'))
# extract the name and value
df = df.withColumn('P', F.col('exploded')[0])
df = df.withColumn('N', F.col('exploded')[1])
df.select('I', 'exploded', 'P', 'N').show(truncate=False)
Final output:
+----+-----------------------+----+---------------+
|I |exploded |P |N |
+----+-----------------------+----+---------------+
|id01|["P1", "1:0.01"] |"P1"|"1:0.01" |
|id01|["P2", "3:0.03,4:0.04"]|"P2"|"3:0.03,4:0.04"|
|id01|["P3", "3:0.03,4:0.04"]|"P3"|"3:0.03,4:0.04"|
+----+-----------------------+----+---------------+
Answer 1 (score: 0)
This is how I finally solved it:
from pyspark.sql.functions import UserDefinedFunction, col, split, explode
from pyspark.sql.types import ArrayType, StringType

# This method replaces '","' with '";"' to distinguish the commas that
# separate entries from the commas inside values, so we can split on ';'
def _comma_replacement(val):
    if val:
        val = val.replace('","', '";"').replace('{', '').replace('}', '')
    return val

replacing = UserDefinedFunction(lambda x: _comma_replacement(x))
new_df = df.withColumn("F", replacing(col("F")))
new_df = new_df.withColumn("F", split(col("F"), ";").cast(ArrayType(StringType())))
exploded_df = new_df.withColumn("F", explode("F"))
df_sep = exploded_df.withColumn("F", split(col("F"), '":"').cast(ArrayType(StringType())))
dff = df_sep.withColumn("P", df_sep["F"].getItem(0))
dff_new = dff.withColumn("N", dff["F"].getItem(1))
dff_new = dff_new.drop('F')
Using another UDF, I removed the extra characters left over from the string manipulation.
The answer above uses the same approach. The key idea is to distinguish the commas between components from the commas inside them; I do this in the _comma_replacement(val) method called from a UDF, whereas the answer above uses regexp_replace, which can be better optimized.
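Since F is valid JSON, the comma disambiguation and the quote clean-up can also both be avoided in plain Python: `json.loads` already knows which commas separate entries and strips the surrounding quotes for free. A minimal sketch of that parsing step (the `parse_f` helper is hypothetical, shown only to illustrate the idea; it could be wrapped in a UDF the same way as `_comma_replacement`):

```python
import json

def parse_f(val):
    """Parse the F column string into (key, value) pairs.

    json.loads handles the commas inside values correctly and
    removes the quotes, so no clean-up pass is needed afterwards.
    """
    if not val:
        return []
    return list(json.loads(val).items())

pairs = parse_f('{"P1":"1:0.01","P2":"3:0.03,4:0.04"}')
# → [('P1', '1:0.01'), ('P2', '3:0.03,4:0.04')]
```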