How do I explode an array-of-JSON column in a PySpark dataframe?

Asked: 2020-01-14 15:26:44

Tags: apache-spark pyspark

I am trying to explode this column into multiple columns, but something seems wrong with its data type, even though I have declared it as an array type.

Here is what the column looks like:

        Column_x

[[{"Key":"a","Value":"40000.0"},{"Key":"b","Value":"0.0"},{"Key":"c","Value":"0.0"},{"Key":"f","Value":"false"},{"Key":"e","Value":"ADB"},{"Key":"d","Value":"true"}]]

[[{"Key":"a","Value":"100000.0"},{"Key":"b","Value":"1.5"},{"Key":"c","Value":"1.5"},{"Key":"d","Value":"false"},{"Key":"e","Value":"Rev30"},{"Key":"f","Value":"true"},{"Key":"g","Value":"48600.0"},{"Key":"g","Value":"0.0"},{"Key":"h","Value":"0.0"}],[{"Key":"i","Value":"100000.0"},{"Key":"j","Value":"1.5"},{"Key":"k","Value":"1.5"},{"Key":"l","Value":"false"},{"Key":"m","Value":"Rev30"},{"Key":"n","Value":"true"},{"Key":"o","Value":"48600.0"},{"Key":"p","Value":"0.0"},{"Key":"q","Value":"0.0"}]]

And I want to turn it into something like this:

Key   Value
a     10000
b     200000
.
.
.
.
a     100000.0
b     1.5

Here is what I have so far:

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import *

schema = ArrayType(ArrayType(StructType([StructField("Key", StringType()),
                                         StructField("Value", StringType())])))

kn_sx = kn_s\
  .withColumn("Keys", F.explode(F.from_json("Column_x", schema)))\
  .withColumn("Key", col("Keys.Key"))\
  .withColumn("Values", F.explode(F.from_json("Column_x", schema)))\
  .withColumn("Value", col("Values.Value"))\
  .drop("Values")

This is the error:

AnalysisException: u"cannot resolve 'jsontostructs(`Column_x`)' due to data type mismatch: argument 1 requires string type, however, '`Column_x`' is of array<array<struct<Key:string,Value:string>>> type
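(Editor's note: the mismatch arises because `from_json` expects a string column to parse, but `Column_x` is already of type `array<array<struct<Key:string,Value:string>>>`, so there is nothing left to parse. What the intended double explode would produce can be sketched in plain Python, with `json.loads` standing in for `from_json` and the two loops standing in for exploding the outer and inner arrays; the sample row is taken from the question.)

```python
import json

# First sample row from the question, truncated to three structs.
row = ('[[{"Key":"a","Value":"40000.0"},'
       '{"Key":"b","Value":"0.0"},'
       '{"Key":"c","Value":"0.0"}]]')

# json.loads plays the role of from_json on a *string*; the two loops
# play the role of exploding the outer array, then the inner array.
pairs = [(item["Key"], item["Value"])
         for inner in json.loads(row)
         for item in inner]
print(pairs)  # [('a', '40000.0'), ('b', '0.0'), ('c', '0.0')]
```

In Spark itself, since the column is already an array, the fix is to drop `from_json` entirely and apply `F.explode` twice, once per nesting level.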

I would really appreciate any help.

2 answers:

Answer 0 (score: 0)

See this for the documentation of get_json_object:

>>> from pyspark.sql.functions import get_json_object
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = spark.createDataFrame(data, ("key", "jstring"))
>>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), \
...                   get_json_object(df.jstring, '$.f2').alias("c1") ).collect()
[Row(key=u'1', c0=u'value1', c1=u'value2'), Row(key=u'2', c0=u'value12', c1=None)]
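(Editor's note: `get_json_object` pulls a field out of a JSON *string* column by path, returning null when the path is absent, which is why `c1` is `None` for key `2` above. The lookup can be mimicked in plain Python with the same sample data; `dict.get` stands in for the null-on-missing behavior. Note that, like `from_json`, this function requires a string argument, so it does not apply directly to the question's already-parsed array column.)

```python
import json

data = [("1", '{"f1": "value1", "f2": "value2"}'),
        ("2", '{"f1": "value12"}')]

# dict.get returns None for a missing field, mirroring the null that
# get_json_object returns for an absent JSON path such as $.f2.
rows = [(key, json.loads(js).get("f1"), json.loads(js).get("f2"))
        for key, js in data]
print(rows)
# [('1', 'value1', 'value2'), ('2', 'value12', None)]
```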

Answer 1 (score: 0)

Here is what worked for me:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Pull out a single inner array element
# (extend .getItem() for additional array elements)
df = df.withColumn('Column_x', F.col('Column_x.MetaData.Parameters').getItem(0))

# Explode the remaining array into one row per Key/Value struct
df = df\
  .withColumn("Keys", F.explode(F.col("Column_x")))\
  .withColumn("Key", col("Keys.Key"))\
  .withColumn("Value", col("Keys.Value"))\
  .drop("Keys")\
  .dropDuplicates()
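(Editor's note: the chain above can be illustrated with a plain-Python equivalent. This is a sketch only; the `MetaData.Parameters` path is specific to the answerer's data and is assumed here to hold the nested array shown in the question, truncated to two structs per inner array.)

```python
import json

# Parsed value of one Column_x cell (second sample row, truncated).
column_x = json.loads(
    '[[{"Key":"a","Value":"100000.0"},{"Key":"b","Value":"1.5"}],'
    '[{"Key":"i","Value":"100000.0"},{"Key":"j","Value":"1.5"}]]')

# .getItem(0) keeps only the first inner array ...
first = column_x[0]
# ... explode turns each struct into its own row, and dropDuplicates
# removes repeats (dict.fromkeys deduplicates while keeping order).
pairs = list(dict.fromkeys((d["Key"], d["Value"]) for d in first))
print(pairs)  # [('a', '100000.0'), ('b', '1.5')]
```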

I hope this helps anyone looking for an answer to this question.