PySpark: explode a column into a new DataFrame

Time: 2020-04-23 17:45:12

Tags: python pyspark

I have a PySpark DataFrame with this schema:

 |-- doc_id: string (nullable = true)     
 |-- msp_contracts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _el1: string (nullable = true)
 |    |    |-- _el2: long (nullable = true)
 |    |    |-- _el3: string (nullable = true)
 |    |    |-- _el4: string (nullable = true)
 |    |    |-- _el5: string (nullable = true)

How can I get a DataFrame with this schema:

|-- doc_id: string (nullable = true)
|-- _el1: string (nullable = true)
|-- _el3: string (nullable = true)
|-- _el4: string (nullable = true)
|-- _el5: string (nullable = true)

I tried this select:

explode('msp_contracts').select(
 col(u'msp_contracts.element._el1'),
 col(u'msp_contracts.element._el2')
)

But I get an error:

'Column' object is not callable

2 answers:

Answer 0 (score: 2)

After explode('msp_contracts'), Spark names the resulting column col unless you give it an alias.

from pyspark.sql.functions import col, explode

df.select("doc_id",explode("msp_contracts")).show()
#+------+---+
#|doc_id|col|
#+------+---+
#|     1|[1]|
#+------+---+

To select _el1, go through that col column. Try df.select("doc_id",explode("msp_contracts")).select("doc_id",col(u"col._el1")).show()
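
To make the row semantics concrete, here is a plain-Python sketch (no Spark involved) of what explode does to the data: each element of the array becomes its own output row, stored in a column that is called col by default. The helper name explode_rows and the sample values are made up for illustration. Empty or missing arrays produce no rows, which matches explode (explode_outer would keep such rows).

```python
def explode_rows(rows, array_key, out_key="col"):
    """Plain-Python imitation of Spark's explode: one output row per array element."""
    out = []
    for row in rows:
        # A missing, None, or empty array contributes no rows at all.
        for element in row.get(array_key) or []:
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row[out_key] = element
            out.append(new_row)
    return out


# Hypothetical sample data shaped like the question's schema.
rows = [
    {"doc_id": "1", "msp_contracts": [{"_el1": "a"}, {"_el1": "b"}]},
    {"doc_id": "2", "msp_contracts": []},
]
exploded = explode_rows(rows, "msp_contracts")

# Field access then goes through the new column, like "col._el1" in Spark.
selected = [{"doc_id": r["doc_id"], "_el1": r["col"]["_el1"]} for r in exploded]
print(selected)
```

Document 1 yields two rows and document 2 yields none, which is why struct fields have to be reached through the exploded column rather than through the original array.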

Example:

jsn='{"doc_id":1,"msp_contracts":[{"_el1":1}]}'
df=spark.read.json(sc.parallelize([(jsn)]))

#schema
#root
# |-- doc_id: long (nullable = true)
# |-- msp_contracts: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- _el1: long (nullable = true)

df.withColumn("msp_contracts",explode(col("msp_contracts"))).\
select("doc_id","msp_contracts._el1").show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

UPDATE:

df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1").\
show()
#or
df.select("doc_id",explode("msp_contracts")).\
select("doc_id",col(u"col._el1")).\
show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

Answer 1 (score: 0)

This worked for me:

df.select("doc_id",explode("msp_contracts")).\
   select("doc_id","col._el1")

With an alias and custom column names:

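# now_f() is not defined in this post; it is presumably the author's own helper
# that returns a timestamp Column
df.select(
        'doc_id',
        explode('msp_contracts').alias("msp_contracts")
        )\
        .select(
            'doc_id',
            # field names follow the question's schema (_el1, _el2)
            col('msp_contracts._el1').alias('last_period_44fz_customer'),
            col('msp_contracts._el2').alias('last_period_44fz_customer_inn')
        )\
        .withColumn("load_dtm", now_f())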