Converting a string-type column to a struct column in PySpark

Time: 2019-05-03 19:25:17

Tags: json dataframe pyspark

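I have a DataFrame with a nested structure, so I am sure the column should be a StructType. However, because the data was converted from JSON, the schema was inferred as string instead of struct. I want the column to keep its structure. How should I fix this?

The DataFrame currently looks like this: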

    +-------------+------------------------------------------------------+
    | issuingClub | memberCards                                          |
    +-------------+------------------------------------------------------+
    | 1234        | [{u'createdClub': u'1234', u'cardStatus': u'ACTIVE', |
    |             |  u'issuedReason': u'new member', u'cardType':        |
    |             |  u'MEMBERSHIPACCOUNT', u'cardNumber':u'109214092'}]  |
    +-------------+------------------------------------------------------+
    | 3712        | [{u'createdClub': u'3712', u'cardStatus': u'EXPIRE', |
    |             |  u'issuedReason': u'old member', u'cardType':        |
    |             |  u'MEMBERSHIPACCOUNT', u'cardNumber':u'109214092'}]  |
    +-------------+------------------------------------------------------+

The schema of this DataFrame is inferred as:

    root
     |-- issuingClub: string (nullable = true)
     |-- memberCards: string (nullable = true)

I don't want memberCards to be a string; I want it converted to a StructType. How can I do that? Please help!

I tried the following code:

    import json

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType


    def parse_json(array_str):
        # Parse the memberCards string and emit one tuple per card.
        json_obj = json.loads(array_str)
        for item in json_obj:
            yield (item["createdClub"], item["cardStatus"], item["issuedReason"],
                   item["cardType"], item["cardNumber"])


    # Intended type for memberCards: an array of structs with string fields.
    json_schema = ArrayType(StructType([
        StructField("u'createdClub", StringType(), nullable=False),
        StructField("u'cardStatus", StringType(), nullable=False),
        StructField("u'issuedReason", StringType(), nullable=False),
        StructField("u'cardType", StringType(), nullable=False),
        StructField("u'cardNumber", StringType(), nullable=False),
    ]))

    udf_parse_json = udf(lambda s: parse_json(s), json_schema)

    df_new = old_df.select(delta_final_df["issuingClub"],
                           udf_parse_json(delta_final_df["memberCards"]).alias("memberCards"))

This fails with the following error:

ValueError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
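
The error position (char 2) lines up with the first `u` in `[{u'createdClub'...`. That suggests the column holds the Python repr of a list of dicts (single quotes, `u` prefixes) rather than JSON, which `json.loads` cannot parse. Below is a minimal sketch of one possible workaround using `ast.literal_eval` from the standard library. It assumes every cell looks like the sample rows above. The helper name `parse_member_cards` and the constant `CARD_FIELDS` are made up for illustration and do not come from the original code.

```python
import ast
import json

# Sample cell value copied from the memberCards column shown in the question.
sample = (
    "[{u'createdClub': u'1234', u'cardStatus': u'ACTIVE', "
    "u'issuedReason': u'new member', u'cardType': u'MEMBERSHIPACCOUNT', "
    "u'cardNumber': u'109214092'}]"
)

# Assumed field order for each struct; plain names, without the u' prefix.
CARD_FIELDS = ("createdClub", "cardStatus", "issuedReason", "cardType", "cardNumber")


def parse_member_cards(array_str):
    """Hypothetical helper: parse a Python-literal list of dicts into a list of tuples."""
    if array_str is None:
        return None
    # literal_eval only evaluates literals (lists, dicts, strings, numbers),
    # so it reads the repr text without executing arbitrary code.
    cards = ast.literal_eval(array_str)
    # Return a concrete list (not a generator), one tuple per card.
    return [tuple(card.get(name) for name in CARD_FIELDS) for card in cards]


# json.loads rejects the same text because the keys are not double-quoted.
try:
    json.loads(sample)
    json_error = None
except ValueError as exc:
    json_error = str(exc)

parsed = parse_member_cards(sample)
print(json_error)
print(parsed)
```

If this matches the real data, the helper could probably be wrapped as `udf(parse_member_cards, json_schema)`, with `json_schema` declaring plain field names such as `createdClub` instead of `u'createdClub`. It returns a list of tuples because a UDF declared with an `ArrayType(StructType(...))` return type needs a concrete sequence whose tuple positions follow the struct field order.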

0 answers:

No answers yet