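I have a DataFrame with a nested structure, so I am certain the column should be a StructType. However, because the DataFrame was converted from JSON, the schema was inferred as string rather than struct. I want it to keep its structure. How can I fix this?
The DataFrame currently looks like this: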
+-------------+--------------------------------------------------------+
| issuingClub | memberCards                                            |
+-------------+--------------------------------------------------------+
| 1234        | [{u'createdClub': u'1234', u'cardStatus': u'ACTIVE',   |
|             |   u'issuedReason': u'new member', u'cardType':         |
|             |   u'MEMBERSHIPACCOUNT', u'cardNumber': u'109214092'}]  |
+-------------+--------------------------------------------------------+
| 3712        | [{u'createdClub': u'3712', u'cardStatus': u'EXPIRE',   |
|             |   u'issuedReason': u'old member', u'cardType':         |
|             |   u'MEMBERSHIPACCOUNT', u'cardNumber': u'109214092'}]  |
+-------------+--------------------------------------------------------+
The schema of this DataFrame is inferred as:
root:
|-- issuingClub: string (nullable = true)
|-- memberCards: string (nullable = true)
I do not want memberCards to be a string; I want to convert it to a StructType. How can I do this? Please help!
I tried this code:
import json

def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["createdClub"], item["cardStatus"], item["issuedReason"], item["cardType"], item["cardNumber"])

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructType, StructField

json_schema = ArrayType(StructType([
    StructField("u'createdClub", StringType(), nullable=False),
    StructField("u'cardStatus", StringType(), nullable=False),
    StructField("u'issuedReason", StringType(), nullable=False),
    StructField("u'cardType", StringType(), nullable=False),
    StructField("u'cardNumber", StringType(), nullable=False)]))

from pyspark.sql.functions import udf

udf_parse_json = udf(lambda str: parse_json(str), json_schema)
df_new = old_df.select(delta_final_df["issuingClub"], udf_parse_json(delta_final_df["memberCards"]).alias("memberCards"))
But it fails with this error:
ValueError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
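A possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the memberCards values look like the Python repr of a list of dicts (single quotes, u'' prefixes), not JSON, so json.loads stops at the first u. Assuming that is the case, ast.literal_eval from the standard library can parse such strings. The helper name parse_member_cards and the FIELDS tuple below are hypothetical, introduced only for illustration. The sample string is copied from the first row of the table above.

```python
import ast
import json

# Sample value copied from the memberCards column shown in the question.
raw = ("[{u'createdClub': u'1234', u'cardStatus': u'ACTIVE', "
       "u'issuedReason': u'new member', u'cardType': u'MEMBERSHIPACCOUNT', "
       "u'cardNumber': u'109214092'}]")

# Field order for the resulting tuples (assumed from the keys in the sample).
FIELDS = ("createdClub", "cardStatus", "issuedReason", "cardType", "cardNumber")

def parse_member_cards(array_str):
    """Hypothetical helper: turn a Python-repr list of dicts into a list of tuples."""
    if array_str is None:
        return None
    # literal_eval only evaluates Python literals, so it accepts u'...' strings
    # and single quotes, which json.loads rejects.
    items = ast.literal_eval(array_str)
    # Build a concrete list (not a generator) of tuples in FIELDS order.
    return [tuple(item.get(name) for name in FIELDS) for item in items]

# json.loads rejects this input because it is not valid JSON.
try:
    json.loads(raw)
    json_ok = True
except ValueError:
    json_ok = False

cards = parse_member_cards(raw)
```

If this matches your data, the same function could presumably replace parse_json inside the UDF. Two further assumptions worth checking: it is probably safer for the UDF to return a list than to yield from a generator, and the StructField names would likely need to be plain ("createdClub", not "u'createdClub") for the resulting struct fields to be usable.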