I have a dataset that looks like this:
~ ❯ head example.csv
ix,value
1,{"abc": {"name": "bob", "profession": "engineer"}}
2,{"def": {"name": "sarah", "profession": "scientist"}, "ghi": {"name": "matt", "profession": "doctor"}}
The value column contains JSON blobs. As you can see, each JSON blob is itself of the form {A: B}, where A is a random/arbitrary string and B is a relatively well-formed JSON object.
What I would like to get out of this is:
ix,names,professions
1,[bob],[engineer]
2,[sarah,matt],[scientist,doctor]
And then, exploded:
ix,name,profession
1,bob,engineer
2,sarah,scientist
2,matt,doctor
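Outside Spark, the transformation I'm after is easy to state in plain Python; here is a sketch using only the json module (the sample rows are inlined for illustration):

```python
import json

rows = [
    (1, '{"abc": {"name": "bob", "profession": "engineer"}}'),
    (2, '{"def": {"name": "sarah", "profession": "scientist"}, '
        '"ghi": {"name": "matt", "profession": "doctor"}}'),
]

exploded = []
for ix, blob in rows:
    # The outer key A is arbitrary, so iterate over the values only.
    for entry in json.loads(blob).values():
        exploded.append((ix, entry["name"], entry["profession"]))

# exploded == [(1, 'bob', 'engineer'), (2, 'sarah', 'scientist'), (2, 'matt', 'doctor')]
```

This is exactly the logic I want to express with native Spark functions rather than a UDF.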
Since I don't know the possible keys A in advance, I'm having trouble parsing the JSON blob: either as a StructType (I can't enumerate all the possible keys) or as a MapType (which from_json does not support):
>>> rdd.withColumn('parsed', F.from_json(F.col('value'), MapType(StringType(), MapType(StringType(), StringType(), False), False)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/gberger/Projects/spark/python/pyspark/sql/dataframe.py", line 1800, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "/Users/gberger/Projects/spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/Users/gberger/Projects/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema map<string,map<string,string>> must be a struct or an array of structs.;;\n'Project [id#35, value#36, jsontostructs(MapType(StringType,MapType(StringType,StringType,false),false), value#36, Some(Europe/London)) AS parsed#46]\n+- Relation[id#35,value#36] csv\n"
I know I could use a UDF, but that would badly hurt performance; I'd like to stay with native Spark functionality as much as possible.
Answer 0 (score: -1)
You can use something like this:
First, define your schema:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

jsonSchema = StructType([
    StructField("name", StringType(), True),
    StructField("profession", StringType(), True),
])

df = df.withColumn("value", from_json(df["value"], jsonSchema))
Select the JSON attributes and form the dataframe:
df = df.select("value.name", "value.profession")
Hope this answers your query.