Expand rows based on a JSON sub-branch.
For example:
{"attr1" : "attrValue1",
"attr2" : "attrValue2",
"properties": {
"prop1" : "propValue1",
"prop2" : "propValue2"
}
}
Resulting dataframe:
attr1 | attr2 | propertyKey | propertyValue
attrValue1 | attrValue2 | prop1 | propValue1
attrValue1 | attrValue2 | prop2 | propValue2
Answer 0 (score: 1)
Suppose you have a dataframe in which properties is a map column:
df.show()
+----------+----------+--------------------+
| attr1| attr2| properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|Map(prop2 -> prop...|
+----------+----------+--------------------+
You can use the explode function together with alias to create two columns, one holding the keys and the other holding the values. In pyspark:
import pyspark.sql.functions as F
df.select('*', F.explode(df.properties).alias('propertyKey', 'propertyValue')).drop('properties').show()
+----------+----------+-----------+-------------+
| attr1| attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2| prop2| propValue2|
|attrValue1|attrValue2| prop1| propValue1|
+----------+----------+-----------+-------------+
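To make the behaviour concrete, here is a plain-Python sketch (not Spark code) of what exploding a map column does to a single row: each key/value pair in the map becomes its own row, and the remaining columns are repeated. The helper name `explode_map_row` and its parameters are hypothetical, chosen only for illustration.

```python
def explode_map_row(row, map_col, key_name, value_name):
    """Return one output row per key/value pair in row[map_col].

    Hypothetical helper that mimics, for one dict-shaped row, the effect of
    selecting all columns plus the exploded map and dropping the map column.
    """
    # Every column except the map itself is copied into each output row.
    base = {k: v for k, v in row.items() if k != map_col}
    return [
        {**base, key_name: k, value_name: v}
        for k, v in row[map_col].items()
    ]


row = {
    "attr1": "attrValue1",
    "attr2": "attrValue2",
    "properties": {"prop1": "propValue1", "prop2": "propValue2"},
}
rows = explode_map_row(row, "properties", "propertyKey", "propertyValue")
for r in rows:
    print(r)
```

The sketch keeps the dict's insertion order. In the Spark output above prop2 comes before prop1, so the order of the exploded rows is probably not something to rely on.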
Answer 1 (score: 0)
Hope this helps!
import json
#sample data - convert JSON to dataframe
#(sqlContext and sc are assumed to exist already, as in the pyspark shell)
js = {"attr1" : "attrValue1",
"attr2" : "attrValue2",
"properties": {
"prop1" : "propValue1",
"prop2" : "propValue2"
}
}
df = sqlContext.read.json(sc.parallelize([json.dumps(js)]))
df.show()
#convert above dataframe to desired format
#wide format
df = df.select("*", "properties.*").drop("properties")
df.show()
#long format
df = df.selectExpr("attr1", "attr2", "stack(2, 'prop1', prop1, 'prop2', prop2) as (propertyKey, propertyValue)")
df.show()
Sample data:
+----------+----------+--------------------+
| attr1| attr2| properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|[propValue1,propV...|
+----------+----------+--------------------+
Wide-format data:
+----------+----------+----------+----------+
| attr1| attr2| prop1| prop2|
+----------+----------+----------+----------+
|attrValue1|attrValue2|propValue1|propValue2|
+----------+----------+----------+----------+
Output data (long format):
+----------+----------+-----------+-------------+
| attr1| attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2| prop1| propValue1|
|attrValue1|attrValue2| prop2| propValue2|
+----------+----------+-----------+-------------+
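The stack expression in this answer hard-codes both the number of properties and their names. If the properties are not known in advance, the expression string could be built from the column names instead. The sketch below is plain Python. `build_stack_expr` is a hypothetical helper, and it assumes simple column names (no quotes or backticks) and columns of a compatible type, since stack places all of their values in one column.

```python
def build_stack_expr(columns, key_name="propertyKey", value_name="propertyValue"):
    """Build a Spark SQL stack(...) expression that unpivots `columns`.

    Hypothetical helper: each column contributes a ('name', `name`) pair,
    so stack emits one (key, value) row per listed column.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # 'prop1', `prop1`, 'prop2', `prop2`, ...
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in columns)
    return "stack({n}, {pairs}) as ({k}, {v})".format(
        n=len(columns), pairs=pairs, k=key_name, v=value_name
    )


expr = build_stack_expr(["prop1", "prop2"])
print(expr)
```

The resulting string could then be passed to selectExpr in place of the literal, for example `df.selectExpr("attr1", "attr2", expr)`, with the property names read from something like `df.select("properties.*").columns` before the struct is flattened.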