Cartesian product with a JSON sub-branch in Spark

Time: 2018-01-18 00:53:20

Tags: scala apache-spark pyspark apache-spark-sql spark-dataframe

I want to expand each row into multiple rows based on a JSON sub-branch, one row per key in the nested object.

For example, given this input:

{"attr1" : "attrValue1",
"attr2" : "attrValue2",
"properties": {
    "prop1" : "propValue1",
    "prop2" : "propValue2"
    }
}

Resulting DataFrame:

attr1      | attr2      | propertyKey | propertyValue
-----------|------------|-------------|--------------
attrValue1 | attrValue2 | prop1       | propValue1
attrValue1 | attrValue2 | prop2       | propValue2

2 answers:

Answer 0 (score: 1)

Suppose you have a DataFrame in which properties is a map column:

df.show()
+----------+----------+--------------------+
|     attr1|     attr2|          properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|Map(prop2 -> prop...|
+----------+----------+--------------------+

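Because properties is a map, you can apply the explode function to it and pass two names to alias. This produces one row per map entry, with one column holding the key and the other holding the value: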

PySpark:

import pyspark.sql.functions as F
df.select('*', F.explode(df.properties).alias('propertyKey', 'propertyValue')).drop('properties').show()
+----------+----------+-----------+-------------+
|     attr1|     attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2|      prop2|   propValue2|
|attrValue1|attrValue2|      prop1|   propValue1|
+----------+----------+-----------+-------------+
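
To make the row expansion concrete, here is a minimal plain-Python model of what explode does to a map column. It is only an illustration of the semantics, not Spark code; the helper name explode_map and its parameters are hypothetical and do not come from the answer.

```python
def explode_map(record, map_field="properties",
                key_col="propertyKey", value_col="propertyValue"):
    """Return one flat dict per key/value pair found in record[map_field]."""
    # Columns other than the map are copied unchanged into every output row.
    base = {name: value for name, value in record.items() if name != map_field}
    rows = []
    for key, value in record[map_field].items():
        row = dict(base)
        row[key_col] = key
        row[value_col] = value
        rows.append(row)
    return rows


# The record from the question, represented as a plain dict.
record = {
    "attr1": "attrValue1",
    "attr2": "attrValue2",
    "properties": {"prop1": "propValue1", "prop2": "propValue2"},
}

for row in explode_map(record):
    print(row)
```

As with Spark's explode, a record whose map is empty contributes no output rows in this sketch. In Spark, explode_outer can be used instead when such rows should be kept.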

Answer 1 (score: 0)

Hope this helps!

import json

# sample data: convert the JSON to a DataFrame
# (sc and sqlContext are assumed to be the objects predefined in the pyspark shell)
js = {"attr1" : "attrValue1",
      "attr2" : "attrValue2",
      "properties": {
              "prop1" : "propValue1",
              "prop2" : "propValue2"
              }
      }
df = sqlContext.read.json(sc.parallelize([json.dumps(js)]))
df.show()

# convert the DataFrame above to the desired format
# wide format: "properties.*" expands each field of the struct into its own column
df = df.select("*", "properties.*").drop("properties")
df.show()

# long format: stack(2, ...) emits 2 rows per input row, one per (key, value) pair
df = df.selectExpr("attr1", "attr2", "stack(2, 'prop1', prop1, 'prop2', prop2) as (propertyKey, propertyValue)")
df.show()

Sample data:

+----------+----------+--------------------+
|     attr1|     attr2|          properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|[propValue1,propV...|
+----------+----------+--------------------+

Wide format data:

+----------+----------+----------+----------+
|     attr1|     attr2|     prop1|     prop2|
+----------+----------+----------+----------+
|attrValue1|attrValue2|propValue1|propValue2|
+----------+----------+----------+----------+

Output data (long format):

+----------+----------+-----------+-------------+
|     attr1|     attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2|      prop1|   propValue1|
|attrValue1|attrValue2|      prop2|   propValue2|
+----------+----------+-----------+-------------+
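
The stack expression above hardcodes prop1 and prop2. If the list of property names is known or can be collected first, the expression string could be generated instead of typed by hand. The following is a minimal sketch under that assumption; build_stack_expr is a hypothetical helper, not part of Spark, and it assumes the field names contain no quote or backtick characters.

```python
def build_stack_expr(fields, key_col="propertyKey", value_col="propertyValue"):
    """Build a Spark SQL stack(...) expression that unpivots the given columns."""
    if not fields:
        raise ValueError("at least one field name is required")
    # Each field contributes a pair: its name as a string literal, then the
    # column itself (backtick-quoted so names with dots or spaces still parse).
    pairs = ", ".join("'{0}', `{0}`".format(name) for name in fields)
    return "stack({0}, {1}) as ({2}, {3})".format(
        len(fields), pairs, key_col, value_col
    )


expr = build_stack_expr(["prop1", "prop2"])
print(expr)
```

With the wide DataFrame from this answer, the result could be passed straight to selectExpr, for example df.selectExpr("attr1", "attr2", build_stack_expr(["prop1", "prop2"])).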