I am using Spark 2.0.1 and Python 2.7 to modify and flatten some nested JSON data.
Raw data (JSON format)
{
"created" : "28-12-2001T12:02:01.143",
"class" : "Class_A",
"sub_class": "SubClass_B",
"properties": {
"meta" : "some-info",
...,
"interests" : {"key1": "value1", "key2": "value2", ..., "keyN": "valueN"}
}
}
Using the withColumn
and udf
functions I was able to flatten the raw data into a dataframe, as shown below
---------------------------------------------------------------------
| created | class | sub_class | meta | interests |
---------------------------------------------------------------------
|28-12-2001T12:02:01.143 | Class_A | SubClass_B |'some-info' | "{key1: 'value1', 'key2':'value2', ..., 'keyN':'valueN'}" |
---------------------------------------------------------------------
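The question does not show the udf itself; a minimal sketch of the parsing logic such a udf might wrap (plain Python, `flatten_record` is a hypothetical name, not from the question):

```python
import json

def flatten_record(raw):
    """Flatten one nested JSON record into the columns of the table above."""
    rec = json.loads(raw)
    props = rec.get("properties", {})
    return {
        "created": rec["created"],
        "class": rec["class"],
        "sub_class": rec["sub_class"],
        "meta": props.get("meta"),
        # keep interests as a dict so it can later be exploded
        "interests": props.get("interests", {}),
    }

raw = json.dumps({
    "created": "28-12-2001T12:02:01.143",
    "class": "Class_A",
    "sub_class": "SubClass_B",
    "properties": {"meta": "some-info",
                   "interests": {"key1": "value1", "key2": "value2"}},
})
row = flatten_record(raw)
```

Wrapped in a Spark udf, each output field would feed one withColumn call (or a struct column that is then selected apart).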
Now I want to transform/split this single row into multiple rows based on the interests column. How can I do that?
Expected output
---------------------------------------------------------------------
| created | class | sub_class | meta | key | value |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key1 | value1 |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key2 | value2 |
---------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | keyN | valueN |
---------------------------------------------------------------------
Thanks
Answer 0 (score: 0)
Use explode.
Here is a full example (mostly setting up the data):
import ast
import pandas as pd
import pyspark.sql.functions as sql
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()  # in the pyspark shell, sc already exists
sqlContext = SQLContext(sc)
s = "28-12-2001T12:02:01.143 | Class_A | SubClass_B |some-info| {'key1': 'value1', 'key2':'value2', 'keyN':'valueN'}"
data = s.split('|')
# parse the trailing dict string safely instead of using eval
data = data[:-1] + [ast.literal_eval(data[-1])]
p_df = pd.DataFrame(data).T
s_df = sqlContext.createDataFrame(p_df, schema=['created', 'class', 'sub_class', 'meta', 'interests'])
# explode turns each (key, value) pair of the map column into its own row
s_df.select(s_df.columns[:-1] + [sql.explode(s_df.interests).alias("key", "value")]).show()
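For intuition, explode on a map column does in Spark what this plain-Python expansion does for one record: one output row per (key, value) pair, with the other columns repeated (a sketch, not Spark code; the sample values are taken from the question):

```python
row = {"created": "28-12-2001T12:02:01.143", "class": "Class_A",
       "sub_class": "SubClass_B", "meta": "some-info",
       "interests": {"key1": "value1", "key2": "value2"}}

# Repeat the scalar columns once per interests entry; sort only to
# make the output order deterministic (Spark gives no order guarantee).
exploded = [
    {"created": row["created"], "class": row["class"],
     "sub_class": row["sub_class"], "meta": row["meta"],
     "key": k, "value": v}
    for k, v in sorted(row["interests"].items())
]
```

The select above keeps every column except interests and appends the two generated columns, matching the expected output table.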