pyspark

Date: 2017-03-29 09:54:56

Tags: pyspark

I am using Spark 2.0.1 with Python 2.7 to modify and flatten some nested JSON data.

Raw data (JSON format):

{
  "created": "28-12-2001T12:02:01.143",
  "class": "Class_A",
  "sub_class": "SubClass_B",
  "properties": {
     "meta": "some-info",
     ...,
     "interests": {"key1": "value1", "key2": "value2", ..., "keyN": "valueN"}
  }
}

Using withColumn and a udf I was able to flatten raw_data into a DataFrame like the one below (a rough sketch of such a flattening step follows the table):

-------------------------------------------------------------------------------------------------------------------------------
| created                 | class   | sub_class  | meta        | interests                                                     |
-------------------------------------------------------------------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | "{'key1': 'value1', 'key2': 'value2', ..., 'keyN': 'valueN'}" |
-------------------------------------------------------------------------------------------------------------------------------
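(For context, a minimal sketch of such a flattening step, assuming the raw file is read with spark.read.json and properties comes back as a struct; it uses select plus a udf here, and the names and path are illustrative, not my exact code:)

import json
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Illustrative only: serialize the nested interests struct back to a JSON string
interests_as_json = udf(lambda r: json.dumps(r.asDict()) if r else None, StringType())

raw_df = spark.read.json("raw_data.json")     # placeholder path
flat_df = raw_df.select(
    "created", "class", "sub_class",
    col("properties.meta").alias("meta"),
    interests_as_json(col("properties.interests")).alias("interests"))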

Now I want to convert/split this single row into multiple rows based on the interests column. How can I do that?

Expected output:

---------------------------------------------------------------------------------
| created                 | class   | sub_class  | meta        | key  | value  |
---------------------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key1 | value1 |
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key2 | value2 |
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | keyN | valueN |
---------------------------------------------------------------------------------

Thanks

1 answer:

Answer 0 (score: 0)

Use explode.

Here is a complete example (most of it is just getting the data into place):

import pandas as pd
import pyspark.sql.functions as sql
from pyspark import SparkContext
from pyspark.sql import SQLContext

# sc = SparkContext()
sqlContext = SQLContext(sc)

# Build a small sample row: four scalar fields plus a dict for interests
s = "28-12-2001T12:02:01.143 | Class_A | SubClass_B |some-info| {'key1': 'value1', 'key2':'value2', 'keyN':'valueN'}"
data = [x.strip() for x in s.split('|')]
data = data[:-1] + [eval(data[-1])]          # turn the dict literal into a real dict

# One-row pandas DataFrame, then a Spark DataFrame; the dict column is inferred as a map
p_df = pd.DataFrame(data).T
s_df = sqlContext.createDataFrame(
    p_df, schema=['created', 'class', 'sub_class', 'meta', 'interests'])

# explode on a map column emits one row per key/value pair
s_df.select(s_df.columns[:-1] +
            [sql.explode(s_df.interests).alias("key", "value")]).show()
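
One caveat: explode only yields key/value pairs when the column is a MapType (or the elements of an array). If, as in the question's table, interests has ended up as a JSON-like string, it has to be parsed into a map first. A minimal sketch, assuming a DataFrame flat_df with a string column interests holding valid JSON with double-quoted keys (the helper udf and names are illustrative):

import json
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import MapType, StringType

# Illustrative: parse the JSON string into a map so explode can split it into rows
parse_interests = udf(lambda s: json.loads(s) if s else None,
                      MapType(StringType(), StringType()))

exploded = (flat_df
            .withColumn("interests", parse_interests("interests"))
            .select("created", "class", "sub_class", "meta",
                    explode("interests").alias("key", "value")))
exploded.show(truncate=False)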