Reading a file with json and non-json columns in pyspark

Date: 2017-01-30 06:17:15

Tags: json csv pyspark flatten

I am trying to read and transform a csv file that contains both json and non-json columns. I managed to read the file and put it into a dataframe. The schema looks like this:

root
 |-- 'id': string (nullable = true)
 |-- 'score': string (nullable = true)

If I run df.take(2), I get the following:

[Row('id'=u"'AF03DCAB-EE3F-493A-ACD9-4B98F548E6F3'", 'score'=u"{'topSpeed':15.00000,'averageSpeed':5.00000,'harshBraking':0,'harshAcceleration':0,'driverRating':null,'idlingScore':70,'speedingScore':70,'brakingScore':70,'accelerationScore':70,'totalEcoScore':70 }"), Row('id'=u"'1938A2B9-5EF2-413C-A7A3-C5F324FD4089'", 'score'=u"{'topSpeed':106.00000,'averageSpeed':71.00000,'harshBraking':0,'harshAcceleration':0,'driverRating':9,'idlingScore':76,'speedingScore':87,'brakingScore':86,'accelerationScore':82,'totalEcoScore':83 }")]

The id column is a "normal" column, while the score column contains data in json format. I want to break the json content out into separate columns, but I also need the id column alongside the rest of the data. So far I only have a piece of code that works for the score column on its own:

df = rawdata.select("'score'")
df1 = df.rdd  # Convert to rdd
df2 = df1.flatMap(lambda x: x)  # Flatten rows
dfJsonScore = sqlContext.read.json(df2)
dfJsonScore.printSchema()
dfJsonScore.take(3)

This gives me:

root
 |-- accelerationScore: long (nullable = true)
 |-- averageSpeed: double (nullable = true)
 |-- brakingScore: long (nullable = true)
 |-- driverRating: long (nullable = true)
 |-- harshAcceleration: long (nullable = true)
 |-- harshBraking: long (nullable = true)
 |-- idlingScore: long (nullable = true)
 |-- speedingScore: long (nullable = true)
 |-- topSpeed: double (nullable = true)
 |-- totalEcoScore: long (nullable = true)

[Row(accelerationScore=70, averageSpeed=5.0, brakingScore=70, driverRating=None, harshAcceleration=0, harshBraking=0, idlingScore=70, speedingScore=70, topSpeed=15.0, totalEcoScore=70),
 Row(accelerationScore=82, averageSpeed=71.0, brakingScore=86, driverRating=9, harshAcceleration=0, harshBraking=0, idlingScore=76, speedingScore=87, topSpeed=106.0, totalEcoScore=83),
 Row(accelerationScore=81, averageSpeed=74.0, brakingScore=85, driverRating=9, harshAcceleration=0, harshBraking=0, idlingScore=75, speedingScore=87, topSpeed=102.0, totalEcoScore=82)]

But I cannot figure out how to combine this with the id column.

1 answer:

Answer 0 (score: 3):

There is a brand new from_json function, added in pyspark 2.1, that handles exactly this case.

Given a dataframe with the following schema:

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- score: string (nullable = true)

First, generate a schema for the json field:

>>> score_schema = spark.read.json(df.rdd.map(lambda row: row.score)).schema
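If the json fields are known up front, the schema can also be written out explicitly instead of inferred, which avoids the extra Spark job the inference step triggers. A minimal sketch, with field names and types assumed from the sample rows above:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Hand-written equivalent of the inferred schema (types assumed from the sample data)
score_schema = StructType([
    StructField('topSpeed', DoubleType()),
    StructField('averageSpeed', DoubleType()),
    StructField('harshBraking', LongType()),
    StructField('harshAcceleration', LongType()),
    StructField('driverRating', LongType()),
    StructField('idlingScore', LongType()),
    StructField('speedingScore', LongType()),
    StructField('brakingScore', LongType()),
    StructField('accelerationScore', LongType()),
    StructField('totalEcoScore', LongType()),
])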

Then use it with from_json:

>>> from pyspark.sql.functions import from_json
>>> df.withColumn('score', from_json('score', score_schema)).printSchema()
root
 |-- id: string (nullable = true)
 |-- score: struct (nullable = true)
 |    |-- accelerationScore: long (nullable = true)
 |    |-- averageSpeed: double (nullable = true)
 |    |-- brakingScore: long (nullable = true)
 |    |-- driverRating: long (nullable = true)
 |    |-- harshAcceleration: long (nullable = true)
 |    |-- harshBraking: long (nullable = true)
 |    |-- idlingScore: long (nullable = true)
 |    |-- speedingScore: long (nullable = true)
 |    |-- topSpeed: double (nullable = true)
 |    |-- totalEcoScore: long (nullable = true)
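Once score is a struct, its fields can be promoted to top-level columns next to id with a star expansion. This last step is not in the original answer, but it is what finally splits the json into separate columns:

from pyspark.sql.functions import from_json

# Expand every field of the score struct into its own column, keeping id
flat = df.withColumn('score', from_json('score', score_schema)).select('id', 'score.*')
flat.printSchema()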

Edit

If you cannot use spark 2.1, get_json_object is always an option, but it requires the field to be valid json, i.e. with " as the string delimiter instead of '. See this example:

from pyspark.sql.functions import get_json_object, regexp_replace

df.withColumn('score', regexp_replace('score', "'", "\"")) \
    .select(
        'id', 
        get_json_object('score', '$.accelerationScore').alias('accelerationScore'), 
        get_json_object('score', '$.topSpeed').alias('topSpeed')
    ).show()

+--------------------+-----------------+--------+
|                  id|accelerationScore|topSpeed|
+--------------------+-----------------+--------+
|AF03DCAB-EE3F-493...|               70|    15.0|
|1938A2B9-5EF2-413...|               82|   106.0|
+--------------------+-----------------+--------+
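Spelling out one get_json_object call per field gets tedious with ten fields. Assuming the inferred score_schema from above is available, the select list can be generated programmatically instead; note that get_json_object always returns string columns, so each one is cast back to its inferred type. A sketch:

from pyspark.sql.functions import get_json_object, regexp_replace

# One extraction per field of the inferred schema, cast back to the
# original type since get_json_object always returns a string column
cols = [get_json_object('score', '$.' + f.name).cast(f.dataType).alias(f.name)
        for f in score_schema.fields]
df.withColumn('score', regexp_replace('score', "'", '"')) \
  .select('id', *cols) \
  .show()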