Is there a generic way to read multiline JSON in Spark? More specifically, in PySpark?

Asked: 2019-01-16 04:21:52

Tags: python json apache-spark pyspark

I have multiline JSON like this:

  

{"_id": {"$oid": "50b59cd75bed76f46522c34e"}, "student_id": 0, "class_id": 2, "scores": [{"type": "exam", "score": 57.92947112575566}, {"type": "quiz", "score": 21.24542588206755}, {"type": "homework", "score": 68.19567810587429}, {"type": "homework", "score": 67.95019716560351}, {"type": "homework", "score": 18.81037253352722}]}

This is just one row of the JSON, and there are other files like it. I am looking for a way to read this in PySpark/Spark. Can it be done independently of the JSON format?

I need the output with the "scores" entries as individual columns, e.g. scores_exam should be a column with the value 57.92947112575566, and scores_quiz another column with the value 21.24542588206755.

Thanks for any help.
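To make the desired output shape concrete, here is a plain-Python sketch (no Spark involved) of the flattening the question describes; the record is abridged from the one above, and the scores_&lt;type&gt; naming follows the question's example. Note that with repeated types (the three homework entries) a simple dict would keep only one value, so they are omitted here.

```python
import json

# one record in the shape shown above (abridged to the two unique types)
raw = '''{"_id": {"$oid": "50b59cd75bed76f46522c34e"}, "student_id": 0,
"class_id": 2, "scores": [{"type": "exam", "score": 57.92947112575566},
{"type": "quiz", "score": 21.24542588206755}]}'''

record = json.loads(raw)

# flatten: each scores entry becomes a scores_<type> key
flat = {"student_id": record["student_id"], "class_id": record["class_id"]}
for entry in record["scores"]:
    flat["scores_" + entry["type"]] = entry["score"]

print(flat["scores_exam"])  # → 57.92947112575566
```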

1 Answer:

Answer 0 (score: 2)

Yes.

Use the multiline option set to true:

from pyspark.sql.functions import explode, col

df = spark.read.option("multiline", "true").json("multi.json")

You will get the following output:

+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|_id                       |class_id|scores                                                                                                                                            |student_id|
+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|[50b59cd75bed76f46522c34e]|2       |[[57.92947112575566, exam], [21.24542588206755, quiz], [68.1956781058743, homework], [67.95019716560351, homework], [18.81037253352722, homework]]|0         |
+--------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+----------+

Then add these lines to get:

df2 = (df.withColumn("scores", explode(col("scores")))
         .select(col("_id.*"), col("class_id"), col("scores.*"), col("student_id")))

+------------------------+--------+-----------------+--------+----------+
|$oid                    |class_id|score            |type    |student_id|
+------------------------+--------+-----------------+--------+----------+
|50b59cd75bed76f46522c34e|2       |57.92947112575566|exam    |0         |
|50b59cd75bed76f46522c34e|2       |21.24542588206755|quiz    |0         |
|50b59cd75bed76f46522c34e|2       |68.1956781058743 |homework|0         |
|50b59cd75bed76f46522c34e|2       |67.95019716560351|homework|0         |
|50b59cd75bed76f46522c34e|2       |18.81037253352722|homework|0         |
+------------------------+--------+-----------------+--------+----------+

Note that we are using Spark's "col" and "explode" functions, so you need the following import for them to work:

from pyspark.sql.functions import explode, col

You can read more about parsing multiline JSON files on the following page:

https://docs.databricks.com/spark/latest/data-sources/read-json.html

Thanks.