I am using Apache Spark to parse JSON files. How do I get nested keys from a JSON file, whether they are arrays or nested keys?

Date: 2017-04-25 13:19:32

Tags: java json apache-spark apache-spark-sql

I have multiple JSON files holding JSON data. The JSON structure looks like this:

root
 |-- Age: long (nullable = true)
 |-- Company: struct (nullable = true)
 |    |-- Company Name: string (nullable = true)
 |    |-- Domain: string (nullable = true)
 |-- Designation: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Test: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- location: struct (nullable = true)
 |    |-- City: struct (nullable = true)
 |    |    |-- City Name: string (nullable = true)
 |    |    |-- Pin: long (nullable = true)
 |    |-- State: string (nullable = true)
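
For reference, a single record consistent with this schema might look like the following (the concrete values are an assumption, pieced together from the sample output shown further down):

```json
{
  "Age": 22,
  "Company": { "Company Name": "Elegant MicroWeb", "Domain": "Java" },
  "Designation": "Trainee Programmer",
  "Email": "vpn2330@gmail.com",
  "Name": "Vipin Suman",
  "Test": ["Test1", "Test2"],
  "location": {
    "City": { "City Name": "Ahmedabad", "Pin": 324009 },
    "State": "Gujarat"
  }
}
```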

I tried this:

+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age|       Company|       Designation|            Email|       Name|          Test|            location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330@gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+

The table I am looking for is:

Age | Company Name     | Domain | Designation | Email             | Name        | Test  | City Name | Pin    | State
22  | Elegant MicroWeb | Java   | Programmer  | vpn2330@gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22  | Elegant MicroWeb | Java   | Programmer  | vpn2330@gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 | Gujarat


How can I get the table in the above format? I have tried everything. I am new to Apache Spark; can anyone help me out?

2 Answers:

Answer 0 (score: 0)

I suggest you do your work in Scala, which is better supported by Spark. For your task you can use the "select" API to pick specific columns and aliases to rename columns; this article explains how to work with complex data formats: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

To get the result you describe, you will also need the "explode" API (see Flattening Rows in Spark).

Answer 1 (score: 0)

In Scala this can be done as follows:

import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables the $"column" syntax

people.select(
  $"Age",
  $"Company.*",
  $"Designation",
  $"Email",
  $"Name",
  explode($"Test"),
  $"location.City.*",
  $"location.State")

Unfortunately, the equivalent Java code fails, because a Column created with col() cannot resolve the `.*` wildcard:

people.select(
  people.col("Age"),
  people.col("Company.*"),
  people.col("Designation"),
  people.col("Email"),
  people.col("Name"),
  explode(people.col("Test")),
  people.col("location.City.*"),
  people.col("location.State"));

You can use selectExpr instead:

people.selectExpr(
  "Age",
  "Company.*",
  "Designation",
  "Email",
  "Name",
  "EXPLODE(Test) AS Test",
  "location.City.*",
  "location.State");

PS: you can pass the path of a directory to sparkSession.read().json(jsonFiles); instead of a list of JSON files.
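
Putting the pieces together, a complete Java sketch might look like the following. The class name, the embedded sample record, and the local master setting are assumptions for illustration; in a real job you would read from your JSON directory instead (note that `spark.read().json(Dataset<String>)` requires Spark 2.2 or later):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlattenJson {

    // Builds the flattened view; the sample record is an assumption
    // reconstructed from the output shown in the question.
    public static Dataset<Row> flatten(SparkSession spark) {
        String record = "{\"Age\":22,"
                + "\"Company\":{\"Company Name\":\"Elegant MicroWeb\",\"Domain\":\"Java\"},"
                + "\"Designation\":\"Trainee Programmer\","
                + "\"Email\":\"vpn2330@gmail.com\","
                + "\"Name\":\"Vipin Suman\","
                + "\"Test\":[\"Test1\",\"Test2\"],"
                + "\"location\":{\"City\":{\"City Name\":\"Ahmedabad\",\"Pin\":324009},"
                + "\"State\":\"Gujarat\"}}";

        // In a real job this would be spark.read().json("path/to/dir");
        // the record is embedded here so the example is self-contained.
        Dataset<Row> people = spark.read().json(
                spark.createDataset(Collections.singletonList(record), Encoders.STRING()));

        // selectExpr takes SQL fragments: "Company.*" expands the struct,
        // and EXPLODE(Test) emits one row per array element.
        return people.selectExpr(
                "Age",
                "Company.*",
                "Designation",
                "Email",
                "Name",
                "EXPLODE(Test) AS Test",
                "location.City.*",
                "location.State");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("flatten-json")
                .master("local[*]")
                .getOrCreate();
        flatten(spark).show(false);
        spark.stop();
    }
}
```

Running this prints one row per element of the Test array, with the struct fields (Company Name, Domain, City Name, Pin) promoted to top-level columns.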