How to parse a JSON file with Spark

Date: 2016-10-21 03:48:47

Tags: json scala apache-spark

I have a JSON file to parse. The JSON format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}

I need to get every field in the file. How do I get "major" out of the educations array? And I need to be able to get the state with df.select("cv_parse.basic_info.location.state").

This is the result I want:

cv_id   major     degree    birthyear   state
001     English   Bachelor  1984        New York
001     English   Master    1984        New York

1 Answer:

Answer 0 (score: 0)

This may not be the best way, but you can give it a try.


// import the implicit functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")

jsonDf.printSchema

Your schema will be:

root
 |-- cv_id: string (nullable = true)
 |-- cv_parse: struct (nullable = true)
 |    |-- basic_info: struct (nullable = true)
 |    |    |-- birthyear: string (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |-- educations: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- degree: string (nullable = true)
 |    |    |    |-- major: string (nullable = true)

Now you need to explode the educations array:

val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Your schema will now be:

 root
 |-- cv_id: string (nullable = true)
 |-- col: struct (nullable = true)
 |    |-- degree: string (nullable = true)
 |    |-- major: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- state: string (nullable = true)
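From here you can select the flattened columns to produce the table asked for in the question. A minimal sketch (the `as` aliases and the `result` name are illustrative, not from the original answer):

```scala
// pull the struct fields out of the exploded column and rename them
val result = explodedResult.select(
  $"cv_id",
  $"col.major".as("major"),
  $"col.degree".as("degree"),
  $"birthyear",
  $"state")

result.show
```

Each element of the educations array becomes its own row, so cv_id, birthyear, and state are repeated once per degree, matching the expected output above.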