Loading JSON data into Hive using Spark SQL

Date: 2016-06-10 18:49:59

Tags: json scala apache-spark hive apache-spark-sql

I am unable to push JSON data into Hive. Below is the sample JSON data and what I have tried so far. Please point out what I am missing.

JSON data:

    {
      "Employees" : [
        {
          "userId":"rirani",
          "jobTitleName":"Developer",
          "firstName":"Romin",
          "lastName":"Irani",
          "preferredFullName":"Romin Irani",
          "employeeCode":"E1",
          "region":"CA",
          "phoneNumber":"408-1234567",
          "emailAddress":"romin.k.irani@gmail.com"
        },
        {
          "userId":"nirani",
          "jobTitleName":"Developer",
          "firstName":"Neil",
          "lastName":"Irani",
          "preferredFullName":"Neil Irani",
          "employeeCode":"E2",
          "region":"CA",
          "phoneNumber":"408-1111111",
          "emailAddress":"neilrirani@gmail.com"
        },
        {
          "userId":"thanks",
          "jobTitleName":"Program Directory",
          "firstName":"Tom",
          "lastName":"Hanks",
          "preferredFullName":"Tom Hanks",
          "employeeCode":"E3",
          "region":"CA",
          "phoneNumber":"408-2222222",
          "emailAddress":"tomhanks@gmail.com"
        }
      ]
    }

I tried loading the file with sqlContext's jsonFile method, but it cannot parse the JSON:

val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show 

The error is: java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)

I tried a different way and was able to hack it and get the schema:

val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp = sqlc.sql("""
  select Employees[1].userId            as ID,
         Employees[1].jobTitleName      as Title,
         Employees[1].firstName         as FirstName,
         Employees[1].lastName          as LastName,
         Employees[1].preferredFullName as PreferredName,
         Employees[1].employeeCode      as empCode,
         Employees[1].region            as Region,
         Employees[1].phoneNumber       as Phone,
         Employees[1].emailAddress      as email
  from employee""")
emp.show // displays only the second array element (index 1)

I can get the data and schema for each record individually this way, but I am missing how to fetch all of the records and load them into Hive.

Any help or suggestions are much appreciated.

2 Answers:

Answer 0 (score: 3)

Here is the answer I hacked together.

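In outline, the idea is to explode the Employees array so each element becomes its own row, then persist the result. A sketch, assuming sqlc is a HiveContext (needed both for LATERAL VIEW and for writing to Hive) and reusing the employee temp table registered above; the target table name employees is illustrative:

    val flattened = sqlc.sql("""
      SELECT e.userId, e.jobTitleName, e.firstName, e.lastName,
             e.preferredFullName, e.employeeCode, e.region,
             e.phoneNumber, e.emailAddress
      FROM employee
      LATERAL VIEW explode(Employees) exploded AS e
    """)
    // saveAsTable needs a HiveContext to create a managed Hive table
    flattened.write.saveAsTable("employees")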

Any suggestions for optimizing the above are welcome.

Answer 1 (score: 0)

Spark SQL only supports reading JSON files when the file contains one JSON object per line.

SQLContext.scala

  /**
   * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
   * It goes through the entire dataset once to determine the schema.
   *
   * @group specificdata
   * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
   */
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
    read.json(path)
  }
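
For reference, the non-deprecated form of the call from the question looks like this; it is still subject to the same one-object-per-line restriction:

    val f = sqlc.read.json("file:///home/vm/Downloads/emp.json")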

Your file should look like this (strictly speaking, it is then not a single valid JSON document):

{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"} 
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}

Have a look at the still-open JIRA issue. I don't think it is a priority for the project, just noting it for the record.

You have two options:

  1. Convert your JSON data to the supported format, one object per line (a conversion sketch follows this list), or
  2. keep one file per JSON object - although this results in far too many files.

Note that SQLContext.jsonFile is deprecated; use SQLContext.read.json instead.
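
A sketch of option 1 done in Spark itself: parse the whole multi-line file, flatten the Employees array with explode, and rewrite the result as one object per line (paths are illustrative; sc and sqlc are the contexts from the question):

    import org.apache.spark.sql.functions.explode
    import sqlc.implicits._

    // each element of the RDD is one complete JSON document, which the
    // parser accepts even though it spans multiple lines in the file
    val whole  = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json").map(_._2)
    val nested = sqlc.read.json(whole)

    // one row per element of the Employees array
    val flat = nested
      .select(explode($"Employees").as("e"))
      .select($"e.userId", $"e.jobTitleName", $"e.firstName", $"e.lastName",
              $"e.preferredFullName", $"e.employeeCode", $"e.region",
              $"e.phoneNumber", $"e.emailAddress")

    // DataFrameWriter.json emits exactly the one-object-per-line format
    flat.write.json("file:///home/vm/Downloads/emp_lines")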

There are examples in the Spark documentation.