Dynamic query preparation and execution in Spark

Date: 2018-11-27 05:37:14

Tags: apache-spark apache-spark-sql apache-spark-mllib

In Spark, this JSON is in a DataFrame (DF). We now have to navigate to tables (inside the JSON, under cust), read the first table block, and build a SQL query from it. For example: SELECT CUST_NAME FROM CUST WHERE CUST_ID = 112

This query has to be executed against the database and its result stored in a JSON file.

{
     "cust": "Retails",
     "tables": [
        {
             "Name":"customer",
             "table_NAME":"cust",
             "param1":"cust_id",  
             "val":"112",
             "op":"cust_name"
        },
        {
             "Name":"sales",
             "table_NAME":"sale",
             "param1":"country",  
             "val":"ind",
             "op":"monthly_sale"
         }]
}

root
 |-- cust: string (nullable = true)
 |-- tables: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- op: string (nullable = true)
 |    |    |-- param1: string (nullable = true)
 |    |    |-- table_NAME: string (nullable = true)
 |    |    |-- val: string (nullable = true)
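
For context, this is how such a DataFrame and schema can be produced; a minimal sketch, assuming Spark 2.2+ (for the multiLine option) and the file saved at /tmp/sample.json:

    // Read the multi-line JSON document into a DataFrame and inspect its schema
    val df = spark.read.option("multiLine", true).json("/tmp/sample.json")
    df.printSchema  // prints the schema shown above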

The same applies to the second table block. For example: SELECT MONTHLY_SALE FROM SALE WHERE COUNTRY = 'IND'

This query must also be executed against the database, and its result stored in the same JSON file as above.

What is the best way to do this? Any ideas?

1 Answer:

Answer 0 (score: 0)

Here is how I achieved this. I used spark-shell for the whole solution. These are the prerequisites:

  1. json-serde

  2. Download this jar
  3. Extract the zip file to any location

  4. Now run spark-shell with this command

    spark-shell --jars path/to/jars/json-serde-cdh5-shim-1.3.7.3.jar,path/to/jars/json-serde-1.3.7.3.jar,path/to/jars/json-1.3.7.3.jar
    

Your JSON document:

{
 "cust": "Retails",
 "tables": [
    {
         "Name":"customer",
         "table_NAME":"cust",
         "param1":"cust_id",  
         "val":"112",
         "op":"cust_name"
    },
    {
         "Name":"sales",
         "table_NAME":"sale",
         "param1":"country",  
         "val":"ind",
         "op":"monthly_sale"
     }]
}

Collapsed to a single line (the Hive JSON SerDe reads one JSON document per line):

{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}

I have saved this JSON to /tmp/sample.json

Now for the spark-sql part:

  1. Create a table based on the JSON schema

    sql("CREATE TABLE json_table(cust string,tables array<struct<Name: string,table_NAME:string,param1:string,val:string,op:string>>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
    
  2. Now load the JSON data into the table

    sql("LOAD DATA LOCAL INPATH  '/tmp/sample.json' OVERWRITE INTO TABLE json_table")
    
  3. Now I will use a Hive lateral view (LATERAL VIEW explode) to flatten the tables array

    val ans = sql("SELECT myCol FROM json_table LATERAL VIEW explode(tables) myTable AS myCol")
    
  4. Schema of the returned result:

        ans.printSchema
        root
         |-- myCol: struct (nullable = true)
         |    |-- Name: string (nullable = true)
         |    |-- table_NAME: string (nullable = true)
         |    |-- param1: string (nullable = true)
         |    |-- val: string (nullable = true)
         |    |-- op: string (nullable = true)
    
  5. Result of ans.show:

         ans.show
         +--------------------+
         |               myCol|
         +--------------------+
         |[customer,cust,cu...|
         |[sales,sale,count...|
         +--------------------+
    
  6. Now I assume there can be two kinds of values, e.g. cust_id is numeric while country is a string, so I am adding a method that identifies the type from the value itself. For example:

    def isAllDigits(x: String) = x.forall(Character.isDigit)  // true when the value consists only of digits
    

    Note: you can identify the type in any way you like; a sturdier alternative is sketched after step 7.

  7. Now build the queries from the JSON data

    // Collect to the driver so the println output appears in the shell
    ans.collect.foreach(f => {
      // f.toString looks like "[[customer,cust,cust_id,112,cust_name]]":
      // split on commas and strip the trailing "]]" from the last field
      val splitted_string = f.toString.split(",")
      val op = splitted_string(4).substring(0, splitted_string(4).size - 2)
      val table_NAME = splitted_string(1)
      val param1 = splitted_string(2)
      val value = splitted_string(3)
      if (isAllDigits(value)) {
        println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "=" + value)
      } else {
        // quote non-numeric values
        println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "='" + value + "'")
      }
    })
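
Parsing f.toString is brittle (it breaks as soon as a value contains a comma). Here is a minimal sketch of a sturdier variant that reads the struct fields by name and uses a Try-based numeric check instead of isAllDigits, assuming the same ans DataFrame as above:

    import scala.util.Try

    ans.collect.foreach { row =>
      val t = row.getStruct(0)                      // the exploded struct column
      val table_NAME = t.getAs[String]("table_NAME")
      val param1     = t.getAs[String]("param1")
      val value      = t.getAs[String]("val")
      val op         = t.getAs[String]("op")
      // quote the value unless it parses as a number
      val rhs = if (Try(value.toLong).isSuccess) value else s"'$value'"
      println(s"SELECT $op FROM $table_NAME WHERE $param1=$rhs")
    }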

Here are the results I get:

SELECT cust_name FROM cust WHERE cust_id=112
SELECT monthly_sale FROM sale WHERE country='ind'
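
The question also asks to execute the generated queries and store the results in a JSON file. A minimal sketch of that last step, assuming the target tables (cust, sale) are registered in the metastore and using a hypothetical output path:

    // Run each generated query and append its result to a JSON output directory.
    // Table names and the output path are assumptions for illustration.
    val queries = Seq(
      "SELECT cust_name FROM cust WHERE cust_id=112",
      "SELECT monthly_sale FROM sale WHERE country='ind'")
    queries.foreach { q =>
      sql(q).write.mode("append").json("/tmp/query_results")
    }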