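In Spark, this JSON is held in a DataFrame (DF). We need to navigate to tables (in the JSON, based on cust), read the first block of tables, and prepare a SQL query from it.
For example: SELECT CUST_NAME FROM CUST WHERE CUST_ID = 112
We then have to execute this query against the database and store the result in a JSON file.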
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
root
|-- cust: string (nullable = true)
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- op: string (nullable = true)
| | |-- param1: string (nullable = true)
| | |-- table_NAME: string (nullable = true)
| | |-- val: string (nullable = true)
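The same applies to the second block of tables.
For example: SELECT MONTHLY_SALE FROM SALE WHERE COUNTRY = 'IND'
This query also has to be executed against the database, and its result has to be stored in the same JSON file.
What is the best way to do this? Any ideas?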
Answer 0 (score: 0)
Here is how I achieved this. I used spark-shell for the whole solution. These are the prerequisites:
Extract the zip file to any location
Now start spark-shell with this command:
spark-shell --jars path/to/jars/json-serde-cdh5-shim-1.3.7.3.jar,path/to/jars/json-serde-1.3.7.3.jar,path/to/jars/json-1.3.7.3.jar
Your JSON document:
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
Collapsed (single-line) version:
{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}
I placed this JSON at /tmp/sample.json
Now for the spark-sql part:
Create a table based on the JSON schema:
sql("CREATE TABLE json_table(cust string,tables array<struct<Name: string,table_NAME:string,param1:string,val:string,op:string>>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
Now load the JSON data into the table:
sql("LOAD DATA LOCAL INPATH '/tmp/sample.json' OVERWRITE INTO TABLE json_table")
Now I will use Hive's LATERAL VIEW:
val ans=sql("SELECT myCol FROM json_table LATERAL VIEW explode(tables) myTable as myCol")
Schema of the returned result:
ans.printSchema
root
|-- table: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- table_NAME: string (nullable = true)
| |-- param1: string (nullable = true)
| |-- val: string (nullable = true)
| |-- op: string (nullable = true)
Result of ans.show:
ans.show
+--------------------+
| table|
+--------------------+
|[customer,cust,cu...|
|[sales,sale,count...|
+--------------------+
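As an alternative to the SerDe table and LATERAL VIEW above, the same one-row-per-block view can be sketched with the DataFrame API. This is only a sketch under assumptions that are not in the original answer: it assumes spark-shell on Spark 2.2 or later, where `spark` is predefined, and it inlines the collapsed JSON so that it does not depend on /tmp/sample.json. The names `json`, `df` and `tablesDf` are illustrative.

```scala
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._ // assumption: `spark` is the SparkSession predefined by spark-shell

// The collapsed JSON from above, inlined so the sketch needs no file.
val json = """{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}"""

// Let Spark infer the schema from the JSON string.
val df = spark.read.json(Seq(json).toDS)

// explode yields one row per element of the tables array;
// "t.*" then flattens the struct fields into ordinary columns.
val tablesDf = df.select(explode(col("tables")).as("t")).select("t.*")

tablesDf.show(false)
```

With the fields as separate columns, each value can be read by name instead of being parsed out of a row's string form.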
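Now I am assuming there can be two types of data: for example, cust_id is of Number type and country is of String type. I am adding a method that identifies the data type from its value. For example:
def isAllDigits(x: String): Boolean = x.nonEmpty && x.forall(_.isDigit)
Note: you can use your own way of identifying this.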
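One possible "own way" is sketched below. It is a hypothetical helper, not part of the original answer: `SqlLiteral.render` treats signed and decimal numbers as numeric, and doubles any single quote inside a string value so the generated SQL stays well-formed.

```scala
// Hypothetical helper (not from the original answer): render a JSON string
// value as a SQL literal.
object SqlLiteral {
  // Optional sign, digits, optional fractional part.
  private val Numeric = """[+-]?\d+(\.\d+)?""".r

  def render(value: String): String = value match {
    case Numeric(_*) => value                                // numeric: emit unchanged
    case _           => "'" + value.replace("'", "''") + "'" // string: quote and escape
  }
}

println(SqlLiteral.render("112")) // numeric branch
println(SqlLiteral.render("ind")) // string branch
```

Guessing the type from the value can still misclassify data, for instance a string column that happens to contain only digits, so looking up the actual column type in the database is the more reliable option where it is available.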
Now create the queries based on the JSON data:
ans.foreach { (f: org.apache.spark.sql.Row) =>
  // Each row holds one struct from the exploded tables array.
  // Field order follows the table definition: Name, table_NAME, param1, val, op
  val t = f.getStruct(0)
  val table_NAME = t.getString(1)
  val param1 = t.getString(2)
  val value = t.getString(3)
  val op = t.getString(4)
  if (isAllDigits(value)) {
    // Numeric value: no quotes
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "=" + value)
  } else {
    // String value: wrap in single quotes
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "='" + value + "'")
  }
}
This is the result I got:
SELECT cust_name FROM cust WHERE cust_id=112
SELECT monthly_sale FROM sale WHERE country='ind'
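Since the question also asks to execute the queries, it may help to return them as values rather than print them. The following is a minimal sketch with hypothetical names (`TableSpec`, `QueryBuilder`) that are not in the original answer. It also rejects table and column names that are not plain identifiers, on the assumption that the JSON may not be fully trusted.

```scala
// Hypothetical model of one entry of the "tables" array.
final case class TableSpec(name: String, tableName: String, param1: String, value: String, op: String)

object QueryBuilder {
  // Assumption: valid identifiers consist of letters, digits and underscores only.
  private val Identifier = "[A-Za-z_][A-Za-z0-9_]*".r

  private def isIdentifier(s: String): Boolean = Identifier.pattern.matcher(s).matches()

  // Returns Right(query) on success, or Left(error) if an identifier looks unsafe.
  def build(t: TableSpec): Either[String, String] =
    Seq(t.op, t.tableName, t.param1).find(id => !isIdentifier(id)) match {
      case Some(bad) => Left(s"invalid identifier: $bad")
      case None =>
        // Same rule as in the answer: all digits means numeric, otherwise quote the value.
        val literal =
          if (t.value.nonEmpty && t.value.forall(_.isDigit)) t.value
          else "'" + t.value.replace("'", "''") + "'"
        Right(s"SELECT ${t.op} FROM ${t.tableName} WHERE ${t.param1}=$literal")
    }
}

val specs = Seq(
  TableSpec("customer", "cust", "cust_id", "112", "cust_name"),
  TableSpec("sales", "sale", "country", "ind", "monthly_sale")
)
val queries = specs.map(QueryBuilder.build)
queries.foreach(println)
```

Each successful query string could then be run against the database, for example through Spark's JDBC data source by passing it as a parenthesized subquery in the dbtable option, and the resulting DataFrame written out with write.json. The connection URL, credentials and output path depend on the environment and are not given in the original.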