Getting attribute lineage from a Spark logical plan

Time: 2019-08-13 04:31:44

Tags: java scala apache-spark apache-spark-sql

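I need to get attribute/column lineage from a SQL query.

By lineage I mean the relationship between a query's output columns and the input columns they are derived from. Consider the example below.

I have two tables.

Order table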

OrderID  CustomerID  OrderDate
10308     2          9/18/1996
10309     37         9/19/1996
10310     77         9/20/1996

Customer table

CustomerID  CustomerName  ContactName     Country
1           Riya          Maria Anders    Germany
2           John          Ana Trujillo    Mexico
3           Anthany       Antonio Moreno  Mexico

I read both tables as DataFrames using the SparkSession and register them as temp views.

var oDF = sparkSession.
  read.
  format("csv").
  option("header", EdlConstants.TRUE).
  option("inferschema", EdlConstants.TRUE).
  option("delimiter", ",").
  option("encoding", EdlConstants.UTF8).
  option("multiline", true).
  load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\order.csv")

println("sample data in oDF====>")
oDF.show()

var cusDF = sparkSession.
  read.
  format("csv").
  option("header", EdlConstants.TRUE).
  option("inferschema", EdlConstants.TRUE).
  option("delimiter", ",").
  option("encoding", EdlConstants.UTF8).
  option("multiline", true).
  load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\customer.csv")

println("sample data in cusDF====>")
cusDF.show()


oDF.createOrReplaceTempView("orderTempView")
cusDF.createOrReplaceTempView("customerTempView")

Now I want to join the two tables, run a query over the join, and give the selected columns aliases, as follows.

val res = sqlContext.sql("select OID as OID_new, CID as CID_new from (select ot.OrderID as OID,ct.CustomerID as CID,ot.OrderID+ct.CustomerName as MID  from orderTempView as ot inner join customerTempView as ct on ot.CustomerID = ct.CustomerID)")
val analyzedPlan = res.queryExecution.analyzed

The analyzed plan looks like this.

Project [OID#36 AS OID_new#39, CID#37 AS CID_new#40]
+- Project [OrderID#0 AS OID#36, CustomerID#15 AS CID#37, (cast(OrderID#0 as double) + cast(CustomerName#16 as double)) AS MID#38]
   +- Join Inner, (CustomerID#1 = CustomerID#15)
      :- SubqueryAlias ot
      :  +- SubqueryAlias ordertempview
      :     +- Relation[OrderID#0,CustomerID#1,OrderDate#2] csv
      +- SubqueryAlias ct
         +- SubqueryAlias customertempview
            +- Relation[CustomerID#15,CustomerName#16,ContactName#17,Country#18] csv

This logical plan appears to be a tree structure.
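Because the analyzed plan is a Catalyst tree, one way to get a feel for it is to walk it through `children`, or to use `collect` to pull out specific node types. The following is only a minimal sketch: it assumes Spark's internal `org.apache.spark.sql.catalyst` classes are available (they are not a stable public API and can change between versions), and `PlanWalk`, `describe`, and `aliasesIn` are hypothetical names introduced here for illustration.

```scala
import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Hypothetical helper object, not part of Spark.
object PlanWalk {
  // One line per plan node, indented by depth, listing each node's
  // output attributes as name#exprId.
  def describe(plan: LogicalPlan, depth: Int = 0): Seq[String] = {
    val outputs = plan.output.map(a => s"${a.name}#${a.exprId.id}").mkString(", ")
    val line = ("  " * depth) + s"${plan.nodeName} [$outputs]"
    line +: plan.children.flatMap(child => describe(child, depth + 1))
  }

  // For every Alias found in a Project node, pair the alias name with the
  // names of the attributes its child expression references.
  def aliasesIn(plan: LogicalPlan): Seq[(String, Seq[String])] =
    plan.collect { case p: Project => p.projectList }.flatten.collect {
      case a: Alias => a.name -> a.child.references.toSeq.map(_.name)
    }
}
```

With the plan from the question, `PlanWalk.describe(analyzedPlan).foreach(println)` would print the node hierarchy, and `PlanWalk.aliasesIn(analyzedPlan)` would list each alias next to the attribute names it reads from.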

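So I need to extract the output attributes and the input attributes they come from.

Specifically, I need to get the relationship as a Map like the one below.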

Map("OID_NEW" -> "orderTempView.OrderID", "CID_NEW"->"customerTempView.CustomerID")

Here OID_NEW is the output column and orderTempView.OrderID is its input column; the same applies to the other column.

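The problem is how to traverse this logical plan (analyzedPlan). How can I extract data from it in a useful way?

If anyone can help me get output like this from analyzedPlan, I would really appreciate it. Thanks!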
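One possible direction, offered as a hedged sketch rather than a definitive solution, is to follow expression IDs (the `#36`, `#0` suffixes in the plan): each `Alias` in a `Project` links a new ID to the IDs its expression references, and each `SubqueryAlias` sitting directly on a leaf relation gives a view name for the source attributes. The names `ColumnLineage` and `columnLineage` are hypothetical, the code relies on Spark's internal Catalyst classes, and it assumes a plan shaped like the one above (only `Project` aliases are followed; other Spark versions may insert extra nodes between the view alias and the relation, which this sketch does not handle).

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, ExprId}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, SubqueryAlias}

// Hypothetical helper object, not part of Spark.
object ColumnLineage {
  // Maps each output column name to the set of "view.column" inputs it depends on.
  def columnLineage(plan: LogicalPlan): Map[String, Set[String]] = {
    // Alias exprId -> exprIds referenced by the aliased expression.
    val aliasSources: Map[ExprId, Set[ExprId]] =
      plan.collect { case p: Project => p.projectList }.flatten.collect {
        case a: Alias => a.exprId -> a.child.references.toSeq.map(_.exprId).toSet
      }.toMap

    // Source attributes: outputs of a SubqueryAlias whose child is a leaf relation.
    // Assumption: the view alias sits directly on top of the relation.
    val leafNames: Map[ExprId, String] =
      plan.collect {
        case sa: SubqueryAlias if sa.child.children.isEmpty =>
          sa.output.map(attr => attr.exprId -> s"${sa.alias}.${attr.name}")
      }.flatten.toMap

    // Follow alias links until a source attribute is reached.
    def resolve(id: ExprId): Set[String] =
      leafNames.get(id) match {
        case Some(name) => Set(name)
        case None       => aliasSources.getOrElse(id, Set.empty[ExprId]).flatMap(resolve)
      }

    plan.output.map(attr => attr.name -> resolve(attr.exprId)).toMap
  }
}
```

The values are sets because a column such as MID depends on more than one input. For a plan with exactly the shape printed above, `ColumnLineage.columnLineage(analyzedPlan)` should map OID_new to ordertempview.OrderID and CID_new to customertempview.CustomerID; note that the view names come out lowercased, as they appear in the plan.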

0 answers:

No answers yet