I am using Spark with Scala to convert my existing SQL queries to DataFrames. My query performs multiple inner joins. I could simply run it through sqlContext.sql(""), but my team does not want to use sqlContext and prefers the operations to be performed on top of DataFrames. The join portion of the query looks like this:
si s inner join
ac a on s.cid = a.cid and s.sid =a.sid
inner join De d on s.cid = d.cid AND d.aid = a.aid
inner join SGrM sgm on s.cid = sgm.cid and s.sid =sgm.sid and sgm.status=1
inner join SiGo sg on sgm.cid =sg.cid and sgm.gid =sg.gid
inner join bg bu on s.cid = bu.cid and s.sid =bu.sid
inner join ls al on a.AtLId = al.lid
inner join ls rl on a.RtLId = rl.lid
inner join ls vl on a.VLId = vl.lid
From my searching I know that we can use a recursive join:
List(df1, df2, df3, dfN).reduce((a, b) => a.join(b, joinCondition))
but I cannot fit the joins above into that form, because each join involves multiple conditions. How can I do this?
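For illustration only, something along these lines is what I imagine: a fold that carries a separate condition for each pair. The names si, ac, de below are placeholders for DataFrames assumed to be loaded from the tables above; I am not sure this is the right approach:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Sketch only: si, ac, de, ... are hypothetical DataFrames loaded from the tables above.
// Each entry pairs the next DataFrame with the condition joining it to the running result.
val joins: Seq[(DataFrame, Column)] = Seq(
  (ac, si("cid") === ac("cid") && si("sid") === ac("sid")),
  (de, si("cid") === de("cid") && de("aid") === ac("aid"))
  // ... remaining tables and their join conditions
)

val joined: DataFrame = joins.foldLeft(si) { case (acc, (df, cond)) => acc.join(df, cond) }
```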
Answer 0 (score: 1)
First, switch from DataFrames to Datasets and Spark 2.x to get better performance by avoiding JVM object overhead (the Tungsten redesign).
Now, to the question: let's say you have 4 Datasets.
First create a schema (case class) for your tables:
case class DS (id: Int, colA: String)
Then read the files with the optimizations enabled:
val ds1 = spark.read.parquet("X1").as[DS]
val ds2 = spark.read.parquet("X2").as[DS]
val ds3 = spark.read.parquet("X3").as[DS]
val ds4 = spark.read.parquet("X4").as[DS]
Now you can join them one by one, so you can track the data flow (use broadcast only when the table is small):
import org.apache.spark.sql.functions.{broadcast, col}

case class JoinedDS (colB: String)

val joinedDS = ds1.join(broadcast(ds2), Seq("id"), "inner")
  .join(ds3, Seq("id", "colB"), "inner")
  .join(ds4, Seq("id"), "inner")
  .select(col("colB"))
  .as[JoinedDS]
Answer 1 (score: 1)
You can join multiple DataFrames with multiple conditions like this:
val result = df1.as("df1").join(df2.as("df2"),
$"df1.col1"===$df2.col1" && $"df1.col2"===$df2.col2").join(df3.as("df3"),
$"df3.col1"===$df2.col1" && $"df3.col2"===$df2.col2", "left_outer")
Answer 2 (score: 0)
Below is an example of joining six tables/DataFrames (without using SQL).
retail_db is a well-known sample database; anyone can get it from Google.
Problem: get all customers from TX who bought fitness items.
val df_customers = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "customers").option("user", "root").option("password", "root").load()
val df_products = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "products").option("user", "root").option("password", "root").load()
val df_orders = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "orders").option("user", "root").option("password", "root").load()
val df_order_items = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "order_items").option("user", "root").option("password", "root").load()
val df_categories = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "categories").option("user", "root").option("password", "root").load()
val df_departments = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "departments").option("user", "root").option("password", "root").load()
val df_order_items_all = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "order_all").option("user", "root").option("password", "root").load()
val jeCustOrd=df_customers.col("customer_id")===df_orders.col("order_customer_id")
val jeOrdItem=df_orders.col("order_id")===df_order_items.col("order_item_order_id")
val jeProdOrdItem=df_products.col("product_id")===df_order_items.col("order_item_product_id")
val jeProdCat=df_products.col("product_category_id")===df_categories.col("category_id")
val jeCatDept=df_categories.col("category_department_id")===df_departments.col("department_id")
//Get all customers from TX who bought fitness items
df_customers.where("customer_state = 'TX'")
  .join(df_orders, jeCustOrd)
  .join(df_order_items, jeOrdItem)
  .join(df_products, jeProdOrdItem)
  .join(df_categories, jeProdCat)
  .join(df_departments, jeCatDept)
  .filter("department_name = 'Fitness'")
  .select("customer_id", "customer_fname", "customer_lname", "customer_street", "customer_city",
    "customer_state", "customer_zipcode", "order_id", "category_name", "department_name")
  .show(5)