I have 3 dataframes generated from 3 different processes. Every dataframe has columns of the same names. My dataframes look like this:
dataframe 1:
id  val1  val2  val3  val4
1   null  null  null  null
2   A2    A21   A31   A41

dataframe 2:
id  val1  val2  val3  val4
1   B1    B21   B31   B41
2   null  null  null  null

dataframe 3:
id  val1  val2  val3  val4
1   C1    C2    C3    C4
2   C11   C12   C13   C14
From these 3 dataframes, I want to create two dataframes (final and consolidated). For the final dataframe, the order of preference is: dataframe 1 > dataframe 2 > dataframe 3. If a result is present in dataframe 1 (val1 != null), I store that row in the final dataframe. My final result should be:
id  finalVal1  finalVal2  finalVal3  finalVal4
1   B1         B21        B31        B41
2   A2         A21        A31        A41
The consolidated dataframe should store the results from all 3 dataframes. How can I do this efficiently?
Answer 0 (score: 10)
If I understand correctly, for each row you want to find the first non-null values, looking first in the first table, then the second, then the third. You simply need to join the three tables on id and then use the coalesce function to get the first non-null element:
import org.apache.spark.sql.functions._

val df1 = sc.parallelize(Seq(
  (1, null, null, null, null),
  (2, "A2", "A21", "A31", "A41"))
).toDF("id", "val1", "val2", "val3", "val4")

val df2 = sc.parallelize(Seq(
  (1, "B1", "B21", "B31", "B41"),
  (2, null, null, null, null))
).toDF("id", "val1", "val2", "val3", "val4")

val df3 = sc.parallelize(Seq(
  (1, "C1", "C2", "C3", "C4"),
  (2, "C11", "C12", "C13", "C14"))
).toDF("id", "val1", "val2", "val3", "val4")

// join on id, then take the first non-null value per column in the
// preference order df1 > df2 > df3
// (note: inner joins assume every id is present in all three dataframes)
val finalDF = df1.join(df2, "id").join(df3, "id").select(
  df1("id"),
  coalesce(df1("val1"), df2("val1"), df3("val1")).as("finalVal1"),
  coalesce(df1("val2"), df2("val2"), df3("val2")).as("finalVal2"),
  coalesce(df1("val3"), df2("val3"), df3("val3")).as("finalVal3"),
  coalesce(df1("val4"), df2("val4"), df3("val4")).as("finalVal4")
)
This gives you the expected output for the final dataframe:
+---+---------+---------+---------+---------+
| id|finalVal1|finalVal2|finalVal3|finalVal4|
+---+---------+---------+---------+---------+
|  1|       B1|      B21|      B31|      B41|
|  2|       A2|      A21|      A31|      A41|
+---+---------+---------+---------+---------+
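For the consolidated dataframe the question also asks for, a minimal sketch (reusing df1/df2/df3 above; the "source" column is an assumption added here for provenance, not part of the question) could simply stack the three results:

// Sketch: union all three dataframes into one consolidated dataframe.
// The "source" column is an assumption, added only to record provenance.
val consolidated = df1.withColumn("source", lit(1))
  .unionAll(df2.withColumn("source", lit(2)))
  .unionAll(df3.withColumn("source", lit(3)))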
Answer 1 (score: 0)
EDIT: a new solution for rows that are only partially null. It avoids joins, instead using a window function and distinct.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class a(id: Int, val1: String, val2: String, val3: String, val4: String)

val df1 = sc.parallelize(List(
  a(1, null, null, null, null),
  a(2, "A2", "A21", "A31", "A41"),
  a(3, null, null, null, null))).toDF()
val df2 = sc.parallelize(List(
  a(1, "B1", null, "B31", "B41"),
  a(2, null, null, null, null),
  a(3, null, null, null, null))).toDF()
val df3 = sc.parallelize(List(
  a(1, "C1", "C2", "C3", "C4"),
  a(2, "C11", "C12", "C13", "C14"),
  a(3, "C11", "C12", "C13", "C14"))).toDF()

// true if at least one value column is non-null
val anyNotNull = df1.columns.tail.map(c => col(c).isNotNull).reduce(_ || _)

// consolidated: stack all rows that carry any data, tagging each row
// with its source (foo) so the preference order df1 > df2 > df3 survives
val consolidated = df1
  .filter(anyNotNull)
  .withColumn("foo", lit(1))
  .unionAll(df2.filter(anyNotNull).withColumn("foo", lit(2)))
  .unionAll(df3.filter(anyNotNull).withColumn("foo", lit(3)))

// per id, ordered by source preference, take the first non-null value per column
val w = Window.partitionBy('id).orderBy('foo)
val coalesced = col("id") +: df1.columns.tail.map(c => first(col(c), true).over(w).as(c))
val finalDF = consolidated.select(coalesced: _*).na.drop.distinct

scala> consolidated.select(coalesced: _*).show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
| 1| B1|null| B31| B41|
| 1| B1| C2| B31| B41|
| 3| C11| C12| C13| C14|
| 2| A2| A21| A31| A41|
| 2| A2| A21| A31| A41|
+---+----+----+----+----+
Dropping the rows that still contain nulls and removing duplicates (the na.drop.distinct in the definition of finalDF above) yields the final result:
scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
| 1| B1| C2| B31| B41|
| 3| C11| C12| C13| C14|
| 2| A2| A21| A31| A41|
+---+----+----+----+----+
Old solution:
If your rows are either completely null or completely non-null, you can do it as follows (edit: the advantage over the other solution is that you avoid the distinct).
The data:
case class a(id: Int, val1: String, val2: String, val3: String, val4: String)

val df1 = sc.parallelize(List(
  a(1, null, null, null, null),
  a(2, "A2", "A21", "A31", "A41"),
  a(3, null, null, null, null))).toDF()
val df2 = sc.parallelize(List(
  a(1, "B1", "B21", "B31", "B41"),
  a(2, null, null, null, null),
  a(3, null, null, null, null))).toDF()
val df3 = sc.parallelize(List(
  a(1, "C1", "C2", "C3", "C4"),
  a(2, "C11", "C12", "C13", "C14"),
  a(3, "C11", "C12", "C13", "C14"))).toDF()
The consolidation:
// drop fully-null rows, tag each remaining row with its source dataframe
val consolidated = df1.na.drop.withColumn("foo", lit(1))
  .unionAll(df2.na.drop.withColumn("foo", lit(2)))
  .unionAll(df3.na.drop.withColumn("foo", lit(3)))
scala> consolidated.show()
+---+----+----+----+----+---+
| id|val1|val2|val3|val4|foo|
+---+----+----+----+----+---+
| 2| A2| A21| A31| A41| 1|
| 1| B1| B21| B31| B41| 2|
| 1| C1| C2| C3| C4| 3|
| 2| C11| C12| C13| C14| 3|
| 3| C11| C12| C13| C14| 3|
+---+----+----+----+----+---+
The final:
// within each id, keep only the row from the most-preferred source
val w = Window.partitionBy('id).orderBy('foo)
val finalDF = consolidated
  .withColumn("foo2", rank().over(w))
  .filter('foo2 === 1)
  .drop("foo").drop("foo2")
scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
| 1| B1| B21| B31| B41|
| 3| C11| C12| C13| C14|
| 2| A2| A21| A31| A41|
+---+----+----+----+----+
Answer 2 (score: 0)
Below is an example of joining six tables/dataframes (without using SQL).
retail_db is a well-known sample database; anyone can get it from Google.
Problem: get all the customers from TX who bought Fitness items.
// one JDBC reader per table; the connection options are identical throughout
def readTable(table: String) = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/retail_db?useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", table)
  .option("user", "root")
  .option("password", "root")
  .load()

val df_customers = readTable("customers")
val df_products = readTable("products")
val df_orders = readTable("orders")
val df_order_items = readTable("order_items")
val df_categories = readTable("categories")
val df_departments = readTable("departments")
val df_order_items_all = readTable("order_all")
// join conditions between the six tables
val jeCustOrd = df_customers.col("customer_id") === df_orders.col("order_customer_id")
val jeOrdItem = df_orders.col("order_id") === df_order_items.col("order_item_order_id")
val jeProdOrdItem = df_products.col("product_id") === df_order_items.col("order_item_product_id")
val jeProdCat = df_products.col("product_category_id") === df_categories.col("category_id")
val jeCatDept = df_categories.col("category_department_id") === df_departments.col("department_id")
df_customers.where("customer_state = 'TX'")
  .join(df_orders, jeCustOrd)
  .join(df_order_items, jeOrdItem)
  .join(df_products, jeProdOrdItem)
  .join(df_categories, jeProdCat)
  .join(df_departments, jeCatDept)
  .filter("department_name = 'Fitness'")
  .select("customer_id", "customer_fname", "customer_lname",
    "customer_street", "customer_city", "customer_state", "customer_zipcode",
    "order_id", "category_name", "department_name")
  .show(5)
Answer 3 (score: -3)
If they come from three different tables, I would filter them on the server with pushdown filters, and join them together with the dataframe join function.
If they do not come from database tables, you can use the filter and map higher-order functions to the same effect, in parallel.
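As a minimal sketch of that idea (the JDBC URL, table names, and columns below are placeholders, not from the question), a filter applied before a join on a JDBC source is pushed down to the database server:

// Sketch only: placeholder connection and table names.
def readTable(table: String) = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/mydb")  // placeholder URL
  .option("dbtable", table)
  .load()

// Spark's JDBC source pushes simple predicates such as IS NOT NULL
// down to the server, so the filtering happens before data is transferred.
val t1 = readTable("table1").filter("val1 IS NOT NULL")
val t2 = readTable("table2").filter("val1 IS NOT NULL")

// then combine the pre-filtered results with a dataframe join
val merged = t1.join(t2, "id")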