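I have built two DataFrames. How can we join multiple Spark DataFrames?
For example:
PersonDf and ProfileDf have a common column, personId, which serves as the key.
Now, how can we combine PersonDf and ProfileDf into a single DataFrame?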
Answer 0 (score: 41)
You can use case classes to prepare a sample dataset...
This is optional; for example, you can also get a DataFrame from hiveContext.sql.
import org.apache.spark.sql.functions.col
case class Person(name: String, age: Int, personid : Int)
case class Profile(name: String, personid : Int , profileDescription: String)
val df1 = sqlContext.createDataFrame(
Person("Bindu",20, 2)
:: Person("Raphel",25, 5)
:: Person("Ram",40, 9):: Nil)
val df2 = sqlContext.createDataFrame(
Profile("Spark",2, "SparkSQLMaster")
:: Profile("Spark",5, "SparkGuru")
:: Profile("Spark",9, "DevHunter"):: Nil
)
// you can alias the DataFrames and refer to columns through the aliases to improve readability
val df_asPerson = df1.as("dfperson")
val df_asProfile = df2.as("dfprofile")
val joined_df = df_asPerson.join(
df_asProfile
, col("dfperson.personid") === col("dfprofile.personid")
, "inner")
joined_df.select(
col("dfperson.name")
, col("dfperson.age")
, col("dfprofile.name")
, col("dfprofile.profileDescription"))
.show
A sample temp-table approach, which I personally don't like...
df_asPerson.registerTempTable("dfperson");
df_asProfile.registerTempTable("dfprofile")
sqlContext.sql("""SELECT dfperson.name, dfperson.age, dfprofile.profileDescription
FROM dfperson JOIN dfprofile
ON dfperson.personid == dfprofile.personid""")
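Notes: 1) As @RaphaelRoth mentioned,
val resultDf = PersonDf.join(ProfileDf,Seq("personId"))
is a good approach, because with an inner join the shared key column is not duplicated from both sides. 2) A Spark 2.x example has been added in another answer, covering the full set of join operations supported by Spark 2.x, with examples and results.
Also, an important point about joins: the broadcast function can help to give a hint; please see my answer.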
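As a minimal sketch of the broadcast hint mentioned above (the sample rows and the local SparkSession are assumptions for illustration, not part of the original answer), `broadcast` from `org.apache.spark.sql.functions` marks the smaller DataFrame as a candidate for a broadcast join. It is only a hint, and the planner makes the final choice.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for PersonDf and ProfileDf.
val PersonDf = Seq(("Bindu", 20, 2), ("Raphel", 25, 5), ("Ram", 40, 9))
  .toDF("name", "age", "personId")
val ProfileDf = Seq((2, "SparkSQLMaster"), (5, "SparkGuru"), (9, "DevHunter"))
  .toDF("personId", "profileDescription")

// Hint that ProfileDf is small enough to be copied to every executor.
val resultDf = PersonDf.join(broadcast(ProfileDf), Seq("personId"))

// Inspect the physical plan to see whether a broadcast join was actually chosen.
resultDf.explain()
resultDf.show()
```

Checking the plan with `explain()` is the practical way to confirm whether the hint took effect on a given cluster and data size.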
Answer 1 (score: 16)
You can use
val resultDf = PersonDf.join(ProfileDf, PersonDf("personId") === ProfileDf("personId"))
or, shorter and more flexible (because you can easily specify more than one column to join on):
val resultDf = PersonDf.join(ProfileDf,Seq("personId"))
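The two forms differ in the columns they return. The following sketch, which uses hypothetical sample rows and a local SparkSession, compares the resulting schemas: the expression form keeps both personId columns, while the Seq form keeps a single personId column and places it first.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-columns-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for PersonDf and ProfileDf.
val PersonDf = Seq(("Bindu", 20, 2), ("Raphel", 25, 5)).toDF("name", "age", "personId")
val ProfileDf = Seq((2, "SparkSQLMaster"), (5, "SparkGuru")).toDF("personId", "profileDescription")

// Join on an expression: personId appears twice in the result.
val byExpr = PersonDf.join(ProfileDf, PersonDf("personId") === ProfileDf("personId"))
println(byExpr.columns.mkString(", "))

// Join on a column name: personId appears once, as the first column.
val byName = PersonDf.join(ProfileDf, Seq("personId"))
println(byName.columns.mkString(", "))
```

With the expression form, a later `select("personId")` is ambiguous, which is one reason the Seq form is often more convenient for equi-joins on identically named columns.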
Answer 2 (score: 3)
In addition to the answers above, I tried to demonstrate all the Spark joins with the same case classes using Spark 2.x; here is my LinkedIn article with full examples and explanation.
All join types: the default is inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
Results:
First example inner join
+---------------+---+-----------+------------------+
|           name|age|profileName|profileDescription|
+---------------+---+-----------+------------------+
|        Nataraj| 45|      Spark|    SparkSQLMaster|
|       Srinivas| 45|      Spark|         SparkGuru|
|          Ashik| 22|      Spark|         DevHunter|
|          Madhu| 22|      Spark|        Evangelist|
|         Meghna| 22|      Spark|    SparkSQLMaster|
|        Snigdha| 22|      Spark|    SparkSQLMaster|
|           Ravi| 42|      Spark|         Committer|
|            Ram| 42|      Spark|         DevHunter|
|Chidananda Raju| 35|      Spark|         DevHunter|
|Sreekanth Doddy| 29|      Spark|         DevHunter|
+---------------+---+-----------+------------------+
all joins in a loop
INNER JOIN
+--------+---------------+---+-----------+------------------+
|personid|           name|age|profileName|profileDescription|
+--------+---------------+---+-----------+------------------+
|       0|           Ravi| 42|      Spark|         Committer|
|       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
|       2|         Meghna| 22|      Spark|    SparkSQLMaster|
|       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
|       3|          Madhu| 22|      Spark|        Evangelist|
|       5|       Srinivas| 45|      Spark|         SparkGuru|
|       9|            Ram| 42|      Spark|         DevHunter|
|       9|          Ashik| 22|      Spark|         DevHunter|
|       9|Chidananda Raju| 35|      Spark|         DevHunter|
|       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
+--------+---------------+---+-----------+------------------+
OUTER JOIN
+--------+---------------+----+-----------+------------------+
|personid|           name| age|profileName|profileDescription|
+--------+---------------+----+-----------+------------------+
|       0|           Ravi|  42|      Spark|         Committer|
|       1|           null|null|      Spark|       All Rounder|
|       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
|       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
|       2|         Meghna|  22|      Spark|    SparkSQLMaster|
|       3|          Madhu|  22|      Spark|        Evangelist|
|       4|       Siddhika|  22|       null|              null|
|       5|       Srinivas|  45|      Spark|         SparkGuru|
|       6|       Harshita|  22|       null|              null|
|       8|      Deekshita|  22|       null|              null|
|       9|          Ashik|  22|      Spark|         DevHunter|
|       9|            Ram|  42|      Spark|         DevHunter|
|       9|Chidananda Raju|  35|      Spark|         DevHunter|
|       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
+--------+---------------+----+-----------+------------------+
FULL JOIN
+--------+---------------+----+-----------+------------------+
|personid|           name| age|profileName|profileDescription|
+--------+---------------+----+-----------+------------------+
|       0|           Ravi|  42|      Spark|         Committer|
|       1|           null|null|      Spark|       All Rounder|
|       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
|       2|         Meghna|  22|      Spark|    SparkSQLMaster|
|       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
|       3|          Madhu|  22|      Spark|        Evangelist|
|       4|       Siddhika|  22|       null|              null|
|       5|       Srinivas|  45|      Spark|         SparkGuru|
|       6|       Harshita|  22|       null|              null|
|       8|      Deekshita|  22|       null|              null|
|       9|          Ashik|  22|      Spark|         DevHunter|
|       9|            Ram|  42|      Spark|         DevHunter|
|       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
|       9|Chidananda Raju|  35|      Spark|         DevHunter|
+--------+---------------+----+-----------+------------------+
FULL_OUTER JOIN
+--------+---------------+----+-----------+------------------+
|personid|           name| age|profileName|profileDescription|
+--------+---------------+----+-----------+------------------+
|       0|           Ravi|  42|      Spark|         Committer|
|       1|           null|null|      Spark|       All Rounder|
|       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
|       2|         Meghna|  22|      Spark|    SparkSQLMaster|
|       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
|       3|          Madhu|  22|      Spark|        Evangelist|
|       4|       Siddhika|  22|       null|              null|
|       5|       Srinivas|  45|      Spark|         SparkGuru|
|       6|       Harshita|  22|       null|              null|
|       8|      Deekshita|  22|       null|              null|
|       9|          Ashik|  22|      Spark|         DevHunter|
|       9|            Ram|  42|      Spark|         DevHunter|
|       9|Chidananda Raju|  35|      Spark|         DevHunter|
|       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
+--------+---------------+----+-----------+------------------+
LEFT JOIN
+--------+---------------+---+-----------+------------------+
|personid|           name|age|profileName|profileDescription|
+--------+---------------+---+-----------+------------------+
|       0|           Ravi| 42|      Spark|         Committer|
|       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
|       2|         Meghna| 22|      Spark|    SparkSQLMaster|
|       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
|       3|          Madhu| 22|      Spark|        Evangelist|
|       4|       Siddhika| 22|       null|              null|
|       5|       Srinivas| 45|      Spark|         SparkGuru|
|       6|       Harshita| 22|       null|              null|
|       8|      Deekshita| 22|       null|              null|
|       9|            Ram| 42|      Spark|         DevHunter|
|       9|          Ashik| 22|      Spark|         DevHunter|
|       9|Chidananda Raju| 35|      Spark|         DevHunter|
|       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
+--------+---------------+---+-----------+------------------+
LEFT_OUTER JOIN
+--------+---------------+---+-----------+------------------+
|personid|           name|age|profileName|profileDescription|
+--------+---------------+---+-----------+------------------+
|       0|           Ravi| 42|      Spark|         Committer|
|       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
|       2|         Meghna| 22|      Spark|    SparkSQLMaster|
|       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
|       3|          Madhu| 22|      Spark|        Evangelist|
|       4|       Siddhika| 22|       null|              null|
|       5|       Srinivas| 45|      Spark|         SparkGuru|
|       6|       Harshita| 22|       null|              null|
|       8|      Deekshita| 22|       null|              null|
|       9|Chidananda Raju| 35|      Spark|         DevHunter|
|       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
|       9|          Ashik| 22|      Spark|         DevHunter|
|       9|            Ram| 42|      Spark|         DevHunter|
+--------+---------------+---+-----------+------------------+
RIGHT JOIN
+--------+---------------+----+-----------+------------------+
|personid|           name| age|profileName|profileDescription|
+--------+---------------+----+-----------+------------------+
|       0|           Ravi|  42|      Spark|         Committer|
|       1|           null|null|      Spark|       All Rounder|
|       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
|       2|         Meghna|  22|      Spark|    SparkSQLMaster|
|       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
|       3|          Madhu|  22|      Spark|        Evangelist|
|       5|       Srinivas|  45|      Spark|         SparkGuru|
|       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
|       9|Chidananda Raju|  35|      Spark|         DevHunter|
|       9|            Ram|  42|      Spark|         DevHunter|
|       9|          Ashik|  22|      Spark|         DevHunter|
+--------+---------------+----+-----------+------------------+
RIGHT_OUTER JOIN
+--------+---------------+----+-----------+------------------+
|personid|           name| age|profileName|profileDescription|
+--------+---------------+----+-----------+------------------+
|       0|           Ravi|  42|      Spark|         Committer|
|       1|           null|null|      Spark|       All Rounder|
|       2|         Meghna|  22|      Spark|    SparkSQLMaster|
|       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
|       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
|       3|          Madhu|  22|      Spark|        Evangelist|
|       5|       Srinivas|  45|      Spark|         SparkGuru|
|       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
|       9|          Ashik|  22|      Spark|         DevHunter|
|       9|Chidananda Raju|  35|      Spark|         DevHunter|
|       9|            Ram|  42|      Spark|         DevHunter|
+--------+---------------+----+-----------+------------------+
LEFT_SEMI JOIN
+--------+---------------+---+
|personid|           name|age|
+--------+---------------+---+
|       0|           Ravi| 42|
|       2|        Nataraj| 45|
|       2|         Meghna| 22|
|       2|        Snigdha| 22|
|       3|          Madhu| 22|
|       5|       Srinivas| 45|
|       9|Chidananda Raju| 35|
|       9|Sreekanth Doddy| 29|
|       9|            Ram| 42|
|       9|          Ashik| 22|
+--------+---------------+---+
LEFT_ANTI JOIN
+--------+---------+---+
|personid|     name|age|
+--------+---------+---+
|       4| Siddhika| 22|
|       6| Harshita| 22|
|       8|Deekshita| 22|
+--------+---------+---+
Till 1.x cross join is : df_asPerson.join(df_asProfile)
Explicit Cross Join in 2.x : http://blog.madhukaraphatak.com/migrating-to-spark-two-part-4/
Cartesian joins are very expensive without an extra filter that can be pushed down.
cross join or cartesian product
+---------------+---+--------+-----------+--------+------------------+
|name           |age|personid|profileName|personid|profileDescription|
+---------------+---+--------+-----------+--------+------------------+
|Nataraj        |45 |2       |Spark      |2       |SparkSQLMaster    |
|Nataraj        |45 |2       |Spark      |5       |SparkGuru         |
|Nataraj        |45 |2       |Spark      |9       |DevHunter         |
|Nataraj        |45 |2       |Spark      |3       |Evangelist        |
|Nataraj        |45 |2       |Spark      |0       |Committer         |
|Nataraj        |45 |2       |Spark      |1       |All Rounder       |
|Srinivas       |45 |5       |Spark      |2       |SparkSQLMaster    |
|Srinivas       |45 |5       |Spark      |5       |SparkGuru         |
|Srinivas       |45 |5       |Spark      |9       |DevHunter         |
|Srinivas       |45 |5       |Spark      |3       |Evangelist        |
|Srinivas       |45 |5       |Spark      |0       |Committer         |
|Srinivas       |45 |5       |Spark      |1       |All Rounder       |
|Ashik          |22 |9       |Spark      |2       |SparkSQLMaster    |
|Ashik          |22 |9       |Spark      |5       |SparkGuru         |
|Ashik          |22 |9       |Spark      |9       |DevHunter         |
|Ashik          |22 |9       |Spark      |3       |Evangelist        |
|Ashik          |22 |9       |Spark      |0       |Committer         |
|Ashik          |22 |9       |Spark      |1       |All Rounder       |
|Deekshita      |22 |8       |Spark      |2       |SparkSQLMaster    |
|Deekshita      |22 |8       |Spark      |5       |SparkGuru         |
|Deekshita      |22 |8       |Spark      |9       |DevHunter         |
|Deekshita      |22 |8       |Spark      |3       |Evangelist        |
|Deekshita      |22 |8       |Spark      |0       |Committer         |
|Deekshita      |22 |8       |Spark      |1       |All Rounder       |
|Siddhika       |22 |4       |Spark      |2       |SparkSQLMaster    |
|Siddhika       |22 |4       |Spark      |5       |SparkGuru         |
|Siddhika       |22 |4       |Spark      |9       |DevHunter         |
|Siddhika       |22 |4       |Spark      |3       |Evangelist        |
|Siddhika       |22 |4       |Spark      |0       |Committer         |
|Siddhika       |22 |4       |Spark      |1       |All Rounder       |
|Madhu          |22 |3       |Spark      |2       |SparkSQLMaster    |
|Madhu          |22 |3       |Spark      |5       |SparkGuru         |
|Madhu          |22 |3       |Spark      |9       |DevHunter         |
|Madhu          |22 |3       |Spark      |3       |Evangelist        |
|Madhu          |22 |3       |Spark      |0       |Committer         |
|Madhu          |22 |3       |Spark      |1       |All Rounder       |
|Meghna         |22 |2       |Spark      |2       |SparkSQLMaster    |
|Meghna         |22 |2       |Spark      |5       |SparkGuru         |
|Meghna         |22 |2       |Spark      |9       |DevHunter         |
|Meghna         |22 |2       |Spark      |3       |Evangelist        |
|Meghna         |22 |2       |Spark      |0       |Committer         |
|Meghna         |22 |2       |Spark      |1       |All Rounder       |
|Snigdha        |22 |2       |Spark      |2       |SparkSQLMaster    |
|Snigdha        |22 |2       |Spark      |5       |SparkGuru         |
|Snigdha        |22 |2       |Spark      |9       |DevHunter         |
|Snigdha        |22 |2       |Spark      |3       |Evangelist        |
|Snigdha        |22 |2       |Spark      |0       |Committer         |
|Snigdha        |22 |2       |Spark      |1       |All Rounder       |
|Harshita       |22 |6       |Spark      |2       |SparkSQLMaster    |
|Harshita       |22 |6       |Spark      |5       |SparkGuru         |
|Harshita       |22 |6       |Spark      |9       |DevHunter         |
|Harshita       |22 |6       |Spark      |3       |Evangelist        |
|Harshita       |22 |6       |Spark      |0       |Committer         |
|Harshita       |22 |6       |Spark      |1       |All Rounder       |
|Ravi           |42 |0       |Spark      |2       |SparkSQLMaster    |
|Ravi           |42 |0       |Spark      |5       |SparkGuru         |
|Ravi           |42 |0       |Spark      |9       |DevHunter         |
|Ravi           |42 |0       |Spark      |3       |Evangelist        |
|Ravi           |42 |0       |Spark      |0       |Committer         |
|Ravi           |42 |0       |Spark      |1       |All Rounder       |
|Ram            |42 |9       |Spark      |2       |SparkSQLMaster    |
|Ram            |42 |9       |Spark      |5       |SparkGuru         |
|Ram            |42 |9       |Spark      |9       |DevHunter         |
|Ram            |42 |9       |Spark      |3       |Evangelist        |
|Ram            |42 |9       |Spark      |0       |Committer         |
|Ram            |42 |9       |Spark      |1       |All Rounder       |
|Chidananda Raju|35 |9       |Spark      |2       |SparkSQLMaster    |
|Chidananda Raju|35 |9       |Spark      |5       |SparkGuru         |
|Chidananda Raju|35 |9       |Spark      |9       |DevHunter         |
|Chidananda Raju|35 |9       |Spark      |3       |Evangelist        |
|Chidananda Raju|35 |9       |Spark      |0       |Committer         |
|Chidananda Raju|35 |9       |Spark      |1       |All Rounder       |
|Sreekanth Doddy|29 |9       |Spark      |2       |SparkSQLMaster    |
|Sreekanth Doddy|29 |9       |Spark      |5       |SparkGuru         |
|Sreekanth Doddy|29 |9       |Spark      |9       |DevHunter         |
|Sreekanth Doddy|29 |9       |Spark      |3       |Evangelist        |
|Sreekanth Doddy|29 |9       |Spark      |0       |Committer         |
|Sreekanth Doddy|29 |9       |Spark      |1       |All Rounder       |
+---------------+---+--------+-----------+--------+------------------+
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Cross
:- LocalTableScan [name#0, age#1, personid#2]
+- BroadcastExchange IdentityBroadcastMode
   +- LocalTableScan [profileName#7, personid#8, profileDescription#9]
()
78
createOrReplaceTempView example
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
createOrReplaceTempView sql
SELECT dfperson.name
, dfperson.age
, dfprofile.profileDescription
FROM dfperson JOIN dfprofile
ON dfperson.personid == dfprofile.personid
+---------------+---+------------------+
|           name|age|profileDescription|
+---------------+---+------------------+
|        Nataraj| 45|    SparkSQLMaster|
|       Srinivas| 45|         SparkGuru|
|          Ashik| 22|         DevHunter|
|          Madhu| 22|        Evangelist|
|         Meghna| 22|    SparkSQLMaster|
|        Snigdha| 22|    SparkSQLMaster|
|           Ravi| 42|         Committer|
|            Ram| 42|         DevHunter|
|Chidananda Raju| 35|         DevHunter|
|Sreekanth Doddy| 29|         DevHunter|
+---------------+---+------------------+
**** EXCEPT DEMO ***
df_asPerson.except(df_asProfile)
Except demo
+---------------+---+--------+
|           name|age|personid|
+---------------+---+--------+
|          Ashik| 22|       9|
|       Harshita| 22|       6|
|          Madhu| 22|       3|
|            Ram| 42|       9|
|           Ravi| 42|       0|
|Chidananda Raju| 35|       9|
|       Siddhika| 22|       4|
|       Srinivas| 45|       5|
|Sreekanth Doddy| 29|       9|
|      Deekshita| 22|       8|
|         Meghna| 22|       2|
|        Snigdha| 22|       2|
|        Nataraj| 45|       2|
+---------------+---+--------+
df_asProfile.except(df_asPerson)
Except demo
+-----------+--------+------------------+
|profileName|personid|profileDescription|
+-----------+--------+------------------+
|      Spark|       5|         SparkGuru|
|      Spark|       9|         DevHunter|
|      Spark|       2|    SparkSQLMaster|
|      Spark|       3|        Evangelist|
|      Spark|       0|         Committer|
|      Spark|       1|       All Rounder|
+-----------+--------+------------------+
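The answer's own driver code is not included in this thread, so the following is only a sketch of how output of this shape could be produced. It assumes a small hypothetical subset of the sample data, a local SparkSession, and Spark 2.1 or later (for `crossJoin`); the names `personDf`, `profileDf`, and `joinTypes` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-types-sketch").getOrCreate()
import spark.implicits._

// Hypothetical subset of the sample data shown in the results.
val personDf = Seq(("Nataraj", 45, 2), ("Siddhika", 22, 4), ("Ram", 42, 9))
  .toDF("name", "age", "personid")
val profileDf = Seq(("Spark", 2, "SparkSQLMaster"), ("Spark", 1, "All Rounder"), ("Spark", 9, "DevHunter"))
  .toDF("profileName", "personid", "profileDescription")

// Join types that take a key column; cross is handled separately below.
val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer",
  "right", "right_outer", "left_semi", "left_anti")

// Run every join type on the shared key and record the row count of each result.
val counts = joinTypes.map { jt =>
  val joined = personDf.join(profileDf, Seq("personid"), jt)
  println(s"${jt.toUpperCase} JOIN")
  joined.orderBy("personid").show()
  jt -> joined.count()
}.toMap

// Explicit Cartesian product: every person row paired with every profile row.
val crossed = personDf.crossJoin(profileDf)
crossed.show(false)
```

Passing the key as `Seq("personid")` is what puts a single personid column first in each result, as in the looped tables, whereas the cross join keeps both personid columns.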
Answer 3 (score: 1)
From https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html, use join:
Inner equi-join with another DataFrame using the given column.
PersonDf.join(ProfileDf, "personId")
OR
PersonDf.join(ProfileDf,PersonDf("personId") === ProfileDf("personId"))
Update
You can also register the DataFrames as temp tables using df.registerTempTable("tableName") and then write SQL queries using sqlContext.
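In Spark 2.x, `registerTempTable` is deprecated in favour of `createOrReplaceTempView`. A minimal sketch of the same idea on Spark 2.x follows; the sample rows, the view names `person` and `profile`, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("temp-view-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for PersonDf and ProfileDf.
val PersonDf = Seq(("Bindu", 20, 2), ("Raphel", 25, 5), ("Ram", 40, 9))
  .toDF("name", "age", "personId")
val ProfileDf = Seq((2, "SparkSQLMaster"), (5, "SparkGuru"))
  .toDF("personId", "profileDescription")

// Register both DataFrames as session-scoped temporary views.
PersonDf.createOrReplaceTempView("person")
ProfileDf.createOrReplaceTempView("profile")

// Express the inner join in SQL against the views.
val viaSql = spark.sql(
  """SELECT p.name, p.age, pr.profileDescription
    |FROM person p JOIN profile pr
    |ON p.personId = pr.personId""".stripMargin)
viaSql.show()
```

The views live only as long as the SparkSession that created them, so nothing is written to a metastore.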
Answer 4 (score: 0)
One way
// join type can be inner, left, right, fullouter
val mergedDf = df1.join(df2, Seq("keyCol"), "inner")
// multiple key columns can be passed as separate elements of the Seq
val mergedDf = df1.join(df2, Seq("keyCol1", "keyCol2"), "left")
Another way
import spark.implicits._
val mergedDf = df1.as("d1").join(df2.as("d2"), ($"d1.colName" === $"d2.colName"))
// to select specific columns as output
val mergedDf = df1.as("d1").join(df2.as("d2"), ($"d1.colName" === $"d2.colName")).select($"d1.*", $"d2.anotherColName")
Answer 5 (score: 0)
Inner join with Scala
val joinedDataFrame = PersonDf.join(ProfileDf ,"personId")
joinedDataFrame.show
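Answer 6 (score: 0)
Posting a Java-based solution in case your team uses only Java. The keyword inner ensures that only matching rows are present in the final DataFrame.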
Dataset<Row> joined = PersonDf.join(ProfileDf,
PersonDf.col("personId").equalTo(ProfileDf.col("personId")),
"inner");
joined.show();
Answer 7 (score: 0)
Let me explain with an example.
Create the emp DataFrame
import spark.sqlContext.implicits._
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
  (2,"Rose",1,"2010","20","M",4000),
  (3,"Williams",1,"2010","10","M",1000),
  (4,"Jones",2,"2005","10","F",2000),
  (5,"Brown",2,"2010","40","",-1),
  (6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
  "emp_dept_id","gender","salary")
val empDF = emp.toDF(empColumns:_*)
Create the dept DataFrame
val dept = Seq(("Finance",10),
  ("Marketing",20),
  ("Sales",30),
  ("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
Now let's join emp.emp_dept_id with dept.dept_id
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
This yields the following result
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
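The employee with emp_id 6 is missing from the result because department 50 has no row in dept, and an inner join drops unmatched rows. As a hedged sketch (reduced, hypothetical versions of the two DataFrames, a local SparkSession, and an explicit cast that the original answer does not use), a left join keeps that employee with null department columns:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("left-join-sketch").getOrCreate()
import spark.implicits._

// Reduced, hypothetical versions of empDF and deptDF from the answer.
val empDF = Seq((5, "Brown", "40"), (6, "Brown", "50")).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq(("Sales", 30), ("IT", 40)).toDF("dept_name", "dept_id")

// emp_dept_id is a string and dept_id an integer, so the cast is made explicit here.
val joinCond = empDF("emp_dept_id").cast("int") === deptDF("dept_id")

// Inner join: only emp_id 5 has a matching department.
val innerDF = empDF.join(deptDF, joinCond, "inner")

// Left join: emp_id 6 is kept, with null dept_name and dept_id.
val leftDF = empDF.join(deptDF, joinCond, "left")
leftDF.show(false)
```

Filtering the left join on `dept_name` being null is a simple way to list employees whose department key has no match.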
If you are looking for a Python example, see PySpark Join; you can find the complete Scala example at Spark Join.